Integration with external parsers - Design decision #7
Comments
Yea that is something that has been on my mind as well, glad we have similar ideas.
After a bit of thinking I conceptualized something that could work: I think the detector model is alright, but it's not as robust as we need in order to integrate it with other projects. My idea was to split the architecture into Detectors and Validators. The former would be similar to what we have now, and the latter would test the file against an external parser/program (hence validating the file as format X). Detectors would be safe by definition and not sandboxed, since they should only check the file statically; they are in charge of figuring out which formats inhabit the file. Validators, on the other hand, must be sandboxed and check the validity of the file with an external tool. They could also be extended to do some sort of output checking, as seen in BGGP 2021. There seem to be some solutions for sandboxing in Rust and Unix environments in general, and we could and should try to make detectors based on this crate.
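To make the split concrete, here is a rough Rust sketch of what the two roles could look like. Every name in it (`Detector`, `Validator`, `PdfMagicDetector`, `CommandValidator`, `detect_formats`) is hypothetical and not existing code in this project. The validator also runs the external tool directly with `std::process::Command`; a real one would have to do this inside a sandbox, which is left out here.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Static check: looks at the bytes only and never executes anything.
pub trait Detector {
    fn format(&self) -> &'static str;
    fn detect(&self, data: &[u8]) -> bool;
}

/// Dynamic check: hands the file to an external parser/program.
/// Assumption: a real implementation would run this inside a sandbox.
pub trait Validator {
    fn format(&self) -> &'static str;
    fn validate(&self, path: &Path) -> io::Result<bool>;
}

/// Hypothetical detector: PDF magic bytes at offset 0.
pub struct PdfMagicDetector;

impl Detector for PdfMagicDetector {
    fn format(&self) -> &'static str {
        "pdf"
    }

    fn detect(&self, data: &[u8]) -> bool {
        data.starts_with(b"%PDF-")
    }
}

/// Hypothetical validator: runs `program args... <path>` and treats
/// exit status 0 as "the external tool accepted the file".
pub struct CommandValidator {
    pub name: &'static str,
    pub program: String,
    pub args: Vec<String>,
}

impl Validator for CommandValidator {
    fn format(&self) -> &'static str {
        self.name
    }

    fn validate(&self, path: &Path) -> io::Result<bool> {
        let status = Command::new(&self.program)
            .args(&self.args)
            .arg(path)
            .status()?;
        Ok(status.success())
    }
}

/// Collects the formats that at least one detector claims are present.
pub fn detect_formats(detectors: &[Box<dyn Detector>], data: &[u8]) -> Vec<&'static str> {
    detectors
        .iter()
        .filter(|d| d.detect(data))
        .map(|d| d.format())
        .collect()
}
```

The idea would be to run all Detectors first (cheap and safe), and only pass the file to the Validators for the formats that were detected.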
Sounds good!
Hmm, maybe we can have different flags like
In real PDF applications (such as Adobe Reader, Firefox's PDF parser, Chromium's PDF parser, etc.) there are many behaviors that are actually more permissive than the standard.
When we initially discussed this, I thought using `infer`'s checks for verifying file formats would not be enough: it doesn't go deep into the file format but typically just checks the first magic bytes, so we'd miss the deeper checks and get a bunch of false polyglots.
As it turns out, a lot of software out there does some kind of content sniffing. This means it scans the first bytes of the file (e.g. the first 1024) for the magic bytes and starts parsing from there! For reference, see this entry from BGGP 2021:
https://github.com/netspooky/BGGP/tree/main/2021/fliermate/BGGP2021
This behavior opens up a lot of space for polyglots that are not possible according to the official file format standards.
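A minimal Rust sketch of the difference between a strict magic-byte check and the sniffing behavior described above. The function names and the `SNIFF_WINDOW` constant are made up for illustration; the 1024-byte window is just the example figure mentioned above, and real parsers may use a different window or different rules. In this sketch the whole magic has to fit inside the window.

```rust
/// Assumed sniffing window, taken from the example figure in the discussion.
const SNIFF_WINDOW: usize = 1024;

/// PDF header magic.
const PDF_MAGIC: &[u8] = b"%PDF-";

/// Strict check: the magic must sit at offset 0.
pub fn strict_pdf(data: &[u8]) -> bool {
    data.starts_with(PDF_MAGIC)
}

/// Lenient check: returns the offset of the first magic that lies
/// entirely within the first SNIFF_WINDOW bytes, if there is one.
pub fn sniff_pdf(data: &[u8]) -> Option<usize> {
    let end = data.len().min(SNIFF_WINDOW);
    data[..end]
        .windows(PDF_MAGIC.len())
        .position(|w| w == PDF_MAGIC)
}
```

A file that starts with another format's header and only has `%PDF-` some bytes later fails `strict_pdf` but is still found by `sniff_pdf`, which is exactly the gap that lets these extra polyglots exist.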
If we want to support this, we may want to integrate the original parsers. I've started looking into Acrobat's PDFL library. Integrating with the Chromium and Firefox parsers will probably be easy; maybe they even have their own Rust wrappers (although we should verify that the browser isn't doing extra stuff around the parser).
This is relevant to all parsers, and is a design decision, because essentially it means integrating with a large amount of external software. I think if we move in this direction, it is best to do so gradually. Another point to consider is we might wanna sandbox execution for these potentially vulnerable parsers (#6).
@konata-chan404 I'd like your opinion on this :)