Integration with external parsers - Design decision #7

Open
eternaleclipse opened this issue Jun 13, 2023 · 3 comments
Labels
design enhancement New feature or request

Comments

@eternaleclipse
Collaborator

In real PDF applications (such as Adobe Reader, Firefox's PDF parser, Chromium's PDF parser, etc.), many behaviors are actually more permissive than the standard.

When we initially discussed this, I thought using infer's checks for verifying file formats would not be enough: it typically just checks the first magic bytes rather than going deeper into the file format, so we'd miss the deeper checks and get a bunch of false polyglots.

As it turns out, a lot of software out there does some kind of content sniffing. This means it scans the first N bytes (e.g. 1024) of the file for the magic bytes and starts parsing from there! For reference, see this entry from BGGP 2021:
https://github.com/netspooky/BGGP/tree/main/2021/fliermate/BGGP2021
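
To make the difference concrete, here is a minimal Rust sketch of a strict header check next to a sniffing check. The function names and the `SNIFF_WINDOW` constant are made up for illustration; the 1024-byte window is just the example figure from above, and real parsers may use a different window or additional rules.

```rust
/// Magic bytes that open a PDF header, e.g. `%PDF-1.7`.
const PDF_MAGIC: &[u8] = b"%PDF-";

/// Assumed sniffing window (the example figure from the discussion);
/// real parsers may look further or not as far.
const SNIFF_WINDOW: usize = 1024;

/// Strict check: the magic must be the very first bytes of the file.
fn is_pdf_strict(data: &[u8]) -> bool {
    data.starts_with(PDF_MAGIC)
}

/// Lenient check: look for the magic anywhere inside the first
/// `SNIFF_WINDOW` bytes and return the offset where it starts, if any.
fn sniff_pdf_offset(data: &[u8]) -> Option<usize> {
    let end = data.len().min(SNIFF_WINDOW);
    data[..end]
        .windows(PDF_MAGIC.len())
        .position(|window| window == PDF_MAGIC)
}
```

For a buffer that starts with the ELF magic and carries `%PDF-` at offset 64, `is_pdf_strict` returns `false` while `sniff_pdf_offset` returns `Some(64)`, which is exactly the gap a polyglot can exploit.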

This behavior opens up a lot of space for polyglots that are not possible according to the official file format standards.
If we want to support this, we may want to integrate the original parsers. I've started looking into Acrobat's PDFL library. Integrating with the Chromium and Firefox parsers will probably be easy; maybe they even have their own Rust wrappers (although we should verify that the browser is not doing extra stuff around them).

This is relevant to all parsers, and is a design decision, because essentially it means integrating with a large amount of external software. I think if we move in this direction, it is best to do so gradually. Another point to consider is we might wanna sandbox execution for these potentially vulnerable parsers (#6).

@konata-chan404 I'd like your opinion on this :)

@yael333
Owner

yael333 commented Jun 13, 2023

Yea that is something that has been on my mind as well, glad we have similar ideas.
I would be very down to integrate more complex parsers, code environments and tools to make this project much more robust even if it costs us in complexity. I think ideally figuring out sandboxing with the current design, and seeing how external parsers fit in would be awesome :3

@yael333
Owner

yael333 commented Jun 13, 2023

After a bit of thinking I conceptualized something that could work:

I think the detector model is alright, but it's not as robust as we need in order to integrate it with other projects. My idea is to split the architecture into Detectors and Validators: the former would be similar to what we have now, and the latter would test the file against an external parser/program (hence validating the file as format X).

The Detectors would be by definition safe and not sandboxed, as they should just statically check the file; they are in charge of figuring out which formats inhabit the file. Validators, on the other hand, must be sandboxed and check the validity of the file with an external tool. They could also be extended to some sort of output checking, as in BGGP 2021.
I'm not sure how it would go with the original vision. I think the user will give the files, the detectors will try to detect the file formats, the validators will validate them, and we reward the user based on the results. In some modes we could skip the detectors and check against validators directly (if the user supplies file formats for accuracy, for challenges such as a polyglot PDF + ELF, etc.).
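
A rough Rust sketch of how that split could look. Every name here (`Format`, `Detector`, `Validator`, `MagicDetector`, `StubValidator`, and the two helper functions) is hypothetical and not taken from the current codebase, and the stub validator only stands in for a real sandboxed call to an external parser.

```rust
/// Formats a checker can report. The variants are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Pdf,
    Elf,
}

/// Static check on the raw bytes; meant to be safe without a sandbox.
trait Detector {
    fn format(&self) -> Format;
    fn detect(&self, data: &[u8]) -> bool;
}

/// Check backed by an external parser; a real implementation is
/// expected to run that parser inside a sandbox.
trait Validator {
    fn format(&self) -> Format;
    fn validate(&self, data: &[u8]) -> Result<bool, String>;
}

/// Placeholder detector: reports a format if its magic appears anywhere.
struct MagicDetector {
    format: Format,
    magic: &'static [u8],
}

impl Detector for MagicDetector {
    fn format(&self) -> Format {
        self.format
    }
    fn detect(&self, data: &[u8]) -> bool {
        data.windows(self.magic.len()).any(|w| w == self.magic)
    }
}

/// Stand-in validator: a plain function plays the role of the external tool.
struct StubValidator {
    format: Format,
    accepts: fn(&[u8]) -> bool,
}

impl Validator for StubValidator {
    fn format(&self) -> Format {
        self.format
    }
    fn validate(&self, data: &[u8]) -> Result<bool, String> {
        Ok((self.accepts)(data))
    }
}

/// Step 1: collect every format some detector claims to see.
fn detect_formats(data: &[u8], detectors: &[Box<dyn Detector>]) -> Vec<Format> {
    detectors
        .iter()
        .filter(|d| d.detect(data))
        .map(|d| d.format())
        .collect()
}

/// Step 2: keep only the candidates that a matching validator accepts.
/// Passing user-supplied formats as `candidates` skips the detectors.
fn validate_formats(
    data: &[u8],
    candidates: &[Format],
    validators: &[Box<dyn Validator>],
) -> Vec<Format> {
    candidates
        .iter()
        .copied()
        .filter(|f| {
            validators
                .iter()
                .any(|v| v.format() == *f && matches!(v.validate(data), Ok(true)))
        })
        .collect()
}
```

In the default flow the output of `detect_formats` would be fed into `validate_formats`; in a challenge mode the user-declared formats would be passed as `candidates` directly.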

There seem to be some solutions for sandboxing in Rust and Unix environments in general, and we could and should try to make detectors based on this crate.

@eternaleclipse
Collaborator Author

eternaleclipse commented Jun 20, 2023

Sounds good!

> I'm not sure how it would go with the original vision [..]

Hmm, maybe we can have different flags like -v and -d for using the different types of checkers.
IMO in an actual contest we would probably want as much real-world stuff as possible, so ideally validators only.
Still, detectors are useful for sanity checks, and since they are relatively easy to add, they can also serve as placeholders before we add a validator that uses real-world software.
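
A small sketch of what the flag handling could look like, using only the standard library. The `Mode` enum and `parse_mode` function are hypothetical, and treating "no flag" or "both flags" as "run detectors, then validators" is an assumption rather than something decided in this thread.

```rust
/// Which kinds of checkers to run.
#[derive(Debug, PartialEq, Eq)]
enum Mode {
    DetectorsOnly,
    ValidatorsOnly,
    Both,
}

/// Map `-d` / `-v` flags to a mode. Arguments that do not start with `-`
/// (for example file paths) are ignored here.
fn parse_mode<I, S>(args: I) -> Result<Mode, String>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let (mut detectors, mut validators) = (false, false);
    for arg in args {
        match arg.as_ref() {
            "-d" => detectors = true,
            "-v" => validators = true,
            other if other.starts_with('-') => {
                return Err(format!("unknown flag: {}", other));
            }
            _ => {}
        }
    }
    Ok(match (detectors, validators) {
        (true, false) => Mode::DetectorsOnly,
        (false, true) => Mode::ValidatorsOnly,
        // Assumed default: no flag or both flags runs the full pipeline.
        _ => Mode::Both,
    })
}
```

In a real binary this would be called with something like `std::env::args().skip(1)`; a contest build could simply hard-code `Mode::ValidatorsOnly`.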


2 participants