Integration with external parsers - Design decision #7

eternaleclipse · 2023-06-13T13:23:15Z

In real PDF applications (such as Adobe Reader, Firefox's PDF parser, Chromium's PDF parser, etc.) there are many behaviors that are actually more permissive than the standard.

When we initially discussed this, I thought using infer's checks for verifying file formats would not be enough: it doesn't go deep into the file format and typically just checks the first magic bytes, so we'd miss the deeper checks and get a bunch of false polyglots.

As it turns out, a lot of software out there does some kind of content sniffing. This means it scans the first N bytes of the file (e.g. 1024) for the magic bytes and starts parsing from there! For reference, see this entry from BGGP 2021:
https://github.com/netspooky/BGGP/tree/main/2021/fliermate/BGGP2021
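To make the difference concrete, here is a minimal sketch contrasting a strict offset-0 magic check with a sniffing-style scan. The function names, the 1024-byte window, and the `%PDF-` magic are illustrative assumptions, not how infer or any particular reader actually implements it.

```rust
/// Hypothetical strict check: the magic must sit at offset 0.
fn has_magic_at_start(data: &[u8], magic: &[u8]) -> bool {
    data.starts_with(magic)
}

/// Hypothetical sniffing check: returns the offset of the first occurrence
/// of `magic` that starts within the first `window` bytes of `data`.
fn sniff_magic(data: &[u8], magic: &[u8], window: usize) -> Option<usize> {
    if window == 0 || magic.is_empty() || data.len() < magic.len() {
        return None;
    }
    // Last start offset that is both inside the window and leaves room for the magic.
    let last_start = (data.len() - magic.len()).min(window - 1);
    (0..=last_start).find(|&i| data[i..].starts_with(magic))
}

fn main() {
    // 100 bytes of unrelated data followed by a PDF header.
    let mut file = vec![0u8; 100];
    file.extend_from_slice(b"%PDF-1.7\n");

    // A strict detector rejects the file, a sniffing parser would accept it.
    println!("strict:  {}", has_magic_at_start(&file, b"%PDF-"));
    println!("sniffed: {:?}", sniff_magic(&file, b"%PDF-", 1024));
}
```

A file like this is rejected by the strict check but accepted by the sniffing one, which is exactly the gap that lets extra polyglots through.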

This behavior opens up a lot of space for polyglots that are not possible according to the official file format standards.
If we want to support this, we may want to integrate the original parsers. I've started looking into Acrobat's PDFL library. Integrating with the Chromium and Firefox parsers will probably be easy; maybe they even have their own Rust wrappers (although we should verify that the browser isn't doing extra stuff around them).

This is relevant to all parsers, and is a design decision, because essentially it means integrating with a large amount of external software. I think if we move in this direction, it is best to do so gradually. Another point to consider is we might wanna sandbox execution for these potentially vulnerable parsers (#6).

@konata-chan404 I'd like your opinion on this :)


yael333 · 2023-06-13T16:49:54Z

Yea that is something that has been on my mind as well, glad we have similar ideas.
I would be very down to integrate more complex parsers, code environments and tools to make this project much more robust even if it costs us in complexity. I think ideally figuring out sandboxing with the current design, and seeing how external parsers fit in would be awesome :3

yael333 · 2023-06-13T21:49:50Z

After a bit of thinking I conceptualized something that could work:

I think the detector model is alright, but it's not as robust as we need in order to integrate it with other projects. My idea was to split the architecture into Detectors and Validators - the former would be similar to what we have now, and the latter would test the file type against an external parser/program (hence validating the file as format X).

The Detectors would be by definition safe and not sandboxed, as they should just statically check the file - they are in charge of figuring out what formats inhabit the file. Validators, on the other hand, must be sandboxed and check the validity of the file with an external tool; they could also be extended to do some sort of output checking, as in BGGP 2021.
I'm not sure how it would go with the original vision. I think the user will give the files and the detectors will try to detect the file formats, validate them and reward the user based on the results - while in some modes we could skip the detectors and check against validators directly (if the user supplies file formats for accuracy, challenges such as a polyglot PDF + ELF, etc.).
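As a rough sketch of how that split could look in Rust: every name here (`Format`, `Detector`, `Validator`, `MagicDetector`, `StubValidator`, `confirmed_formats`) is hypothetical and not taken from the current codebase. The validator is a stub. A real one would hand the file to an external parser inside a sandbox.

```rust
/// Formats used for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Pdf,
    Elf,
}

/// Static, in-process check; assumed safe to run without a sandbox.
trait Detector {
    fn format(&self) -> Format;
    fn detect(&self, data: &[u8]) -> bool;
}

/// Check backed by an external parser/program; assumed to run sandboxed.
trait Validator {
    fn format(&self) -> Format;
    fn validate(&self, data: &[u8]) -> Result<bool, String>;
}

/// Simplest possible detector: magic bytes at offset 0.
struct MagicDetector {
    format: Format,
    magic: &'static [u8],
}

impl Detector for MagicDetector {
    fn format(&self) -> Format {
        self.format
    }
    fn detect(&self, data: &[u8]) -> bool {
        data.starts_with(self.magic)
    }
}

/// Placeholder validator that accepts any non-empty input for its format.
struct StubValidator(Format);

impl Validator for StubValidator {
    fn format(&self) -> Format {
        self.0
    }
    fn validate(&self, data: &[u8]) -> Result<bool, String> {
        Ok(!data.is_empty())
    }
}

/// Detect first, then keep only the formats a matching validator confirms.
fn confirmed_formats(
    detectors: &[Box<dyn Detector>],
    validators: &[Box<dyn Validator>],
    data: &[u8],
) -> Vec<Format> {
    detectors
        .iter()
        .filter(|d| d.detect(data))
        .map(|d| d.format())
        .filter(|f| {
            validators
                .iter()
                .any(|v| v.format() == *f && v.validate(data).unwrap_or(false))
        })
        .collect()
}

fn main() {
    let detectors: Vec<Box<dyn Detector>> = vec![
        Box::new(MagicDetector { format: Format::Pdf, magic: b"%PDF-" }),
        Box::new(MagicDetector { format: Format::Elf, magic: b"\x7fELF" }),
    ];
    let validators: Vec<Box<dyn Validator>> = vec![Box::new(StubValidator(Format::Pdf))];
    println!("{:?}", confirmed_formats(&detectors, &validators, b"%PDF-1.7\n"));
}
```

In a validators-only mode, the detector pass would be skipped and the user-supplied formats would go straight to the matching validators.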

There seem to be some solutions for sandboxing in Rust and Unix environments in general, and we could and should try to make detectors based on this crate.

eternaleclipse · 2023-06-20T15:30:49Z

Sounds good!

> I'm not sure how it would go with the original vision [..]

Hmm, maybe we can have different flags like -v and -d for using the different types of checkers.
IMO in an actual contest we would probably want as much real world stuff as possible, so ideally validators only.
Still, detectors are useful for sanity checks, and since they are relatively easy to add, they can also serve as placeholders before we add a validator that uses real-world software.
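Something like this could handle the flags, using only the standard library. The `Mode` enum, the `parse_mode` helper, and the choice to run both kinds of checkers when neither flag is given are assumptions for the sake of the sketch.

```rust
/// Hypothetical checker selection derived from the command line.
#[derive(Debug, PartialEq, Eq)]
enum Mode {
    Detectors,
    Validators,
    Both,
}

/// Hypothetical parser: `-d` selects detectors, `-v` selects validators.
/// Assumption: with neither flag (or both), every checker runs.
fn parse_mode<I: IntoIterator<Item = String>>(args: I) -> Mode {
    let (mut detectors, mut validators) = (false, false);
    for arg in args {
        match arg.as_str() {
            "-d" => detectors = true,
            "-v" => validators = true,
            _ => {} // other arguments (e.g. file paths) are ignored here
        }
    }
    match (detectors, validators) {
        (true, false) => Mode::Detectors,
        (false, true) => Mode::Validators,
        _ => Mode::Both,
    }
}

fn main() {
    // Skip argv[0], which is the program name.
    let mode = parse_mode(std::env::args().skip(1));
    println!("{:?}", mode);
}
```

A contest setup would then pass `-v` only, while `-d` stays handy for quick sanity checks.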

eternaleclipse closed this as completed Jun 13, 2023

eternaleclipse reopened this Jun 13, 2023

eternaleclipse added the enhancement and design labels Jun 13, 2023
