Add support for ZIP decompression using @tokenizer/inflate
#695
Conversation
This is a very good improvement, and I suspect, as you pointed out earlier, that it aligns more closely with the OOXML spec than my PR (which I will delete). However, a few things I noticed in my testing:
I think this is due to the hard-coded 4096-byte buffer size in the deflate module, which means it loops through and grabs only 4 kB at a time. I tried testing this with a 17 MB pptx file and it timed out after 5 minutes. It appears I was able to fix this by basically downloading the whole thing, which may or may not be desirable; I would like to hear your thoughts. Adding this right before going into…
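A hedged sketch of that kind of workaround, assuming @tokenizer/s3 exposes the `disableChunked` option; the bucket and key names are placeholders:

```js
import {S3Client} from '@aws-sdk/client-s3';
import {makeTokenizer} from '@tokenizer/s3';
import {fileTypeFromTokenizer} from 'file-type';

const s3 = new S3Client({});

// Assumed option: disableChunked downloads the whole object up front
// instead of issuing many small ranged (chunked) reads.
const tokenizer = await makeTokenizer(s3, {
	Bucket: 'my-bucket', // placeholder
	Key: 'slides.pptx', // placeholder
}, {
	disableChunked: true,
});

const fileType = await fileTypeFromTokenizer(tokenizer);
console.log(fileType);
// e.g. {ext: 'pptx', mime: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}
```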
That might have been an oversight, as you removed the try/catch that was in there before. The ZIP file I tested with threw an error; this did not happen in the previous version.
force-pushed from 36ab77a to a20f112
The primary goal of this PR was to enhance the identification of OOXML files, not directly to address the performance issue mentioned in #688. However, I've made optimizations in @tokenizer/inflate, increasing the buffer size to 256 kB and reusing the buffer. These changes should improve performance to some extent, though it may still not be optimal for all scenarios.
To significantly improve performance in cases where random access to the file is possible, leveraging that random access is key. I'm considering extending the tokenizer interface to support random access where the underlying medium allows it. That said, when working with S3, latency can become a bottleneck. Many small reads might actually perform worse than fewer large reads. Striking the right balance will depend on the specific use case and the nature of the data access patterns.
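To make that concrete: a ZIP's Central Directory is found via the End Of Central Directory (EOCD) record near the end of the file, so a reader with random access can jump straight to it instead of scanning from the start. A minimal sketch in plain Node.js (not the actual tokenizer implementation; ZIP64 is ignored):

```js
import {open} from 'node:fs/promises';

// The EOCD record is 22 bytes plus an optional comment of up to 65535 bytes,
// so it always sits within the last 65557 bytes of the file.
const EOCD_SIGNATURE = 0x06054b50; // "PK\x05\x06", little-endian

async function readCentralDirectoryOffset(path) {
	const handle = await open(path, 'r');
	try {
		const {size} = await handle.stat();
		const tailLength = Math.min(size, 22 + 65535);
		const tail = Buffer.alloc(tailLength);
		await handle.read(tail, 0, tailLength, size - tailLength);

		// Scan backwards for the EOCD signature
		for (let i = tailLength - 22; i >= 0; i--) {
			if (tail.readUInt32LE(i) === EOCD_SIGNATURE) {
				return tail.readUInt32LE(i + 16); // Central Directory offset (EOCD bytes 16–19)
			}
		}

		return undefined; // Not a ZIP, or ZIP64-only
	} finally {
		await handle.close();
	}
}
```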
force-pushed from a20f112 to 578c97f
(@tokenizer/deflate was renamed to @tokenizer/inflate)
@scottdoc, do you mind testing this branch another time and assessing the parsing speed using @tokenizer/s3? It now utilizes the Central Directory section of the ZIP.
force-pushed from 578c97f to 505995b
I expect it now resolves #688 as well. Ready for review.
Thanks for the update, @Borewit. First off, to test I worked out that I needed to use the latest version of @tokenizer/s3 (0.5.1), which I did; prior to that I was getting an error. I have a few files that I was testing: three ZIP files of sizes 2.8 kB, 13 MB, and 400 MB, plus a 17 MB pptx file. Using…
Oddly (at least to me), using the stream tokenizer was faster. Using…
So it looks pretty good, except there still seems to be an error for the smallest ZIP file. That one is just a text file (.csv) that was then zipped; the others are zipped folders, if that makes any difference. For reference, those files (the larger ZIPs and the pptx) were working with the old version, just taking much longer.
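For anyone repeating this comparison, a rough timing harness along these lines should do. This is a hedged sketch: the bucket and object names are placeholders, and it assumes the `disableChunked` option of @tokenizer/s3:

```js
import {S3Client} from '@aws-sdk/client-s3';
import {makeTokenizer} from '@tokenizer/s3';
import {fileTypeFromTokenizer} from 'file-type';

const s3 = new S3Client({});

// Hypothetical helper: time file-type detection for one object,
// either with chunked (ranged) transfer or a single full download.
async function timeDetection(key, disableChunked) {
	const label = `${key} (disableChunked=${disableChunked})`;
	console.time(label);
	const tokenizer = await makeTokenizer(s3, {Bucket: 'test-bucket', Key: key}, {disableChunked});
	const fileType = await fileTypeFromTokenizer(tokenizer);
	console.timeEnd(label);
	return fileType;
}

for (const key of ['small.zip', 'medium.zip', 'large.zip', 'slides.pptx']) {
	await timeDetection(key, false); // chunked (ranged) transfer
	await timeDetection(key, true); // single full download
}
```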
Thank you for your detailed review, @scottdoc!
That's correct! The random access capability is a recent addition I implemented across all dependent and related packages. This feature allows efficient reading of the Central Directory at the end of files. One major benefit was eliminating an old "hack" in music-metadata, which previously bypassed strtok3 to directly read ID3 and APE headers at the end of files. This update enabled strtok3 to fully abstract media access.

Regarding the performance: chunked transfer seems to be faster for specific ZIP files, such as OOXML documents, as we only extract the required file to determine the type. However, for generic ZIP files, the performance is slightly reduced, since we must continue scanning for file signatures. I'll take another look at optimizing chunked transfer, particularly for the 400 MB ZIP file.

Could you please share a copy of the failing ZIP file so I can investigate the error further?
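Conceptually, the dispatch looks roughly like this (a hedged sketch, assuming the tokenizer exposes `supportsRandomAccess()` and a known `fileInfo.size`; `parseViaCentralDirectory` and `parseViaLocalHeaders` are hypothetical helpers):

```js
// Hedged sketch: choose a ZIP parsing strategy based on the tokenizer's
// capabilities. supportsRandomAccess() is assumed from recent strtok3;
// the two helper functions are hypothetical.
async function parseZip(tokenizer) {
	if (tokenizer.supportsRandomAccess() && tokenizer.fileInfo.size !== undefined) {
		// Random access: locate the EOCD record at the tail of the file,
		// then read the Central Directory directly (see the EOCD sketch above).
		return parseViaCentralDirectory(tokenizer);
	}

	// Streaming fallback: scan Local File Headers ("PK\x03\x04") from the start.
	return parseViaLocalHeaders(tokenizer);
}
```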
Ok, I'll attach a zip file that is failing. It's a very simple one. This is the error I see when using…
And this is what I see if I use…
I cannot reproduce the issue, @scottdoc.
Don't waste time on further testing, @scottdoc; I will get back with a better implementation.
No worries. Though to answer your question, yes, those are the versions of…
force-pushed from e0e30f8 to fcf89fd
@scottdoc, do you mind repeating the tests with @tokenizer/inflate 0.2.5? I hope it now parses that 400 MB file from S3, in chunked transfer mode, much faster.
Oh wow! Yeah, the…
Changes:
- Extract the MIME type from the zipped `[Content_Types].xml` in Office Open XML (OOXML) documents
- Add dependency `@tokenizer/inflate`, used to decompress data from the tokenizer stream
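To make the first change concrete: the main part declared in `[Content_Types].xml` is what distinguishes the OOXML flavours. A hedged sketch of the mapping idea (not the actual file-type implementation; the content-type strings come from the OOXML specification):

```js
// Map the main content type declared in [Content_Types].xml to an extension.
function ooxmlTypeFromContentTypes(xml) {
	if (xml.includes('wordprocessingml.document.main+xml')) {
		return {ext: 'docx', mime: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'};
	}

	if (xml.includes('presentationml.presentation.main+xml')) {
		return {ext: 'pptx', mime: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'};
	}

	if (xml.includes('spreadsheetml.sheet.main+xml')) {
		return {ext: 'xlsx', mime: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'};
	}

	return undefined; // Not a recognized OOXML main part
}
```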
force-pushed from fcf89fd to 250553d
@sindresorhus, how do you feel about moving formats which are based on ZIP (except the ZIP archive itself) to a file-type plugin module/project? If you support that, I will create an alternative PR providing the same functionality as this one, but moving ZIP-based formats (except ZIP itself, as that is a simple signature check) to the external plugin.
I prefer to keep it here. Some ZIP-based formats are popular, so most consumers will need them anyway. We can consider moving some obscure formats out to a plugin in the future.
Makes perfect sense to keep the most popular formats in file-type, but keeping the core lightweight also has advantages for some use cases. I just created an extension (plugin) for file-type to support XML-based formats: @file-type/xml. I had a similar solution for ZIP-based formats in mind. In theory (although I certainly have no plans to do this), something similar could be done with the Compound File Binary Format (the old Microsoft Office document formats).
Changes:
- Extract the MIME type from the zipped `[Content_Types].xml` in Office Open XML (OOXML) documents
- Add dependency `@tokenizer/inflate`, used to decompress data from the tokenizer stream
Considerations:
Initially, I explored creating a separate file-type plugin to detect Office Open XML (OOXML) documents. However, since OOXML documents are ZIP archives, proper detection requires decompressing the archive, which introduces additional dependencies. These dependencies may not be desirable for users uninterested in detecting OOXML documents.
While it might make sense to separate all ZIP-based format detection into a dedicated module, this would require significant architectural changes. Specifically, consuming the underlying tokenizer stream while retaining compatibility with file-type would be challenging to implement as an independent plugin.
As a starting point, this PR integrates decompression directly into file-type to handle OOXML detection. Future refactoring can further explore the separation of ZIP-based format detection if needed.
As discussed and explained in #688, the consequence of detecting ZIP-based formats is a significant impact on parsing ZIP files. The impact is worst for generic ZIP files, as these are scanned from start to end for potential sub-file-type indicators.
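For context, "scanning from start to end" means walking the Local File Header records one after another. A simplified sketch of that walk over an in-memory buffer (it assumes no data descriptors, i.e. the sizes are present in each header):

```js
// Walk a ZIP buffer from the start, yielding the file name of each
// Local File Header ("PK\x03\x04").
const LOCAL_FILE_HEADER = 0x04034b50;

function * localFileNames(buffer) {
	let offset = 0;
	while (offset + 30 <= buffer.length && buffer.readUInt32LE(offset) === LOCAL_FILE_HEADER) {
		const compressedSize = buffer.readUInt32LE(offset + 18); // compressed size field
		const nameLength = buffer.readUInt16LE(offset + 26); // file name length
		const extraLength = buffer.readUInt16LE(offset + 28); // extra field length
		yield buffer.toString('utf8', offset + 30, offset + 30 + nameLength);
		offset += 30 + nameLength + extraLength + compressedSize; // skip to next header
	}
}
```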
Improves parse speed in case the underlying file access supports random access (Buffer, file, S3, HTTP RFC 7233 range requests).
Related to: #683, #612, #689
Resolves: #688