
Add support for ZIP decompression using @tokenizer/inflate #695

Merged: 1 commit into main on Dec 17, 2024

Conversation

@Borewit (Collaborator) commented Nov 25, 2024

Changes:

- Extract the MIME type from the zipped `[Content_Types].xml` in Open Office XML (OOXML) documents
- Add dependency `@tokenizer/inflate`, used to decompress data from the tokenizer stream

Considerations:

Initially, I explored creating a separate file-type plugin to detect Open Office XML (OOXML) documents. However, since OOXML documents are ZIP archives, proper detection requires decompressing the archive, which introduces additional dependencies. These dependencies may not be desirable for users uninterested in detecting OOXML documents.

While it might make sense to separate all ZIP-based format detection into a dedicated module, this would require significant architectural changes. Specifically, consuming the underlying tokenizer stream while retaining compatibility with file-type would be challenging to implement as an independent plugin.

As a starting point, this PR integrates decompression directly into file-type to handle OOXML detection. Future refactoring can further explore the separation of ZIP-based format detection if needed.
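For illustration, the content type declared for the main part in the zipped `[Content_Types].xml` identifies the OOXML flavour. A sketch of the kind of mapping this enables (illustrative only, not the exact file-type internals):

// Illustrative: map the main part's ContentType, as declared in
// [Content_Types].xml, to the detected extension/MIME pair.
const ooxmlTypes = new Map([
	['application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml',
		{ext: 'docx', mime: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'}],
	['application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml',
		{ext: 'pptx', mime: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}],
	['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml',
		{ext: 'xlsx', mime: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'}],
]);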

As discussed and explained in #688, the consequence of detecting ZIP-based formats is a significant impact on parsing ZIP files. The impact is worst for generic ZIP files, as these are scanned from start to end for potential sub-file-type indicators.

Parse speed improves in case the underlying file access supports random access (Buffer, file, S3, HTTP RFC 7233 range requests).

Related to: #683, #612, #689

Resolves: #688

@Borewit added the improvement (Improvement of existing functionality) label Nov 25, 2024
@Borewit self-assigned this Nov 25, 2024
@Borewit requested a review from sindresorhus November 25, 2024 12:02
@scottdoc commented

This is a very good improvement, and I suspect, as you pointed out earlier, it aligns more closely with the OOXML spec than my PR (which I will delete).

However, a few things I noticed in my testing:

  1. It is still very slow reading files off AWS S3 with @tokenizer/s3

I think this is due to the hard-coded 4096-byte buffer size in the deflate module, which means it loops through and grabs only 4 kB at a time. I tried testing this with a 17 MB pptx file and it timed out after 5 minutes. Of course, it appears [Content_Types].xml is right at the end in my case.

I was able to fix this by basically downloading the whole thing, which may or may not be desirable; I would like to hear your thoughts (see the sketch after this list).

Adding this right before going into await new ZipHandler() dropped it down to about 2-3 seconds:

await tokenizer.peekBuffer(new Uint8Array(tokenizer.fileInfo.size), {mayBeLess: true});
  2. It does not catch End-Of-File errors.

That might have been an oversight, as you removed the try/catch that was in there before. The ZIP file I tested with threw one; this did not happen in the previous version.
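For anyone wanting to reproduce the workaround, here is a minimal sketch. The @tokenizer/s3 entry point name is the one mentioned later in this thread, bucket and key are placeholders, and fileTypeFromTokenizer is the regular file-type API:

import {S3Client} from '@aws-sdk/client-s3';
import {makeChunkedTokenizerFromS3} from '@tokenizer/s3';
import {fileTypeFromTokenizer} from 'file-type';

const s3 = new S3Client({});
const tokenizer = await makeChunkedTokenizerFromS3(s3, {
	Bucket: 'my-bucket', // placeholder
	Key: 'presentation.pptx', // placeholder
});

// Pre-fetch the whole object into the peek buffer, so the ZIP scan does not
// fall back to many small ranged reads against S3.
await tokenizer.peekBuffer(new Uint8Array(tokenizer.fileInfo.size), {mayBeLess: true});

const fileType = await fileTypeFromTokenizer(tokenizer);
console.log(fileType); // e.g. {ext: 'pptx', mime: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}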

@Borewit force-pushed the support-for-zip-inflate branch from 36ab77a to a20f112 on November 26, 2024 16:35
@Borewit (Collaborator, Author) commented Nov 26, 2024

It is still very slow reading files off AWS S3 with @tokenizer/s3

The primary goal of this PR was to enhance the identification of OOXML files, not directly to address the performance issue mentioned in #688. However, I've made optimizations in @tokenizer/inflate, increasing the buffer size to 256 kB and reusing the buffer. These changes should improve performance to some extent, though it may still not be optimal for all scenarios.

I was able to fix this by basically downloading the whole thing. Which may or may not be desirable and I would like to hear your thoughts.

To significantly improve performance in cases where random access to the file is possible, leveraging that random access is key. I’m considering extending the tokenizer interface to support random access where the underlying medium allows it.

That said, when working with S3, latency can become a bottleneck. Many small reads might actually perform worse than fewer large reads. Striking the right balance will depend on the specific use case and the nature of the data access patterns.
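To make that concrete, here is a rough sketch of the random-access path; readTail is a hypothetical helper, and supportsRandomAccess() plus the position read option are the strtok3 additions referenced later in this thread:

// Hypothetical helper: fetch the tail of a file with one large ranged read
// when the tokenizer supports random access, instead of streaming to the end.
async function readTail(tokenizer, length = 64 * 1024) {
	if (!tokenizer.supportsRandomAccess()) {
		throw new Error('Sequential source: must stream to the end of the file');
	}
	const buffer = new Uint8Array(length);
	// A single ranged read keeps the number of S3 round-trips, and so the
	// accumulated latency, to a minimum.
	await tokenizer.readBuffer(buffer, {position: tokenizer.fileInfo.size - length});
	return buffer;
}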

@Borewit marked this pull request as draft November 28, 2024 18:15
@Borewit force-pushed the support-for-zip-inflate branch from a20f112 to 578c97f on December 1, 2024 20:25
@Borewit changed the title from "Add support for ZIP decompression using @tokenizer/deflate" to "Add support for ZIP decompression using @tokenizer/inflate" Dec 1, 2024
@Borewit (Collaborator, Author) commented Dec 1, 2024

@scottdoc, do you mind testing this branch another time and assessing the speed of parsing files utilizing @tokenizer/s3?

It does now utilize the Central Directory section¹ of the ZIP.

Footnotes

  1. https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
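For context: the Central Directory lives at the end of the archive, so locating it starts with a scan of the file's tail for the End of Central Directory (EOCD) signature. A minimal illustrative sketch based on APPNOTE.TXT (not the actual @tokenizer/inflate code):

// The EOCD record starts with the signature "PK\x05\x06" (0x06054b50 read
// little-endian) and is at least 22 bytes long; a trailing archive comment
// can push it up to 65535 bytes away from the end of the file.
const EOCD_SIGNATURE = 0x06054b50;

function findEocdOffset(tail) {
	for (let offset = tail.length - 22; offset >= 0; offset--) {
		const view = new DataView(tail.buffer, tail.byteOffset + offset, 4);
		if (view.getUint32(0, true) === EOCD_SIGNATURE) {
			return offset; // Offset of the EOCD record within the tail buffer
		}
	}
	return -1; // No EOCD found: not a (complete) ZIP archive
}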

@Borewit force-pushed the support-for-zip-inflate branch from 578c97f to 505995b on December 2, 2024 18:28
@Borewit marked this pull request as ready for review December 2, 2024 18:36
@Borewit (Collaborator, Author) commented Dec 2, 2024

I expect it now resolves #688 as well. Ready for review.

@scottdoc commented Dec 3, 2024

@scottdoc, do you mind testing this branch another time and assess the speed parsing file utilizing @tokenizer/s3?

Thanks for the update @Borewit

First off, to test I worked out that I needed to use the latest version of @tokenizer/s3 (0.5.1), which I did. Prior to that I was getting an error: TypeError: this.tokenizer.supportsRandomAccess is not a function

I have a few files that I was testing: three ZIP files of sizes 2.8 kB, 13 MB, and 400 MB. I also had a pptx file, 17 MB.

Using makeChunkedTokenizerFromS3:

  • 2.8 kB ZIP - error TypeError: Cannot read properties of undefined (reading 'dataDescriptor') - this was thrown in the inflate lib
  • 13 MB ZIP - success, 9 s
  • 400 MB ZIP - error, timed out at 60 s
  • pptx - success, 12 s

Oddly (at least to me), using the stream tokenizer was faster. Using makeStreamingTokenizerFromS3:

  • 2.8 kB ZIP - error Error: End-Of-Stream
  • 13 MB ZIP - success, 4 s
  • 400 MB ZIP - success, 57 s
  • pptx - success, 3 s

So it looks pretty good, except there still seems to be an error for the smallest zip file. This one is just a text file (.csv) that is then zipped. The others are zipped folders, if that makes any difference?

For reference, those files were working with the old version (larger ZIPs + pptx), just taking much longer.

@Borewit marked this pull request as draft December 3, 2024 19:01
@Borewit (Collaborator, Author) commented Dec 4, 2024

Thank you for your detailed review, @scottdoc!

First off, to test I worked out that I needed to use the latest version of @tokenizer/s3 - 0.5.1 - which I did. Prior to that I was getting an error TypeError: this.tokenizer.supportsRandomAccess is not a function.

That's correct! The random access capability is a recent addition I implemented across all dependent and related packages. This feature allows efficient reading of the central directory at the end of files. One major benefit was eliminating an old "hack" in music-metadata, which previously bypassed strtok3 to directly read ID3 and APE headers at the end of files¹. This update enabled strtok3 to fully abstract media access.

Regarding the performance: chunked transfer seems to be faster for specific ZIP files, such as Open Office Documents, as we only extract the required file to determine the type. However, for generic ZIP files, the performance is slightly reduced since we must continue scanning for file signatures.

I'll take another look to optimize chunked transfer, particularly for the 400 MB ZIP file. Could you please share a copy of the failing ZIP file so I can investigate the error further?

Footnotes

  1. Utilize strtok3 random access reading Borewit/music-metadata#2298

@scottdoc commented Dec 5, 2024

OK, I'll attach a ZIP file that is failing. It's a very simple one:
somefile.csv.zip

This is the error I see when using makeStreamingTokenizerFromS3

Error: End-Of-Stream
 ❯ StreamReader.read node_modules/peek-readable/lib/AbstractStreamReader.js:27:19
 ❯ ReadStreamTokenizer.readBuffer node_modules/strtok3/lib/ReadStreamTokenizer.js:34:27
 ❯ ReadStreamTokenizer.readToken node_modules/strtok3/lib/AbstractTokenizer.js:32:21
 ❯ ZipHandler.unzip node_modules/@tokenizer/inflate/lib/index.js:151:40
 ❯ FileTypeParser.parse node_modules/file-type/core.js:443:4

And this is what I see if I use makeChunkedTokenizerFromS3:

TypeError: Cannot read properties of undefined (reading 'dataDescriptor')
 ❯ ZipHandler.unzip node_modules/@tokenizer/inflate/lib/index.js:102:27
 ❯ FileTypeParser.parse node_modules/file-type/core.js:443:4

@Borewit (Collaborator, Author) commented Dec 5, 2024

I cannot reproduce the issue, @scottdoc.
Can you check the version of @tokenizer/inflate (node_modules/@tokenizer/inflate/package.json)? It should be 0.2.4.
strtok3 should be 10.0.1.

@Borewit (Collaborator, Author) commented Dec 5, 2024

Don't waste time on further testing, @scottdoc; I will get back with a better implementation.

@scottdoc commented Dec 5, 2024

No worries. Though to answer your question: yes, those are the versions of @tokenizer/inflate and strtok3 in my node_modules.

@Borewit force-pushed the support-for-zip-inflate branch from e0e30f8 to fcf89fd on December 6, 2024 17:00
@Borewit (Collaborator, Author) commented Dec 6, 2024

@scottdoc, do you mind repeating the tests with @tokenizer/inflate 0.2.5? I hope it now parses that 400 MB file, from the S3 cloud, in chunked transfer mode, much faster.

@scottdoc commented Dec 8, 2024

Oh wow! Yeah, the makeChunkedTokenizerFromS3 one worked great -- everything passed, super fast. Processed all my test files in under a second, including the 400 MB ZIP. Nice work!

@Borewit marked this pull request as ready for review December 9, 2024 07:41
Commit message:

Changes:
- Extract mime-type from zipped `[Content_Types].xml` in Open Office XML (OOXML) documents
- Add dependency `@tokenizer/inflate`, used to decompress from the tokenizer stream
@Borewit force-pushed the support-for-zip-inflate branch from fcf89fd to 250553d on December 9, 2024 16:29
@Borewit (Collaborator, Author) commented Dec 17, 2024

@sindresorhus, how do you feel about moving formats which are based on ZIP (except the ZIP archive itself) to a file-type plugin module/project?

If you support that, I will create an alternative PR providing the same functionality as this PR, but moving the formats based on the ZIP archive (except ZIP itself, as that is a simple signature check) to the external plugin.

@sindresorhus (Owner) commented

I prefer to keep it here. Some ZIP based formats are popular, so most consumers will need it anyways. We can consider moving out some obscure formats to a plugin in the future.

@sindresorhus merged commit 399b0f1 into main Dec 17, 2024
6 checks passed
@sindresorhus deleted the support-for-zip-inflate branch December 17, 2024 21:53
@Borewit (Collaborator, Author) commented Dec 18, 2024

I prefer to keep it here. Some ZIP based formats are popular, so most consumers will need it anyways. We can consider moving out some obscure formats to a plugin in the future.

Makes perfect sense to keep the most popular in file-type, but keeping the core lightweight also has advantages for some use cases.

I just created an extension (plugin) for file-type to support XML-based formats: @file-type/xml. I had a similar solution in mind for ZIP-based formats. In theory, although I have certainly no plans to do this, something similar could be done for the Compound File Binary Format (the old Microsoft Office document formats).
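For reference, wiring such a plugin into file-type looks roughly like this; a sketch only, where the NodeFileTypeParser/detectXml export names and the sample file are assumptions:

import {NodeFileTypeParser} from 'file-type';
import {detectXml} from '@file-type/xml';

// Custom detectors run alongside the built-in signature checks.
const parser = new NodeFileTypeParser({customDetectors: [detectXml]});
const fileType = await parser.fromFile('drawing.svg'); // hypothetical sample file
console.log(fileType);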

Linked issue: Detecting a pptx taking a very long time