
Add support for ZIP decompression using @tokenizer/inflate #695

Merged: 1 commit into main on Dec 17, 2024

Conversation

@Borewit (Collaborator) commented Nov 25, 2024

Changes:

- Extract the MIME type from the zipped `[Content_Types].xml` in Open Office XML (OOXML) documents
- Add dependency `@tokenizer/inflate`, used to decompress data from the tokenizer stream

Considerations:

Initially, I explored creating a separate file-type plugin to detect Open Office XML (OOXML) documents. However, since OOXML documents are ZIP archives, proper detection requires decompressing the archive, which introduces additional dependencies. These dependencies may not be desirable for users uninterested in detecting OOXML documents.

While it might make sense to separate all ZIP-based format detection into a dedicated module, this would require significant architectural changes. Specifically, consuming the underlying tokenizer stream while retaining compatibility with file-type would be challenging to implement as an independent plugin.

As a starting point, this PR integrates decompression directly into file-type to handle OOXML detection. Future refactoring can further explore the separation of ZIP-based format detection if needed.
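For illustration, the content type declared for the main part in the zipped `[Content_Types].xml` identifies the OOXML flavour. A sketch of the kind of mapping this enables (illustrative only, not the exact file-type internals):

// Illustrative: map the main part's ContentType, as declared in
// [Content_Types].xml, to the detected extension/MIME pair.
const ooxmlTypes = new Map([
	['application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml',
		{ext: 'docx', mime: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'}],
	['application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml',
		{ext: 'pptx', mime: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}],
	['application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml',
		{ext: 'xlsx', mime: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'}],
]);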

As discussed and explained in #688, the consequence of detecting ZIP-based formats is a significant impact on parsing ZIP files. The impact is worst for generic ZIP files, as these are scanned from start to end for potential sub-file-type indicators.

Parse speed improves in case the underlying file access supports random access (Buffer, file, S3, HTTP RFC 7233 range requests).

Related to: #683, #612, #689

Resolves: #688

@Borewit added the improvement (Improvement of existing functionality) label Nov 25, 2024
@Borewit self-assigned this Nov 25, 2024
@Borewit requested a review from sindresorhus November 25, 2024 12:02
@scottdoc commented

This is a very good improvement, and I suspect, as you pointed out earlier, it aligns more closely with the OOXML spec than my PR (which I will delete).

However, a few things I noticed in my testing:

  1. It is still very slow reading files off AWS S3 with @tokenizer/s3

I think this is due to the hard-coded 4096-byte buffer size in the deflate module, which means it loops through and grabs only 4 kB at a time. I tried testing this with a 17 MB pptx file and it timed out after 5 minutes. Of course, it appears [Content_Types].xml is right at the end in my case.

I was able to fix this by basically downloading the whole thing, which may or may not be desirable; I would like to hear your thoughts (see the sketch after this list).

Adding this right before going into await new ZipHandler() dropped it down to about 2-3 seconds:

await tokenizer.peekBuffer(new Uint8Array(tokenizer.fileInfo.size), {mayBeLess: true});
  2. It does not catch End-Of-File errors.

That might have been an oversight, as you removed the try/catch that was in there before. The ZIP file I tested with threw one; this did not happen in the previous version.
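For anyone wanting to reproduce the workaround, here is a minimal sketch. The @tokenizer/s3 entry point name is the one mentioned later in this thread, bucket and key are placeholders, and fileTypeFromTokenizer is the regular file-type API:

import {S3Client} from '@aws-sdk/client-s3';
import {makeChunkedTokenizerFromS3} from '@tokenizer/s3';
import {fileTypeFromTokenizer} from 'file-type';

const s3 = new S3Client({});
const tokenizer = await makeChunkedTokenizerFromS3(s3, {
	Bucket: 'my-bucket', // placeholder
	Key: 'presentation.pptx', // placeholder
});

// Pre-fetch the whole object into the peek buffer, so the ZIP scan does not
// fall back to many small ranged reads against S3.
await tokenizer.peekBuffer(new Uint8Array(tokenizer.fileInfo.size), {mayBeLess: true});

const fileType = await fileTypeFromTokenizer(tokenizer);
console.log(fileType); // e.g. {ext: 'pptx', mime: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}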

@Borewit force-pushed the support-for-zip-inflate branch from 36ab77a to a20f112 on November 26, 2024 16:35
@Borewit (Collaborator, Author) commented Nov 26, 2024

It is still very slow reading files off AWS S3 with @tokenizer/s3

The primary goal of this PR was to enhance the identification of OOXML files, not directly to address the performance issue mentioned in #688. However, I've made optimizations in @tokenizer/inflate, increasing the buffer size to 256 kB and reusing the buffer. These changes should improve performance to some extent, though it may still not be optimal for all scenarios.

I was able to fix this by basically downloading the whole thing. Which may or may not be desirable and I would like to hear your thoughts.

To significantly improve performance in cases where random access to the file is possible, leveraging that random access is key. I’m considering extending the tokenizer interface to support random access where the underlying medium allows it.

That said, when working with S3, latency can become a bottleneck. Many small reads might actually perform worse than fewer large reads. Striking the right balance will depend on the specific use case and the nature of the data access patterns.
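To make that concrete, here is a rough sketch of the random-access path; readTail is a hypothetical helper, and supportsRandomAccess() plus the position read option are the strtok3 additions referenced later in this thread:

// Hypothetical helper: fetch the tail of a file with one large ranged read
// when the tokenizer supports random access, instead of streaming to the end.
async function readTail(tokenizer, length = 64 * 1024) {
	if (!tokenizer.supportsRandomAccess()) {
		throw new Error('Sequential source: must stream to the end of the file');
	}
	const buffer = new Uint8Array(length);
	// A single ranged read keeps the number of S3 round-trips, and so the
	// accumulated latency, to a minimum.
	await tokenizer.readBuffer(buffer, {position: tokenizer.fileInfo.size - length});
	return buffer;
}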

@Borewit marked this pull request as draft November 28, 2024 18:15
@Borewit force-pushed the support-for-zip-inflate branch from a20f112 to 578c97f on December 1, 2024 20:25
@Borewit changed the title from "Add support for ZIP decompression using @tokenizer/deflate" to "Add support for ZIP decompression using @tokenizer/inflate" Dec 1, 2024
@Borewit (Collaborator, Author) commented Dec 1, 2024

@scottdoc, do you mind testing this branch another time and assessing the speed of parsing files utilizing @tokenizer/s3?

It does now utilize the Central Directory section¹ of the ZIP.

Footnotes

  1. https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
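For context: the Central Directory lives at the end of the archive, so locating it starts with a scan of the file's tail for the End of Central Directory (EOCD) signature. A minimal illustrative sketch based on APPNOTE.TXT (not the actual @tokenizer/inflate code):

// The EOCD record starts with the signature "PK\x05\x06" (0x06054b50 read
// little-endian) and is at least 22 bytes long; a trailing archive comment
// can push it up to 65535 bytes away from the end of the file.
const EOCD_SIGNATURE = 0x06054b50;

function findEocdOffset(tail) {
	for (let offset = tail.length - 22; offset >= 0; offset--) {
		const view = new DataView(tail.buffer, tail.byteOffset + offset, 4);
		if (view.getUint32(0, true) === EOCD_SIGNATURE) {
			return offset; // Offset of the EOCD record within the tail buffer
		}
	}
	return -1; // No EOCD found: not a (complete) ZIP archive
}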

@Borewit force-pushed the support-for-zip-inflate branch from 578c97f to 505995b on December 2, 2024 18:28
@Borewit marked this pull request as ready for review December 2, 2024 18:36
@Borewit (Collaborator, Author) commented Dec 2, 2024

I expect it now resolves #688 as well. Ready for review.

@scottdoc commented Dec 3, 2024

@scottdoc, do you mind testing this branch another time and assess the speed parsing file utilizing @tokenizer/s3?

Thanks for the update @Borewit

First off, to test I worked out that I needed to use the latest version of @tokenizer/s3 (0.5.1), which I did. Prior to that I was getting an error: TypeError: this.tokenizer.supportsRandomAccess is not a function

I have a few files that I was testing: three ZIP files of sizes 2.8 kB, 13 MB, and 400 MB. I also had a pptx file, 17 MB.

Using makeChunkedTokenizerFromS3:

  • 2.8 kB ZIP - error TypeError: Cannot read properties of undefined (reading 'dataDescriptor') - this was thrown in the inflate lib
  • 13 MB ZIP - success, 9 s
  • 400 MB ZIP - error, timed out at 60 s
  • pptx - success, 12 s

Oddly (at least to me), using the stream tokenizer was faster. Using makeStreamingTokenizerFromS3:

  • 2.8 kB ZIP - error Error: End-Of-Stream
  • 13 MB ZIP - success, 4 s
  • 400 MB ZIP - success, 57 s
  • pptx - success, 3 s

So it looks pretty good, except there still seems to be an error for the smallest zip file. This one is just a text file (.csv) that is then zipped. The others are zipped folders, if that makes any difference?

For reference, those files were working with the old version (larger ZIPs + pptx), just taking much longer.

@Borewit marked this pull request as draft December 3, 2024 19:01
@Borewit (Collaborator, Author) commented Dec 4, 2024

Thank you for your detailed review, @scottdoc!

First off, to test I worked out that I needed to use the latest version of @tokenizer/s3 - 0.5.1 - which I did. Prior to that I was getting an error TypeError: this.tokenizer.supportsRandomAccess is not a function.

That's correct! The random access capability is a recent addition I implemented across all dependent and related packages. This feature allows efficient reading of the central directory at the end of files. One major benefit was eliminating an old "hack" in music-metadata, which previously bypassed strtok3 to directly read ID3 and APE headers at the end of files¹. This update enabled strtok3 to fully abstract media access.

Regarding the performance: chunked transfer seems to be faster for specific ZIP files, such as Open Office Documents, as we only extract the required file to determine the type. However, for generic ZIP files, the performance is slightly reduced since we must continue scanning for file signatures.

I'll take another look to optimize chunked transfer, particularly for the 400 MB ZIP file. Could you please share a copy of the failing ZIP file so I can investigate the error further?

Footnotes

  1. Utilize strtok3 random access reading Borewit/music-metadata#2298

@scottdoc commented Dec 5, 2024

OK, I'll attach a ZIP file that is failing. It's a very simple one:
somefile.csv.zip

This is the error I see when using makeStreamingTokenizerFromS3

Error: End-Of-Stream
 ❯ StreamReader.read node_modules/peek-readable/lib/AbstractStreamReader.js:27:19
 ❯ ReadStreamTokenizer.readBuffer node_modules/strtok3/lib/ReadStreamTokenizer.js:34:27
 ❯ ReadStreamTokenizer.readToken node_modules/strtok3/lib/AbstractTokenizer.js:32:21
 ❯ ZipHandler.unzip node_modules/@tokenizer/inflate/lib/index.js:151:40
 ❯ FileTypeParser.parse node_modules/file-type/core.js:443:4

And this is what I see if I use makeChunkedTokenizerFromS3:

TypeError: Cannot read properties of undefined (reading 'dataDescriptor')
 ❯ ZipHandler.unzip node_modules/@tokenizer/inflate/lib/index.js:102:27
 ❯ FileTypeParser.parse node_modules/file-type/core.js:443:4

@Borewit (Collaborator, Author) commented Dec 5, 2024

I cannot reproduce the issue, @scottdoc.
Can you check the version of @tokenizer/inflate (node_modules/@tokenizer/inflate/package.json)? It should be 0.2.4.
strtok3 should be 10.0.1.

@Borewit (Collaborator, Author) commented Dec 5, 2024

Don't waste time on further testing, @scottdoc; I will get back with a better implementation.

@scottdoc commented Dec 5, 2024

No worries. Though to answer your question: yes, those are the versions of @tokenizer/inflate and strtok3 in my node_modules.

@Borewit force-pushed the support-for-zip-inflate branch from e0e30f8 to fcf89fd on December 6, 2024 17:00
@Borewit (Collaborator, Author) commented Dec 6, 2024

@scottdoc, do you mind repeating the tests with @tokenizer/inflate 0.2.5? I hope it now parses that 400 MB file, from the S3 cloud, in chunked transfer mode, much faster.

@scottdoc commented Dec 8, 2024

Oh wow! Yeah, the makeChunkedTokenizerFromS3 one worked great -- everything passed, super fast. Processed all my test files in under a second, including the 400 MB ZIP. Nice work!

@Borewit marked this pull request as ready for review December 9, 2024 07:41
Commit message:

Changes:
- Extract mime-type from zipped `[Content_Types].xml` in Open Office XML (OOXML) documents
- Add dependency `@tokenizer/inflate`, used to decompress from the tokenizer stream
@Borewit force-pushed the support-for-zip-inflate branch from fcf89fd to 250553d on December 9, 2024 16:29
@Borewit (Collaborator, Author) commented Dec 17, 2024

@sindresorhus, how do you feel about moving formats which are based on ZIP (except the ZIP archive itself) to a file-type plugin module/project?

If you support that, I will create an alternative PR providing the same functionality as this PR, but moving the formats based on the ZIP archive (except ZIP itself, as that is a simple signature check) to the external plugin.

@sindresorhus (Owner) commented

I prefer to keep it here. Some ZIP based formats are popular, so most consumers will need it anyways. We can consider moving out some obscure formats to a plugin in the future.

@sindresorhus merged commit 399b0f1 into main Dec 17, 2024
6 checks passed
@sindresorhus deleted the support-for-zip-inflate branch December 17, 2024 21:53
@Borewit (Collaborator, Author) commented Dec 18, 2024

I prefer to keep it here. Some ZIP based formats are popular, so most consumers will need it anyways. We can consider moving out some obscure formats to a plugin in the future.

Makes perfect sense to keep the most popular in file-type, but keeping the core lightweight also has advantages for some use cases.

I just created an extension (plugin) for file-type to support XML-based formats: @file-type/xml. I had a similar solution in mind for ZIP-based formats. In theory, although I have certainly no plans to do this, something similar could be done for the Compound File Binary Format (the old Microsoft Office document formats).
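For reference, wiring such a plugin into file-type looks roughly like this; a sketch only, where the NodeFileTypeParser/detectXml export names and the sample file are assumptions:

import {NodeFileTypeParser} from 'file-type';
import {detectXml} from '@file-type/xml';

// Custom detectors run alongside the built-in signature checks.
const parser = new NodeFileTypeParser({customDetectors: [detectXml]});
const fileType = await parser.fromFile('drawing.svg'); // hypothetical sample file
console.log(fileType);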

Linked issue: Detecting a pptx taking a very long time