-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: generate and use preset compression dictionaries #197
Comments
Interesting you should bring that up ... That idea, along with block deduplication (see issue #58) and others has been on the back-burner as improvements since at least 2009 ! 2009 was when the Squashfs kernel code was mainlined by myself into the upstream kernel. One of the results of the mainlining was a commitment to keep the filesystem layout stable. This was an understandable request. because between 2002 and 2009 Squashfs had undergone rapid development, producing a whole set of incompatible versions (five layout changes in 24 releases over 6 years). But never say never, and these improvements are still on my todo list. It all depends on amount of time, priorities and what else people want done. Squashfs is still self-funded and done as unpaid voluntary work, which limits the amount of work I can do. |
Understood, although I am not sure if this (or at least a baseline version of this) would really require changes to the filesystem layout. Maybe it would be a bit hackish, but I can kinda imagine ways to implement this that would be no more disruptive than adding support for a new compression algorithm (e.g. define a new compressor ID for "zstd + preset dictionary", and store the dictionary somewhere known like in the compressor options1, in a special hidden file, or directly as a dedicated data block2), that as we know is something that has been done since mainlining. Or are there plans for a future "5.0" version of the layout? (in which case this could be done properly, including support for per-block dictionaries...) Footnotes |
ZSTD has posted some pretty big performance number improvements in their readme when using dictionaries. For myself this would be interesting to see how much the compression ratio improves for compressing a read-only filesystem (Like a Linux rootfs), since the read-only performance delta between LZ4HC and ZSTD has prevented me from using ZSTD today. |
Some compression algorithms (zstd, lz4, lzo, deflate, ...) support using a custom preset dictionary to achieve higher compression ratios especially for "small" source data sizes.
Given a corpus of known source files (or source blocks), it is thus possible to generate one or more preset dictionaries, and then use these as preset dictionaries during compression and decompression. Depending on the number and size of the source blocks/files, as well as the size and number of dictionaries, it is more or less likely that the resulting compressed size (including the size of the resulting preset dictionaries, that must be available during decompression) will be smaller than if no preset dictionaries were used.
If my understanding of squashfs is correct, squashfs could achieve higher compression ratios, especially when using small block sizes, by (optionally) attempting to generate one or more preset dictionaries based on the blocks that the image is composed of, and then attempting to compress each block using one of the generated preset dictionaries. If the resulting compressed size (+ dictionary size) is smaller than the compressed size without dictionaries, and if the extra memory usage required during decompression is acceptable, then the image compressed using the preset dictionaries would be preferable.
Note that while above I mentioned generating "one or more" dictionaries, It would be likely easier to start with a single one. Generating multiple ones would require somehow grouping "similar" files together, generating a dedicated dictionary for each group, and the use different dictionaries for different files/blocks.
An important use case for this is to efficiently compress executables that are statically compiled, and especially those that embed large runtimes, e.g. Go or Java (native image). In these cases different executables are likely to contain large identical or very similar chunks, something that a custom dictionary would be quite effective at compressing.
The text was updated successfully, but these errors were encountered: