[Feature] Arrow Log Supports Compressions #187

Open
1 of 2 tasks
wuchong opened this issue Dec 14, 2024 · 0 comments
wuchong commented Dec 14, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Currently, Arrow Log has no compression. However, the Arrow format consumes more disk space than other formats such as Avro, because Arrow is designed as an in-memory layout and aligns its buffers on 8-byte boundaries. In some tests, the Arrow file was 30% larger than the Avro file for the same data set.

Solution

Fortunately, the Apache Arrow community has already introduced buffer compression: https://arrow.apache.org/docs/format/Columnar.html#compression. This can substantially reduce network cost and disk space while keeping the column pruning ability, because each buffer is compressed independently. Arrow already supports the LZ4 and ZSTD compression types.
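To make the buffer-level idea concrete, here is a minimal sketch of the framing described in the linked format document: each buffer is prefixed with its uncompressed length as an 8-byte little-endian integer, and a length of -1 marks a body that was stored as-is. The class and method names are hypothetical, and `java.util.zip` DEFLATE is used only as a stand-in because the JDK ships neither LZ4 nor ZSTD, so the bytes produced here are not valid Arrow IPC data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch of per-buffer framing; DEFLATE stands in for LZ4/ZSTD. */
final class BufferFramingSketch {
    /** Written instead of a length when the body is stored uncompressed. */
    static final long UNCOMPRESSED = -1L;
    static final int PREFIX_BYTES = 8;

    /** Frames one buffer: 8-byte little-endian uncompressed length, then the body. */
    static byte[] frame(byte[] raw) {
        byte[] compressed = deflate(raw);
        // Keep the raw bytes when compression does not make the buffer smaller.
        boolean useCompressed = compressed.length < raw.length;
        byte[] body = useCompressed ? compressed : raw;
        ByteBuffer out = ByteBuffer.allocate(PREFIX_BYTES + body.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        out.putLong(useCompressed ? raw.length : UNCOMPRESSED);
        out.put(body);
        return out.array();
    }

    /** Reverses frame(): reads the prefix, then inflates the body if needed. */
    static byte[] unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed).order(ByteOrder.LITTLE_ENDIAN);
        long uncompressedLength = in.getLong();
        byte[] body = new byte[in.remaining()];
        in.get(body);
        if (uncompressedLength == UNCOMPRESSED) {
            return body;
        }
        byte[] raw = new byte[(int) uncompressedLength];
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(body);
            int filled = 0;
            while (filled < raw.length) {
                int n = inflater.inflate(raw, filled, raw.length - filled);
                if (n == 0) {
                    break; // no progress: truncated or corrupt body
                }
                filled += n;
            }
            if (filled != raw.length) {
                throw new IllegalStateException("Body shorter than declared length");
            }
            return raw;
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt compressed body", e);
        } finally {
            inflater.end();
        }
    }

    private static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] statusColumn = new byte[8192];
        Arrays.fill(statusColumn, (byte) 'A'); // highly repetitive column data
        byte[] tinyColumn = {1, 2, 3};         // too small to benefit

        byte[] framedStatus = frame(statusColumn);
        byte[] framedTiny = frame(tinyColumn);
        System.out.println("status: " + statusColumn.length + " -> " + framedStatus.length + " bytes");
        System.out.println("tiny:   " + tinyColumn.length + " -> " + framedTiny.length + " bytes");

        // Column pruning: only the buffer that is actually read gets decompressed.
        byte[] restored = unframe(framedStatus);
        System.out.println("round trip ok: " + Arrays.equals(restored, statusColumn));
    }
}
```

Because every buffer carries its own prefix, a reader that projects a subset of columns only has to decompress the buffers belonging to those columns, which is why pruning survives compression.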

More benchmark results should be attached to quantify the compression benefits.

Anything else?

Currently, Fluss still needs to support Java 8, which is why Fluss stays on Arrow 15.0 (the latest is v18.0). The Arrow 15.0 Java client already provides the CompressionCodec interfaces for ArrowStreamWriter, but doesn't natively support LZ4 and ZSTD compression (these are implemented in later versions). It should still work if we study the upstream implementation and provide our own CompressionCodec.Factory: https://github.com/apache/arrow-java/commits/main/compression/src/main/java/org/apache/arrow/compression
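As a rough illustration of what "our own factory" could look like, the sketch below uses local stand-in types so that it compiles on Java 8 without Arrow on the classpath. `CodecType`, `Codec`, `Factory` and `RegistryFactory` are all hypothetical names that only loosely echo the interface mentioned above; the real Arrow interfaces work on Arrow's own buffer types rather than byte arrays, so treat this as the shape of the idea, not the actual API.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical stand-ins showing a pluggable codec factory; not the Arrow API. */
final class CodecFactorySketch {
    /** Local enum for illustration only. */
    enum CodecType { NO_COMPRESSION, LZ4_FRAME, ZSTD }

    /** Simplified codec contract on byte arrays (assumption for this sketch). */
    interface Codec {
        byte[] compress(byte[] uncompressed);
        byte[] decompress(byte[] compressed);
        CodecType type();
    }

    interface Factory {
        Codec createCodec(CodecType type);
    }

    /** Pass-through codec, always available. */
    static final class NoCompressionCodec implements Codec {
        @Override public byte[] compress(byte[] uncompressed) { return uncompressed.clone(); }
        @Override public byte[] decompress(byte[] compressed) { return compressed.clone(); }
        @Override public CodecType type() { return CodecType.NO_COMPRESSION; }
    }

    /** Factory that only hands out codecs that were explicitly registered. */
    static final class RegistryFactory implements Factory {
        private final Map<CodecType, Codec> codecs = new EnumMap<>(CodecType.class);

        RegistryFactory() {
            register(new NoCompressionCodec());
        }

        /** Project-provided LZ4/ZSTD codecs would be registered here. */
        RegistryFactory register(Codec codec) {
            codecs.put(codec.type(), codec);
            return this;
        }

        @Override
        public Codec createCodec(CodecType type) {
            Codec codec = codecs.get(type);
            if (codec == null) {
                throw new IllegalArgumentException("No codec registered for " + type);
            }
            return codec;
        }
    }

    public static void main(String[] args) {
        RegistryFactory factory = new RegistryFactory();
        Codec none = factory.createCodec(CodecType.NO_COMPRESSION);
        byte[] data = {1, 2, 3};
        System.out.println("round trip ok: "
                + Arrays.equals(none.decompress(none.compress(data)), data));
        try {
            factory.createCodec(CodecType.ZSTD);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast on an unregistered type keeps a misconfigured writer from silently producing uncompressed logs; whether Fluss would prefer a silent fallback instead is an open design choice.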

Willingness to contribute

  • I'm willing to submit a PR!

wuchong added the "feature" (New feature or request) label on Dec 14, 2024
wuchong added this to the v0.6 milestone on Dec 14, 2024