Search before asking
I searched in the issues and found nothing similar.
Motivation
Currently, we don't have any compression for Arrow Log. However, the Arrow format consumes more disk space than other formats such as Avro, because Arrow is designed as an in-memory layout and pads its buffers to 8-byte boundaries. In some tests, we saw an Arrow file that was 30% larger than the Avro file for the same data set.
Solution
Fortunately, the Apache Arrow community has already introduced buffer compression: https://arrow.apache.org/docs/format/Columnar.html#compression. This can substantially reduce network cost and disk space while still preserving column pruning, because each buffer is compressed separately. Arrow already supports the LZ4 and ZSTD compression types.
More benchmark results should be attached to quantify the compression benefits.
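To make the buffer-level scheme concrete, here is a minimal sketch of the framing described in the linked Arrow format document: each compressed buffer starts with an 8-byte little-endian signed integer holding the uncompressed length, and a prefix of -1 marks a body that was left uncompressed. The class name `CompressedBufferFraming` and its methods are hypothetical, and `java.util.zip.Deflater` is only a JDK stand-in for LZ4/ZSTD so the example runs on plain Java 8. It is not an Arrow-supported codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch of Arrow-style per-buffer framing; Deflater stands in for LZ4/ZSTD. */
public class CompressedBufferFraming {
    static final int PREFIX_BYTES = 8;      // int64 uncompressed-length prefix
    static final long NOT_COMPRESSED = -1L; // prefix value meaning "body is stored as-is"

    /** Returns prefix + body; falls back to the raw bytes when compression does not help. */
    public static byte[] frame(byte[] raw) {
        byte[] body = deflate(raw);
        long prefix = raw.length;
        if (body.length >= raw.length) {
            body = raw;
            prefix = NOT_COMPRESSED;
        }
        ByteBuffer out = ByteBuffer.allocate(PREFIX_BYTES + body.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        out.putLong(prefix);
        out.put(body);
        return out.array();
    }

    /** Reads the prefix and restores the original buffer contents. */
    public static byte[] unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed).order(ByteOrder.LITTLE_ENDIAN);
        long prefix = in.getLong();
        byte[] body = new byte[in.remaining()];
        in.get(body);
        if (prefix == NOT_COMPRESSED) {
            return body;
        }
        return inflate(body, (int) prefix);
    }

    private static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] body, int uncompressedLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(body);
        byte[] raw = new byte[uncompressedLength];
        try {
            int off = 0;
            while (off < raw.length && !inflater.finished()) {
                int n = inflater.inflate(raw, off, raw.length - off);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input
                }
                off += n;
            }
            if (off != raw.length) {
                throw new IllegalStateException("compressed buffer shorter than its prefix claims");
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed buffer", e);
        } finally {
            inflater.end();
        }
        return raw;
    }
}
```

Because every buffer is framed on its own, a reader can skip the buffers of columns it does not need without decompressing them, which is why column pruning still works.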
Anything else?
Currently, Fluss still needs to support Java 8, which is why it stays on Arrow 15.0 (the latest is v18.0). The Arrow 15.0 Java client already provides the CompressionCodec interfaces for ArrowStreamWriter, but doesn't natively support LZ4 and ZSTD compression (implemented in later versions). It should still work if we follow the upstream implementation and provide our own CompressionCodec.Factory: https://github.com/apache/arrow-java/commits/main/compression/src/main/java/org/apache/arrow/compression
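As a rough illustration of the "own factory" idea, the sketch below shows a registry-style factory that maps a codec type to a codec implementation and rejects types that have no implementation plugged in. Everything here (`CodecFactorySketch`, `BufferCodec`, `RegistryFactory`, and the local `CodecType` enum) is a hypothetical stand-in written against plain `byte[]` so it runs on Java 8 without Arrow on the classpath. These are not Arrow's classes, and a real implementation would implement Arrow's `CompressionCodec.Factory` and operate on Arrow buffers instead.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of a pluggable codec factory; local types only, not Arrow's API. */
public class CodecFactorySketch {

    /** Local stand-in for the codec types named in the issue. */
    public enum CodecType { NO_COMPRESSION, LZ4_FRAME, ZSTD }

    /** Local stand-in for a buffer codec. */
    public interface BufferCodec {
        byte[] compress(byte[] uncompressed);
        byte[] decompress(byte[] compressed);
        CodecType type();
    }

    /** Codec that copies bytes through unchanged. */
    public static final class PassThroughCodec implements BufferCodec {
        @Override public byte[] compress(byte[] uncompressed) { return uncompressed.clone(); }
        @Override public byte[] decompress(byte[] compressed) { return compressed.clone(); }
        @Override public CodecType type() { return CodecType.NO_COMPRESSION; }
    }

    /** Factory that only hands out codecs that were explicitly registered. */
    public static final class RegistryFactory {
        private final Map<CodecType, BufferCodec> codecs = new EnumMap<>(CodecType.class);

        public RegistryFactory() {
            register(new PassThroughCodec());
        }

        /** LZ4/ZSTD implementations backed by a Java 8 compatible library would be added here. */
        public RegistryFactory register(BufferCodec codec) {
            codecs.put(codec.type(), codec);
            return this;
        }

        public BufferCodec createCodec(CodecType type) {
            BufferCodec codec = codecs.get(type);
            if (codec == null) {
                throw new IllegalArgumentException("No codec registered for " + type);
            }
            return codec;
        }
    }
}
```

Failing fast on an unregistered type keeps a misconfigured writer from silently producing uncompressed data while the LZ4 and ZSTD codecs are still being ported.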
Willingness to contribute
I'm willing to submit a PR!