Support for Apache Arrow data representations #436

fniephaus · 2024-10-23T06:58:34Z

TL;DR

Recently, many Python libraries have integrated Apache Arrow to leverage its high-performance and memory-efficient format for handling large datasets. Given the widespread adoption of Arrow in data science and big data ecosystems, we plan to add Apache Arrow support to GraalPy.

Goals

The primary goal of adding Apache Arrow support to GraalPy is to improve interoperability and performance when working with libraries such as Pandas. List-like structures in GraalPy will be backed by the Apache Arrow format, so they can be handed to those libraries as zero-copy data transfers. Data can then pass between GraalPy and Pandas without duplicating memory, which matters most for large datasets.
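
To make "zero-copy" concrete, here is a minimal sketch that uses only the Python standard library. It does not use Arrow, Pandas, or any GraalPy-specific API; `array` and `memoryview` merely stand in for an Arrow-backed list and its consumer, and the variable names are illustrative.

```python
from array import array

# A contiguous buffer of 64-bit floats, standing in for a column of data.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the same memory through the buffer protocol; no bytes are copied.
view = memoryview(column)

# Slicing a memoryview is also zero-copy: the slice refers to the original memory.
tail = view[2:]

# A write through the view is visible in the original, which shows the memory is shared.
tail[0] = 30.0

# By contrast, tolist() materializes a new, independent Python list (a copy).
copied = column.tolist()
copied[0] = -1.0

print(column[2])    # value written through the view
print(column[0])    # unaffected by the change to the copy
print(view.nbytes)  # 4 elements * 8 bytes each
```

The shared view costs the same regardless of how large the column is, while the copy grows with the data. That difference is what the zero-copy goal is about.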

Another key goal is to facilitate full interoperability with the Java implementation of Apache Arrow. This will enable users to load data in Java, execute Python-based data analysis using Pandas, and return results to Java, all without any memory copies, ensuring smooth, high-performance cross-language workflows.
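
Sharing without copies across languages works because Arrow fixes the in-memory byte layout, so both sides read the same buffers. As a rough illustration (not GraalPy or Arrow library code), the sketch below hand-builds the two buffers of a nullable 32-bit integer column: a validity bitmap and a little-endian values buffer. The function names are hypothetical, and the padding and alignment that Arrow expects are left out for brevity.

```python
import struct

def encode_int32_column(values):
    """Pack a list of ints/None into (validity_bitmap, values_buffer) bytes."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per slot, rounded up to whole bytes
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            # Bits are numbered from the least-significant bit; 1 means "valid".
            bitmap[i // 8] |= 1 << (i % 8)
        # Null slots still occupy 4 bytes in the values buffer; zero is used as filler here.
        data += struct.pack("<i", 0 if v is None else v)
    return bytes(bitmap), bytes(data)

def decode_int32_column(bitmap, data, length):
    """Read the column back from the two buffers."""
    out = []
    for i in range(length):
        valid = (bitmap[i // 8] >> (i % 8)) & 1
        out.append(struct.unpack_from("<i", data, i * 4)[0] if valid else None)
    return out

bitmap, data = encode_int32_column([1, None, 3])
roundtrip = decode_int32_column(bitmap, data, 3)
```

Any runtime that understands this layout, whether Java or Python, can interpret the same bytes in place, so no conversion step is needed between them.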

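Lastly, by allocating memory off-heap, GraalPy can create buffers larger than the ~2GB ceiling that the JVM imposes on a single byte[]. Java arrays are indexed by int, so new byte[Integer.MAX_VALUE] is the upper bound. Off-heap buffers make GraalPy capable of handling much larger datasets.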
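
For a sense of scale, the arithmetic behind that limit can be worked out directly. This is a small sketch; the helper name `max_elements` is made up for illustration, and real JVMs may cap array length slightly below Integer.MAX_VALUE.

```python
# Java arrays are indexed by int, so a single byte[] holds at most Integer.MAX_VALUE bytes.
JAVA_INT_MAX = 2**31 - 1  # 2_147_483_647

def max_elements(element_size_bytes: int) -> int:
    """Largest number of fixed-width values that fit in one maximal byte[]."""
    return JAVA_INT_MAX // element_size_bytes

limit_gib = JAVA_INT_MAX / 2**30  # just under 2 GiB
max_int64 = max_elements(8)       # 64-bit integers or doubles
max_int32 = max_elements(4)       # 32-bit integers or floats

print(f"{limit_gib:.4f} GiB, {max_int64:,} int64 values, {max_int32:,} int32 values")
```

A single column of 64-bit values backed by one byte[] therefore tops out at roughly 268 million entries, which large analytical datasets can exceed.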

Non-Goals

Replacement of existing data structures. The goal is not to replace all existing data structures. Only specific use cases will benefit from this integration.
Memory optimization. The focus is not on optimizing memory usage or speeding up operations on the existing structures.
Addressing other JVM constraints. While off-heap allocation bypasses the JVM's ~2GB array size limit, addressing other JVM-related memory constraints is out of scope for this integration.

fniephaus added the enhancement New feature or request label Oct 23, 2024

fniephaus added this to the 24.2.0 Release (March 18, 2025) milestone Oct 23, 2024

fniephaus assigned horakivo Oct 23, 2024

fniephaus added this to GraalPy Roadmap Oct 23, 2024

fniephaus moved this to In Progress in GraalPy Roadmap Oct 23, 2024
