Support for Apache Arrow data representations #436

Open
fniephaus opened this issue Oct 23, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@fniephaus
Member

TL;DR

Many Python libraries have recently integrated Apache Arrow to take advantage of its high-performance, memory-efficient columnar format for handling large datasets. Given the widespread adoption of Arrow in data science and big data ecosystems, we plan to add Apache Arrow support to GraalPy.

Goals

The primary goal of adding Apache Arrow support to GraalPy is to improve interoperability and performance when working with libraries such as pandas. List-like structures in GraalPy will be backed by the Apache Arrow format, allowing zero-copy data exchange with those libraries. Data can then be passed between GraalPy and pandas without duplicating memory, which should significantly improve performance, especially for large datasets.
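
To make the zero-copy idea concrete, here is a minimal sketch using only the Python standard library. It is not GraalPy's implementation and does not use Arrow itself. It only mimics the general shape of an Arrow primitive array (a contiguous values buffer plus a validity bitmap, least-significant bit first) and shows that a `memoryview` consumer reads the producer's memory rather than a copy. The helper names `make_int64_column` and `read_column` are hypothetical.

```python
import struct

def make_int64_column(values):
    """Pack ints (None = null) into a values buffer and a validity bitmap."""
    n = len(values)
    data = bytearray(8 * n)             # contiguous little-endian int64 slots
    validity = bytearray((n + 7) // 8)  # one bit per slot, LSB first, 1 = valid
    for i, v in enumerate(values):
        if v is not None:
            struct.pack_into("<q", data, 8 * i, v)
            validity[i // 8] |= 1 << (i % 8)
    return data, validity

def read_column(data_view, validity, n):
    """Decode n slots from a buffer view; null slots come back as None."""
    out = []
    for i in range(n):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            out.append(struct.unpack_from("<q", data_view, 8 * i)[0])
        else:
            out.append(None)
    return out

# Producer side: build the column once.
data, validity = make_int64_column([10, None, 30])

# Consumer side: a memoryview shares the producer's memory, no copy is made.
view = memoryview(data)

# A later write by the producer is visible through the consumer's view,
# which would not be the case if the data had been copied.
struct.pack_into("<q", data, 0, 99)
result = read_column(view, validity, 3)
```

Because `view` refers to the same bytes as `data`, handing the column to a consumer costs the same regardless of how large the column is. That property is what the zero-copy goal above is after.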

Another key goal is full interoperability with the Java implementation of Apache Arrow. This will enable users to load data in Java, run Python-based data analysis using pandas, and return the results to Java, all without copying memory, for smooth, high-performance cross-language workflows.

Lastly, by allocating memory off-heap, GraalPy can allocate byte buffers beyond the ~2 GB size limit of a JVM array (new byte[Integer.MAX_VALUE]), making it capable of handling much larger datasets.
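
As a rough back-of-the-envelope check of what that limit means in practice, the following sketch computes how many values fit in a single on-heap byte array. It assumes 8-byte (int64) values and takes the `Integer.MAX_VALUE` bound quoted above at face value; the practical limit may be slightly lower on some JVMs.

```python
INT_MAX = 2**31 - 1   # value of Java's Integer.MAX_VALUE
INT64_SIZE = 8        # bytes per 64-bit integer (assumed element type)

# Largest number of int64 values a single byte[] of that size could hold.
max_int64_values = INT_MAX // INT64_SIZE

# The same bound expressed in GiB (just under 2).
max_gib = INT_MAX / 2**30
```

Under these assumptions, a single column of about 268 million int64 values already reaches the array limit, which is why off-heap allocation matters for larger datasets.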

Non-Goals

  • Replacement of existing data structures. The goal is not to replace all existing data structures. Only specific use cases will benefit from this integration.
  • Memory optimization. The focus is not on optimizing memory usage or speeding up operations on the existing structures.
  • Addressing other JVM constraints. While the off-heap memory allocation helps bypass the JVM's 2GB limitation, addressing other JVM-related memory constraints is not in the scope of this integration.
@fniephaus fniephaus added the enhancement New feature or request label Oct 23, 2024
@fniephaus fniephaus moved this to In Progress in GraalPy Roadmap Oct 23, 2024
Projects
Status: In Progress