Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed ArrowRecordBatchAdapter to replace ParquetReader with read_from_memory_tables=True #297

Open
ptomecek opened this issue Jun 25, 2024 · 0 comments
Labels
adapter: parquet Issues and PRs related to our Apache Parquet/Arrow adapter

Comments

@ptomecek
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
The ability to stream a sequence of in-memory arrow tables into csp is very powerful, but currently a bit hidden within the ParquetReader implementation (by setting read_from_memory_tables=True), making it hard for users to find. It can also be a bit tricky to use properly as many of the arguments to the parquet reader are ignored in this mode.

Describe the solution you'd like
A dedicated pull adapter that consumes a serquence of arrow record batches (or tables) as efficiently as possible. i.e. ArrowRecordBatchAdapter.
An initial implementation could just delegate to the ParquetReader for implementation.

Also, the solution should ideally be zero-copy on the underlying arrow tables (I am not positive this is the case at the moment).

Describe alternatives you've considered
Continuing to use the parquet reader is confusing and non-intuitive, especially for cases where the arrow tables come from other sources.

Additional context
Other features to consider

  • Being able to convert arrow lists as numpy arrays (like the current parquet reader)
  • Being able to convert arrow structs as csp structs
  • An option such at that each timestamp, it ticks an arrow record batches containing all the records with that timestamp (in cases where we want to minimize any copy overhead and just operate on arrow types directly). This is essentially a "group-by" on the timestamp column.

Note that since there are a number of other tools that can efficiently and flexibly produce sequences of arrow tables from different sources (polars, duckdb, arrow-odbc, ray), having a generic ArrowRecordBatchAdapter will allow an even greater number of historical data connections with very little additional effort.

@ptomecek ptomecek added the adapter: parquet Issues and PRs related to our Apache Parquet/Arrow adapter label Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adapter: parquet Issues and PRs related to our Apache Parquet/Arrow adapter
Projects
None yet
Development

No branches or pull requests

1 participant