
Add support for Polars dataframes #461

Open
DeflateAwning opened this issue May 13, 2024 · 2 comments

@DeflateAwning

Polars dataframes are way faster and more memory efficient than Pandas dataframes, and they have a more ergonomic interface for transformations.

At some point, you may want to switch the backend to Polars. At least for now, I think it makes sense to make a function that returns the result as a Polars dataframe without first going through Pandas (assuming the dataframe's current construction technique allows for it).
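As a rough illustration of that idea, here is a minimal sketch of parsing downloaded CSV bytes straight into a Polars dataframe with no Pandas step in between. The helper name `csv_bytes_to_polars` is hypothetical and not part of this library; it assumes the raw data is already available as CSV bytes.

```python
import io


def csv_bytes_to_polars(csv_bytes: bytes):
    """Hypothetical helper: parse raw CSV bytes directly into a Polars DataFrame."""
    # Imported lazily so that Polars stays an optional dependency.
    import polars as pl

    # pl.read_csv accepts a file-like object, so no temporary file is needed.
    return pl.read_csv(io.BytesIO(csv_bytes))
```

Whether this is feasible depends on how the library currently builds its dataframes, which this sketch does not assume anything about.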

Fantastic library though! Very excited to check it out further.

@ianepreston
Owner

@DeflateAwning, I've been thinking about refactoring this project more significantly to make the dataframe layer an optional extension, with the core package relying only on querying the REST API and downloading files. I'm not a Polars user, but in my professional life I would benefit from this library reading directly from CSV into a Spark dataframe, and the alterations that would allow that would permit a Polars extension as well. The first step is adding deprecation warnings to the existing parts of the code base that depend on pandas, and adding some pandas-specific functions so that users of the existing setup can transition. I'm not sure when I'll have time to do all that, and I'll need to allow some adoption time to pass before I rip out features, so this will not be a quick change, but I support the direction.
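To make the layering described above concrete, here is a minimal sketch, not the project's actual code. All names (`fetch_table_csv`, `table_to_df`) and the placeholder CSV text are hypothetical; the REST call is replaced by a stand-in string so the example is self-contained.

```python
import io
import warnings


def fetch_table_csv(table_id: str) -> str:
    """Hypothetical core function: returns raw CSV text with no dataframe dependency."""
    # Stand-in for the real REST query and file download (placeholder data).
    return "REF_DATE,VALUE\n2024-01,1.5\n2024-02,2.5\n"


def table_to_df(table_id: str):
    """Hypothetical legacy entry point that still returns a pandas DataFrame."""
    # Warn first, so existing callers learn about the transition path.
    warnings.warn(
        "table_to_df is deprecated; use the pandas-specific extension instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # pandas is imported lazily, so the core package does not require it.
    import pandas as pd

    return pd.read_csv(io.StringIO(fetch_table_csv(table_id)))
```

With this shape, a Spark or Polars extension would only need to consume what `fetch_table_csv` returns, and the pandas path can be removed once the deprecation period has passed.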

@DeflateAwning
Author

Awesome, exciting news with all that! Looking forward to seeing the direction this all goes!

Depending on what the API responses look like (e.g., if they're table partitions or similar), Polars would be a great choice for holding the intermediate data, and it's a great tool for writing that data to disk as CSV, Parquet, or other formats. It's way more lightweight than Spark, and it's way more performant than Pandas. It also converts efficiently to each of those dataframe types, which means solid interop with Pandas and Spark.
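A small sketch of the on-disk conversion mentioned above, using placeholder data and a hypothetical function name (`persist_table`); it only shows the Polars write and read calls, not anything from this library.

```python
from pathlib import Path


def persist_table(out_dir: str):
    """Hypothetical example: write a small Polars frame to CSV and Parquet."""
    import polars as pl  # optional dependency, imported lazily

    # Placeholder data standing in for a parsed API response.
    df = pl.DataFrame({"ref_date": ["2024-01", "2024-02"], "value": [1.5, 2.5]})

    out = Path(out_dir)
    df.write_csv(out / "table.csv")
    df.write_parquet(out / "table.parquet")

    # Read the Parquet file back to confirm the round trip.
    # df.to_pandas() would hand the same data to pandas (it needs pyarrow installed).
    return pl.read_parquet(out / "table.parquet")
```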
