ENH: option for efficient memory use by downcasting #275
Thanks for the suggestion @dkapitan, interesting ideas. What's the rationale for not doing that every time? I haven't used the new int dtype much; my understanding is that categoricals are generally dominant over numpy object strings, since they're basically dictionary encoding. Either way, I would definitely be keen to add these as options. Explicit > implicit, and I think that having the library send additional queries behind the scenes and self-optimize would be too much magic, at least to start. What are your thoughts?
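(For reference, a minimal illustration of the dictionary-encoding point, with made-up data: a low-cardinality string column stored as numpy objects vs. as a pandas Categorical.)

```python
import numpy as np
import pandas as pd

# Low-cardinality string column: numpy object strings vs. pandas Categorical.
# A Categorical is essentially dictionary encoding: small integer codes plus
# a lookup table of the unique values.
countries = pd.Series(np.random.choice(["NL", "US", "DE"], size=100_000))

as_object = countries.memory_usage(deep=True)
as_category = countries.astype("category").memory_usage(deep=True)
print(f"object:   {as_object / 1e6:.2f} MB")    # several MB
print(f"category: {as_category / 1e6:.2f} MB")  # a small fraction of that
```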
@max-sixty

Enhancements

Design considerations and focus for the first version

Feel free to add/modify so we have a clear goal for the first iteration. I have started with the how-to for contributors; my fork is at https://github.com/dkapitan/pandas-gbq. Will let you know when I have done my first commit.
I agree, especially since the BigQuery team doesn't support querying anonymous query results tables. I propose we add a
Oops, I should read more closely; I think I just proposed your user story. :-)
Thanks for the clear plan @dkapitan (I've just returned from a conference, so thanks in advance for your patience). Can we clarify whether the underlying data influences whether we'd want these, e.g. why not always use them? If this is something that doesn't depend on the underlying data, we could skip the work behind
The blog post recommends using smaller types where possible. The way this is implemented in
First working implementation with the following additions:

- `dtypes` option added to `read_gbq`, including tests in `system.test_gbq.py`
- `gbq.optimze_dtypes` methods for determining optimal pandas/numpy dtypes, including helper functions in `schema.py`
- add `project_id=None` option in `GbqConnector.schema` to allow reflection of tables in other projects
So the good news is that I have got a working version, see this commit. Basic testing seems OK (haven't written all the tests yet), but when actually converting the query results to a dataframe I'm getting weird results. It seems that, depending on the size of the QueryJob.results, the conversion from STRING to category fails; it only works for small result sets. See attached .ipynb. I have tried to trace this bug in the source code, but can't find anything. Your help would be much appreciated. Then, regarding your questions:
Finally, on a practical note: I don't have much experience using Git with branches (I've only used basic Git stuff). I'm afraid I've made a mess of my branch already (too many merges and rebases whilst trying to figure out how it works). Please let me know if this is OK (assuming it can be fixed during a PR); if not, I will start with a clean fork. Curious to hear your thoughts.
Probably because the downcasting happens for each page, so if there's a page without all the possible categories, the concat will fail.
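(A minimal sketch of that failure mode, with hypothetical per-page data: when pages are converted to category independently and their category sets differ, the concatenated column silently loses the category dtype; `union_categoricals` is one way to reconcile the pages first.)

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two "pages" of results, each downcast to category independently.
page1 = pd.DataFrame({"color": pd.Categorical(["red", "red", "blue"])})
page2 = pd.DataFrame({"color": pd.Categorical(["green", "green"])})

# The category sets differ, so the concatenated column falls back to
# object dtype instead of staying categorical.
combined = pd.concat([page1, page2], ignore_index=True)
print(combined["color"].dtype)  # object, not category

# Reconciling the categories across pages before concatenating avoids this.
union = union_categoricals([page1["color"], page2["color"]])
print(union.dtype)  # category, with the union of all categories
```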
I can probably clean it up if you open the PR and check the "allow edits from maintainers" box. Don't worry about it too much.
Could you file an issue to
Great @dkapitan. My experience is that pandas doesn't deal that well with non-standard types (e.g. smaller floats are upcast by some operations). Combined with our current parsing, I'm not sure it'll save much memory, consistent with @tswast's thoughts above. If there's any tradeoff in prioritization, @dkapitan's suggestions around using 1) nullable ints and 2) categoricals for strings would be the immediately impactful ones.
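(One concrete example of the kind of upcasting being referred to, though not necessarily the exact operations meant here: introducing a missing value into a downcast integer column forces it back to float64.)

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int8")
print(s.dtype)   # int8

# Introducing a missing value (here via reindexing) silently upcasts the
# column to float64, undoing the memory savings from downcasting.
s2 = s.reindex([0, 1, 2, 3])
print(s2.dtype)  # float64
```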
💯
Thanks for all the feedback @max-sixty @tswast. I am not sure how to proceed from here. Below are some extra tests, thoughts and questions; please let me know what you think is best to do next.

Persistence of downcasted types
I have done a simple test (see below), and for now I'd say we stick to the following:
Peak memory consumption

[chart: memory usage with optimized dataframe (df_optimized, 0.8 MB)]
[chart: memory usage with non-optimized dataframe (df, 4.6 MB)]
STRING -> category conversion
Interesting: I have played around a bit more with loading the data via a few different routes, the last being `pd.read_parquet()`, using the gzip-imported dataframe, exported as parquet, as the source file. Note that:
So following Wes McKinney's long-term plan, using parquet as columnar data middleware seems the way to go. That all depends on adding a parquet export function to BigQuery, including efficient type casting from BigQuery's generic
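(A small sketch of the persistence point, assuming pyarrow is installed and using made-up columns: downcast dtypes survive a parquet round trip.)

```python
import pandas as pd

df_optimized = pd.DataFrame(
    {
        "user_id": pd.Series([1, 2, 3], dtype="int8"),
        "country": pd.Series(["NL", "NL", "US"], dtype="category"),
    }
)

# Parquet stores the logical types plus pandas metadata, so the downcast
# dtypes are restored on read.
df_optimized.to_parquet("optimized.parquet")
print(pd.read_parquet("optimized.parquet").dtypes)
# user_id        int8
# country    category
```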
Thanks for the detailed graphs! I bet we could decrease the overshoot in the case of downloading the whole table / query results with
Just wanted to check whether it's useful to submit a PR with the current half-working version. I expect it may take some time before BigQuery GH8044 is solved, and I would like to prevent the branch from running too far out of sync. Please let me know what's best.
Definitely open to adding that. I'm not too worried about the extra method for finding the minimal schema getting out of sync, since there are fewer chances of git conflicts in that case. We probably want to wait to add it until the issues in
Thanks for your hard work and patience @dkapitan. Those are superb graphs and exactly the right foundation for improving performance. In the short term: agree with @tswast, `dtypes` would definitely be a great addition. In the longer term, I think there are a couple of paths we can take:
I think the latter is likely to be a fuller solution: it will be more performant and less likely to have edge cases. It somewhat depends on either a) a good Avro parser that can build numpy-like arrays, or b) alternative formats to Avro being available from the bq_storage API. If there are easy changes to make on the former, that's a clear win, and those charts suggest there are some moderate improvements we can make...
@max-sixty @tswast
Thanks @dkapitan!
@tswast @max-sixty Given pyarrow, I am inclined to start from scratch and rewrite the downcasting feature specifically (only?) for the Arrow format, for now anyway. I expect the memory overshoots to be solved, and I am hoping that downcasting is a lot more straightforward since there are no intermediate parsing steps in combining the data. What do you think would be a good way to move forward with this issue?
In fact, looking at https://arrow.apache.org/docs/python/pandas.html#arrow-pandas-conversion, it all boils down to how the GBQ Storage API does the type conversion from GBQ to Arrow. If Int64 is already downcast to the smallest int type, then this issue is solved. If not, the proposed functionality in this issue could be implemented by post-processing the Arrow table prior to the Arrow-to-pandas conversion. My guess is that the downcasting isn't implemented in the GBQ -> Arrow conversion (I haven't looked at the source code yet).
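(A rough sketch of what that post-processing could look like with pyarrow; the column names, types and choice of target dtypes are illustrative assumptions, not what the Storage API actually returns.)

```python
import pyarrow as pa
import pyarrow.compute as pc

# Stand-in for the Arrow table returned by the BigQuery Storage API,
# which would use 64-bit integers and plain strings.
table = pa.table(
    {
        "user_id": pa.array([1, 2, 3], type=pa.int64()),
        "country": pa.array(["NL", "NL", "US"], type=pa.string()),
    }
)

# Post-process the Arrow table before converting it to pandas:
# downcast the integer column and dictionary-encode the string column.
table = table.set_column(0, "user_id", pc.cast(table["user_id"], pa.int8()))
table = table.set_column(1, "country", pc.dictionary_encode(table["country"]))

df = table.to_pandas()
print(df.dtypes)  # user_id -> int8, country -> category
```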
This seems relevant to this discussion, as to how to handle strings: |
This issue has not been active since 2019. |
See #621 for some more recent thoughts related to this. In that feature, I propose
We use pandas-gbq a lot for our daily analyses. It is known that memory consumption can be a pain, see e.g. https://www.dataquest.io/blog/pandas-big-data/
I have started to write a patch, which could be integrated into an enhancement for `read_gbq` (rough idea, details TBD):

- when the `optimize_memory` option is `True`, the source table is inspected with a query to get the min, max and presence of NULLs, and the percentage of unique strings, for INTEGER and STRING columns respectively
- in `to_dataframe`, this information is passed to the `dtypes` option, downcasting integers to the appropriate numpy (u)int type and converting strings to the pandas `category` type at some threshold (less than 50% unique values); see the rough sketch below

I already have a working monkey-patch, which is still a bit rough. If there is enough interest I'd happily make it more robust and submit a PR. This would be my first significant contribution to an open source project, so some help and feedback would be appreciated.
Curious to hear your views on this.
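(A rough sketch of the kind of dtype-selection logic described above; the helper names, the NULL fallback to float64 and the default 50% threshold are illustrative assumptions, not the actual patch.)

```python
import numpy as np


def choose_int_dtype(min_value, max_value, has_nulls):
    """Pick the smallest numpy (u)int dtype covering [min_value, max_value].

    Plain numpy integers cannot hold NULLs, so columns with NULLs fall back
    to float64 here (pandas' nullable Int64 would be another option).
    """
    if has_nulls:
        return np.float64
    candidates = (
        [np.uint8, np.uint16, np.uint32, np.uint64]
        if min_value >= 0
        else [np.int8, np.int16, np.int32, np.int64]
    )
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= min_value and max_value <= info.max:
            return dtype
    return np.int64


def choose_string_dtype(distinct_ratio, threshold=0.5):
    """Use 'category' when fewer than `threshold` of the values are unique."""
    return "category" if distinct_ratio < threshold else object


# e.g. an INTEGER column with min=0, max=42_000 and no NULLs -> uint16;
# a STRING column where 3% of the values are unique -> category
print(choose_int_dtype(0, 42_000, has_nulls=False))
print(choose_string_dtype(0.03))
```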