[ENH] improve performance for polars' pivot_longer #1377

Merged
merged 31 commits into from
Jul 4, 2024

Conversation

samukweku
Collaborator

@samukweku samukweku commented Jun 18, 2024

PR Description

Please describe the changes proposed in the pull request:

  • improve performance of pivot_longer; some cases can be 3x faster
  • use native polars methods as much as possible
  • use an implode/explode approach: work on a small set of data and blow it up only at the end (good performance benefits)
  • for LazyFrames, avoid .collect where possible and stay lazy for as long as possible

This PR relates to #1352 .

Performance comparison (YMMV):

import polars as pl
import janitor.polars

evv = pl.read_csv('../evv.csv')
evv.shape
(30000, 801)
# dev
%timeit evv.janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep='_')
1.5 s ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
3 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_").collect()
5.94 s ± 24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR
%timeit evv.janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
225 ms ± 8.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_")
1.58 ms ± 4.36 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit evv.lazy().janitor.pivot_longer(index='country', names_to = ['event','year','gender','num'], names_transform=pl.col('year').cast(int),names_sep="_").collect()
263 ms ± 8.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
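
For a quick read of the numbers above, the sketch below computes the dev-to-PR ratios from the mean timings in the listing. The labels are made up for this example. The "plan only" ratio presumably reflects that the lazy call no longer collects internally, so it measures query-plan construction rather than the actual reshaping work.

```python
# Mean timings copied from the listing above, converted to seconds: (dev, this PR).
timings = {
    "eager DataFrame": (1.5, 225e-3),
    "LazyFrame, plan only": (3.0, 1.58e-3),
    "LazyFrame + collect": (5.94, 263e-3),
}

# Ratio of dev time to PR time for each case.
speedups = {label: dev / pr for label, (dev, pr) in timings.items()}

for label, ratio in speedups.items():
    print(f"{label}: {ratio:.1f}x faster")
```

On this dataset the eager case comes out at roughly 6.7x and the collected lazy case at roughly 22.6x.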

@samukweku samukweku self-assigned this Jun 18, 2024
@samukweku samukweku requested review from ericmjl, hectormz, thatlittleboy and a team June 18, 2024 03:06
@ericmjl
Member

ericmjl commented Jun 18, 2024

codecov bot commented Jun 20, 2024

Codecov Report

Attention: Patch coverage is 95.74468% with 4 lines in your changes missing coverage. Please review.

Project coverage is 88.96%. Comparing base (62c57c6) to head (6a5f66e).
Report is 27 commits behind head on dev.

Current head 6a5f66e differs from pull request most recent head 1fc553e

Please upload reports for the commit 1fc553e to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1377      +/-   ##
==========================================
- Coverage   94.48%   88.96%   -5.52%     
==========================================
  Files          80       86       +6     
  Lines        4367     5058     +691     
==========================================
+ Hits         4126     4500     +374     
- Misses        241      558     +317     

Member

@ericmjl ericmjl left a comment

Nice work, @samukweku! Please feel free to merge when ready!

@samukweku samukweku merged commit d0c2544 into dev Jul 4, 2024
4 checks passed
@samukweku samukweku deleted the samukweku/polars_pivot_longer_improve branch July 4, 2024 20:59
@samukweku
Collaborator Author

samukweku commented Jul 4, 2024

@ericmjl Ok to do a release?
