Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: fast path to generate group idxs for vanilla int_range in group_by_dynamic #19932

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

jvdd
Copy link
Contributor

@jvdd jvdd commented Nov 22, 2024

Performance optimization of group_by_dynamic when passing a vanillaint_range (i.e., start=0, step=1) as index_column.
If we know that the index column for the dynamic group by is an int range, we can generate the group indices (as there is a fixed step between the index values) and can thus avoid the slow group_by_windows function.

Any feedback to improve this PR is welcome :)

Further enhancements (that could? be done):

  • support non-vanilla int-ranges (i.e., step != 1 and/or start !=0)
  • do not materialize the int_range (as indices can be generated)

Using the updated code, I observe ~10x performance improvements on my machine

Expr::Function {
input, function, ..
} => {
matches!(function, FunctionExpr::Range(f) if f.to_string() == "int_range")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit awkward that I have to check against the .to_string method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have to:

 matches!(
        function,
        FunctionExpr::Range(RangeFunction::IntRange { .. })
    );

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do this check during the IR::conversion, in resolve_groupby here:

} else if let Some(options) = _options.dynamic.as_ref() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I am just unsure how I should check in resolve_group_by whether the index of the group_by_dynamic is an int_range. Can I get the expression for this index (conveniently) from the expr_arena?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. I see that we first do a with_columns here. We should store the index column on line 1155 and do the with_column rewrite during IR conversion.

I understand it's a bit more than you anticipated. I can do the pre-work for that later if you like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this involves passing the index_column Expr from the lazy_frame.group_by_dynamic method to the dsl_to_ir functions? However, the best approach to do this is not clear to me..

Would thus be very helpful if you could do the pre-work for this! :)

@@ -34,6 +34,7 @@ pub struct DynamicGroupOptions {
pub include_boundaries: bool,
pub closed_window: ClosedWindow,
pub start_by: StartBy,
pub int_range: bool, // Whether the index column is a range column
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be private (as we do not want users to set this "flag")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't.

We can add a new DynamicGroupOptionsIR where we add private flags set by the IR.

Copy link

codecov bot commented Nov 23, 2024

Codecov Report

Attention: Patch coverage is 17.54386% with 94 lines in your changes missing coverage. Please review.

Project coverage is 79.47%. Comparing base (414d883) to head (f98b8fc).
Report is 37 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-time/src/windows/group_by.rs 0.00% 81 Missing ⚠️
crates/polars-time/src/group_by/dynamic.rs 60.00% 10 Missing ⚠️
crates/polars-lazy/src/frame/mod.rs 62.50% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #19932       +/-   ##
===========================================
+ Coverage   59.28%   79.47%   +20.18%     
===========================================
  Files        1555     1562        +7     
  Lines      216180   217047      +867     
  Branches     2456     2459        +3     
===========================================
+ Hits       128155   172488    +44333     
+ Misses      87467    44000    -43467     
- Partials      558      559        +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! I've left some comments.

Expr::Function {
input, function, ..
} => {
matches!(function, FunctionExpr::Range(f) if f.to_string() == "int_range")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have to:

 matches!(
        function,
        FunctionExpr::Range(RangeFunction::IntRange { .. })
    );

Expr::Function {
input, function, ..
} => {
matches!(function, FunctionExpr::Range(f) if f.to_string() == "int_range")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do this check during the IR::conversion, in resolve_groupby here:

} else if let Some(options) = _options.dynamic.as_ref() {

@@ -34,6 +34,7 @@ pub struct DynamicGroupOptions {
pub include_boundaries: bool,
pub closed_window: ClosedWindow,
pub start_by: StartBy,
pub int_range: bool, // Whether the index column is a range column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't.

We can add a new DynamicGroupOptionsIR where we add private flags set by the IR.

options.start_by,
);

let vanilla_start_step = (ts[0] == 0) && (ts[1] - ts[0] == 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to a separate function so that implementation is separate from dispatch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the logic to a separate function in the group_by.rs file

crates/polars-time/src/group_by/dynamic.rs Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants