-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: fast path to generate group idxs for vanilla int_range in group_by_dynamic #19932
base: main
Are you sure you want to change the base?
Conversation
Expr::Function { | ||
input, function, .. | ||
} => { | ||
matches!(function, FunctionExpr::Range(f) if f.to_string() == "int_range") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit awkward that I have to check against the .to_string
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't have to:
matches!(
function,
FunctionExpr::Range(RangeFunction::IntRange { .. })
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do this check during the IR::conversion, in resolve_groupby
here:
} else if let Some(options) = _options.dynamic.as_ref() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I am just unsure how I should check in resolve_group_by
whether the index of the group_by_dynamic is an int_range. Can I get the expression for this index (conveniently) from the expr_arena
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. I see that we first do a with_columns
here. We should store the index
column on line 1155 and do the with_column
rewrite during IR conversion.
I understand it's a bit more than you anticipated. I can do the pre-work for that later if you like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this involves passing the index_column Expr
from the lazy_frame.group_by_dynamic
method to the dsl_to_ir functions? However, the best approach to do this is not clear to me..
Would thus be very helpful if you could do the pre-work for this! :)
@@ -34,6 +34,7 @@ pub struct DynamicGroupOptions { | |||
pub include_boundaries: bool, | |||
pub closed_window: ClosedWindow, | |||
pub start_by: StartBy, | |||
pub int_range: bool, // Whether the index column is a range column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should be private (as we do not want users to set this "flag")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't.
We can add a new DynamicGroupOptionsIR
where we add private flags set by the IR.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #19932 +/- ##
===========================================
+ Coverage 59.28% 79.47% +20.18%
===========================================
Files 1555 1562 +7
Lines 216180 217047 +867
Branches 2456 2459 +3
===========================================
+ Hits 128155 172488 +44333
+ Misses 87467 44000 -43467
- Partials 558 559 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find! I've left some comments.
Expr::Function { | ||
input, function, .. | ||
} => { | ||
matches!(function, FunctionExpr::Range(f) if f.to_string() == "int_range") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't have to:
matches!(
function,
FunctionExpr::Range(RangeFunction::IntRange { .. })
);
Expr::Function { | ||
input, function, .. | ||
} => { | ||
matches!(function, FunctionExpr::Range(f) if f.to_string() == "int_range") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do this check during the IR::conversion, in resolve_groupby
here:
} else if let Some(options) = _options.dynamic.as_ref() { |
@@ -34,6 +34,7 @@ pub struct DynamicGroupOptions { | |||
pub include_boundaries: bool, | |||
pub closed_window: ClosedWindow, | |||
pub start_by: StartBy, | |||
pub int_range: bool, // Whether the index column is a range column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't.
We can add a new DynamicGroupOptionsIR
where we add private flags set by the IR.
options.start_by, | ||
); | ||
|
||
let vanilla_start_step = (ts[0] == 0) && (ts[1] - ts[0] == 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to a separate function so that implementation is separate from dispatch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the logic to a separate function in the group_by.rs file
Performance optimization of
group_by_dynamic
when passing a vanillaint_range
(i.e., start=0, step=1) asindex_column
.If we know that the index column for the dynamic group by is an int range, we can generate the group indices (as there is a fixed step between the index values) and can thus avoid the slow
group_by_windows
function.Any feedback to improve this PR is welcome :)
Further enhancements (that could? be done):
Using the updated code, I observe ~10x performance improvements on my machine