Change output of M2 groupby aggregation from a single double column into a structs column #9899
base: branch-22.02
Conversation
We shouldn't be returning aggregation results that aren't exactly the aggregation requested. If these aggregations are needed, then they should be requested when performing the groupby. The caching mechanisms will ensure that no aggregation is redundantly computed.
@@ -117,6 +117,14 @@ struct empty_column_constructor {
      0, make_empty_column(type_to_id<offset_type>()), empty_like(values), 0, {});
  }

  if constexpr (k == aggregation::Kind::M2 || k == aggregation::Kind::MERGE_M2) {
    std::vector<std::unique_ptr<column>> child_columns;
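For context on what this branch must produce: even an empty M2 result has to carry the full child schema of the structs column, so that downstream code sees the (count, mean, m2) layout. A minimal plain-C++ sketch of that idea, using an illustrative stand-in for a column rather than cudf's actual types (the `Column` struct and `make_empty_m2_children` name here are hypothetical):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-in for a column: just a name and a row count.
// The real code would use cudf columns created via make_empty_column.
struct Column {
  std::string name;
  std::size_t size = 0;
};

// An empty M2 structs result still carries its child schema:
// (count, mean, m2), each child with zero rows.
std::vector<Column> make_empty_m2_children() {
  return {{"count", 0}, {"mean", 0}, {"m2", 0}};
}
```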
👍
I'm wondering whether this can be phrased differently, as you have suggested before:
auto begin = cudf::make_counting_transform_iterator(0, [](auto i){ return make_empty_column(FLOAT64); });
return make_structs_column(0, std::vector<std::unique_ptr<column>>(begin, begin+4), 0, {});
LGTM. Minor nitpick.
See my comment above.
Thanks. I'm going to work on the Spark plugin side, trying to optimize the computation there. I'll get back and close this if the optimization can be done entirely there without cudf work.
This PR has been labeled
@ttnghia were you able to resolve this on the Spark side, or is this still something that we need to add to libcudf?
This needs to be checked with the spark-rapids plugin first, but I still don't have the bandwidth to work on it yet. I'll try again some time later. Thanks.
Currently, the `M2` groupby aggregation only outputs a single double-type column. While computing the `m2` values, the intermediate results, including the groupby `count` and `mean`, are all discarded. However, to merge these `m2` values we must have the `count` and `mean` values available. As such, in the Spark plugin we have to re-compute these values, which is inefficient.

This PR addresses that issue: the intermediate values `count` and `mean` are now output together with the `m2` values. In particular, the output of the `M2` groupby aggregation is now a structs column containing tuples of (`count`, `mean`, `m2`).