Change output of M2 groupby aggregation from a single double column into a structs column #9899

Draft: ttnghia wants to merge 1 commit into base: branch-22.02

Conversation

@ttnghia (Contributor) commented Dec 14, 2021

Currently, the M2 groupby aggregation outputs only a single column of type double. While computing the M2 values, the intermediate results, including the groupby count and mean, are discarded. However, merging M2 values requires the count and mean to be available. As a result, the Spark plugin has to re-compute these values, which is inefficient.

This PR addresses that issue: the intermediate count and mean values are now output together with the M2 values. In particular, the output of the M2 groupby aggregation is a structs column containing tuples of (count, mean, m2).
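
For reference, merging two partial results follows the standard pairwise update of Chan et al., which is exactly why each m2 value must travel with its count and mean. Below is a minimal sketch of that merge; the m2_partial type is hypothetical and not part of libcudf:

// Hypothetical sketch of the pairwise M2 merge (Chan et al.).
// Each partial result must carry its count and mean for the merge to work.
struct m2_partial {
  double count;  // number of rows in the group
  double mean;   // mean of the group
  double m2;     // sum of squared deviations from the mean
};

m2_partial merge_m2(m2_partial const& a, m2_partial const& b)
{
  double const n = a.count + b.count;
  if (n == 0) { return {0.0, 0.0, 0.0}; }  // both partials empty
  double const delta = b.mean - a.mean;
  return {n,
          a.mean + delta * (b.count / n),
          a.m2 + b.m2 + delta * delta * (a.count * b.count / n)};
}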

@ttnghia ttnghia added labels 3 - Ready for Review, libcudf, Performance, Spark, 5 - DO NOT MERGE, improvement, breaking on Dec 14, 2021
@ttnghia ttnghia self-assigned this Dec 14, 2021
@ttnghia ttnghia requested a review from a team as a code owner December 14, 2021 18:16
@ttnghia ttnghia requested a review from abellina December 14, 2021 19:53
@jrhemstad (Contributor) commented:

However, merging M2 values requires the count and mean to be available.

We shouldn't be returning aggregation results that aren't exactly the aggregation requested. If these aggregations are needed, then they should be requested when performing the groupby::aggregate.

The caching mechanisms will ensure that no aggregation is redundantly computed.
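
For illustration, requesting the companion aggregations in the same pass might look like the following sketch against the libcudf groupby API (keys and vals are placeholder variables, and exact usage may differ):

#include <cudf/aggregation.hpp>
#include <cudf/groupby.hpp>
#include <vector>

// keys: cudf::table_view of grouping keys; vals: cudf::column_view of values.
cudf::groupby::groupby gb(keys);

std::vector<cudf::groupby::aggregation_request> requests(1);
requests[0].values = vals;
requests[0].aggregations.push_back(cudf::make_count_aggregation<cudf::groupby_aggregation>());
requests[0].aggregations.push_back(cudf::make_mean_aggregation<cudf::groupby_aggregation>());
requests[0].aggregations.push_back(cudf::make_m2_aggregation<cudf::groupby_aggregation>());

// One result column per requested aggregation (count, mean, m2); the cache
// ensures shared intermediates are computed only once.
auto [group_keys, results] = gb.aggregate(requests);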

@@ -117,6 +117,14 @@ struct empty_column_constructor {
0, make_empty_column(type_to_id<offset_type>()), empty_like(values), 0, {});
}

if constexpr (k == aggregation::Kind::M2 || k == aggregation::Kind::MERGE_M2) {
std::vector<std::unique_ptr<column>> child_columns;
👍
I'm wondering whether this can be phrased differently, as you have suggested before:

auto begin = cudf::make_counting_transform_iterator(0, [](auto i){ return make_empty_column(type_id::FLOAT64); });
return make_structs_column(0, std::vector<std::unique_ptr<column>>(begin, begin + 3), 0, {});

@mythrocks (Contributor) left a comment:

LGTM. Minor nitpick.

@jrhemstad (Contributor) left a comment:

See my comment above.

@ttnghia (Contributor, Author) commented Jan 4, 2022

See my comment above.

Thanks. I'm going to work on the Spark plugin side to optimize the computation there. I'll get back and close this PR if the optimization can be done entirely there without cudf changes.

@github-actions bot commented Feb 3, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@ttnghia ttnghia marked this pull request as draft February 8, 2022 16:19
@github-actions bot commented May 9, 2022

This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

@vyasr (Contributor) commented Jul 13, 2022

@ttnghia were you able to resolve this on the Spark side, or is this still something that we need to add to libcudf?

@ttnghia (Contributor, Author) commented Jul 13, 2022

@ttnghia were you able to resolve this on the Spark side, or is this still something that we need to add to libcudf?

This needs to be checked against the spark-rapids plugin first, but I still don't have the bandwidth to work on it yet. I'll try again later.

Thanks.
