Change output of M2 groupby aggregation from a single double column into a structs column #9899

Draft: ttnghia wants to merge 1 commit into base: branch-22.02

Conversation

@ttnghia (Contributor) commented Dec 14, 2021

Currently, the M2 groupby aggregation outputs only a single column of type double. While computing the M2 values, the intermediate results, including the groupby count and mean, are discarded. However, merging M2 values requires the count and mean to be available. As a result, the Spark plugin has to re-compute these values, which is inefficient.

This PR addresses that issue: the intermediate count and mean values are now output together with the M2 values. In particular, the output of the M2 groupby aggregation is a structs column containing tuples of (count, mean, m2).
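
For reference, merging two partial results follows the standard pairwise update of Chan et al., which is exactly why each m2 value must travel with its count and mean. Below is a minimal sketch of that merge; the m2_partial type is hypothetical and not part of libcudf:

// Hypothetical sketch of the pairwise M2 merge (Chan et al.).
// Each partial result must carry its count and mean for the merge to work.
struct m2_partial {
  double count;  // number of rows in the group
  double mean;   // mean of the group
  double m2;     // sum of squared deviations from the mean
};

m2_partial merge_m2(m2_partial const& a, m2_partial const& b)
{
  double const n = a.count + b.count;
  if (n == 0) { return {0.0, 0.0, 0.0}; }  // both partials empty
  double const delta = b.mean - a.mean;
  return {n,
          a.mean + delta * (b.count / n),
          a.m2 + b.m2 + delta * delta * (a.count * b.count / n)};
}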

@ttnghia ttnghia added labels 3 - Ready for Review, libcudf, Performance, Spark, 5 - DO NOT MERGE, improvement, breaking on Dec 14, 2021
@ttnghia ttnghia self-assigned this Dec 14, 2021
@ttnghia ttnghia requested a review from a team as a code owner December 14, 2021 18:16
@ttnghia ttnghia requested a review from abellina December 14, 2021 19:53
@jrhemstad (Contributor) commented:

However, merging M2 values requires the count and mean to be available.

We shouldn't be returning aggregation results that aren't exactly the aggregation requested. If these aggregations are needed, then they should be requested when performing the groupby::aggregate.

The caching mechanisms will ensure that no aggregation is redundantly computed.
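
For illustration, requesting the companion aggregations in the same pass might look like the following sketch against the libcudf groupby API (keys and vals are placeholder variables, and exact usage may differ):

#include <cudf/aggregation.hpp>
#include <cudf/groupby.hpp>
#include <vector>

// keys: cudf::table_view of grouping keys; vals: cudf::column_view of values.
cudf::groupby::groupby gb(keys);

std::vector<cudf::groupby::aggregation_request> requests(1);
requests[0].values = vals;
requests[0].aggregations.push_back(cudf::make_count_aggregation<cudf::groupby_aggregation>());
requests[0].aggregations.push_back(cudf::make_mean_aggregation<cudf::groupby_aggregation>());
requests[0].aggregations.push_back(cudf::make_m2_aggregation<cudf::groupby_aggregation>());

// One result column per requested aggregation (count, mean, m2); the cache
// ensures shared intermediates are computed only once.
auto [group_keys, results] = gb.aggregate(requests);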

@@ -117,6 +117,14 @@ struct empty_column_constructor {
0, make_empty_column(type_to_id<offset_type>()), empty_like(values), 0, {});
}

if constexpr (k == aggregation::Kind::M2 || k == aggregation::Kind::MERGE_M2) {
std::vector<std::unique_ptr<column>> child_columns;
👍
I'm wondering whether this can be phrased differently, as you have suggested before:

auto begin = cudf::make_counting_transform_iterator(0, [](auto i){ return make_empty_column(type_id::FLOAT64); });
return make_structs_column(0, std::vector<std::unique_ptr<column>>(begin, begin + 3), 0, {});

@mythrocks (Contributor) left a comment:

LGTM. Minor nitpick.

@jrhemstad (Contributor) left a comment:

See my comment above.

@ttnghia (Contributor, Author) commented Jan 4, 2022

See my comment above.

Thanks. I'm going to work on the Spark plugin side to optimize the computation there. I'll get back and close this PR if the optimization can be done entirely there without cudf changes.

@github-actions bot commented Feb 3, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@ttnghia ttnghia marked this pull request as draft February 8, 2022 16:19
@github-actions bot commented May 9, 2022

This PR has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates.

@vyasr (Contributor) commented Jul 13, 2022

@ttnghia were you able to resolve this on the Spark side, or is this still something that we need to add to libcudf?

@ttnghia (Contributor, Author) commented Jul 13, 2022

@ttnghia were you able to resolve this on the Spark side, or is this still something that we need to add to libcudf?

This needs to be checked against the spark-rapids plugin first, but I still don't have the bandwidth to work on it yet. I'll try again later.

Thanks.
