Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupBy(chunked-array) #9522

Merged
merged 20 commits into from
Nov 4, 2024
Merged
9 changes: 9 additions & 0 deletions doc/user-guide/groupby.rst
Original file line number Diff line number Diff line change
Expand Up @@ -294,6 +294,15 @@ is identical to
ds.resample(time=TimeResampler("ME"))


The :py:class:`groupers.UniqueGrouper` accepts an optional ``labels`` kwarg that is not present
in :py:meth:`DataArray.groupby` or :py:meth:`Dataset.groupby`.
Specifying ``labels`` is required when grouping by a lazy array type (e.g. dask or cubed).
The ``labels`` are used to construct the output coordinate (say for a reduction), and aggregations
will only be run over the specified labels.
You may use ``labels`` to also specify the ordering of groups to be used during iteration.
The order will be preserved in the output.


.. _groupby.multiple:

Grouping by multiple variables
Expand Down
17 changes: 8 additions & 9 deletions doc/whats-new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,14 +23,21 @@ New Features
~~~~~~~~~~~~
- Added :py:meth:`DataTree.persist` method (:issue:`9675`, :pull:`9682`).
By `Sam Levang <https://github.com/slevang>`_.
- Support lazy grouping by dask arrays, and allow specifying ordered groups with ``UniqueGrouper(labels=["a", "b", "c"])``
(:issue:`2852`, :issue:`757`).
By `Deepak Cherian <https://github.com/dcherian>`_.

Breaking changes
~~~~~~~~~~~~~~~~


Deprecations
~~~~~~~~~~~~

- Grouping by a chunked array (e.g. dask or cubed) currently eagerly loads that variable in to
memory. This behaviour is deprecated. If eager loading was intended, please load such arrays
manually using ``.load()`` or ``.compute()``. Else pass ``eagerly_compute_group=False``, and
provide expected group labels using the ``labels`` kwarg to a grouper object such as
:py:class:`grouper.UniqueGrouper` or :py:class:`grouper.BinGrouper`.

Bug fixes
~~~~~~~~~
Expand Down Expand Up @@ -91,14 +98,6 @@ New Features
(:issue:`9427`, :pull: `9428`).
By `Alfonso Ladino <https://github.com/aladinor>`_.

Breaking changes
~~~~~~~~~~~~~~~~


Deprecations
~~~~~~~~~~~~


Bug fixes
~~~~~~~~~

Expand Down
2 changes: 1 addition & 1 deletion xarray/core/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -1094,7 +1094,7 @@ def _resample(
f"Received {type(freq)} instead."
)

rgrouper = ResolvedGrouper(grouper, group, self)
rgrouper = ResolvedGrouper(grouper, group, self, eagerly_compute_group=False)

return resample_cls(
self,
Expand Down
20 changes: 18 additions & 2 deletions xarray/core/dataarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -6747,6 +6747,7 @@ def groupby(
*,
squeeze: Literal[False] = False,
restore_coord_dims: bool = False,
eagerly_compute_group: bool = True,
**groupers: Grouper,
) -> DataArrayGroupBy:
"""Returns a DataArrayGroupBy object for performing grouped operations.
Expand All @@ -6762,6 +6763,11 @@ def groupby(
restore_coord_dims : bool, default: False
If True, also restore the dimension order of multi-dimensional
coordinates.
eagerly_compute_group: bool
Whether to eagerly compute ``group`` when it is a chunked array.
This option is to maintain backwards compatibility. Set to False
to opt-in to future behaviour, where ``group`` is not automatically loaded
into memory.
**groupers : Mapping of str to Grouper or Resampler
Mapping of variable name to group by to :py:class:`Grouper` or :py:class:`Resampler` object.
One of ``group`` or ``groupers`` must be provided.
Expand Down Expand Up @@ -6876,7 +6882,9 @@ def groupby(
)

_validate_groupby_squeeze(squeeze)
rgroupers = _parse_group_and_groupers(self, group, groupers)
rgroupers = _parse_group_and_groupers(
self, group, groupers, eagerly_compute_group=eagerly_compute_group
)
return DataArrayGroupBy(self, rgroupers, restore_coord_dims=restore_coord_dims)

@_deprecate_positional_args("v2024.07.0")
Expand All @@ -6891,6 +6899,7 @@ def groupby_bins(
squeeze: Literal[False] = False,
restore_coord_dims: bool = False,
duplicates: Literal["raise", "drop"] = "raise",
eagerly_compute_group: bool = True,
) -> DataArrayGroupBy:
"""Returns a DataArrayGroupBy object for performing grouped operations.

Expand Down Expand Up @@ -6927,6 +6936,11 @@ def groupby_bins(
coordinates.
duplicates : {"raise", "drop"}, default: "raise"
If bin edges are not unique, raise ValueError or drop non-uniques.
eagerly_compute_group: bool
Whether to eagerly compute ``group`` when it is a chunked array.
This option is to maintain backwards compatibility. Set to False
to opt-in to future behaviour, where ``group`` is not automatically loaded
into memory.

Returns
-------
Expand Down Expand Up @@ -6964,7 +6978,9 @@ def groupby_bins(
precision=precision,
include_lowest=include_lowest,
)
rgrouper = ResolvedGrouper(grouper, group, self)
rgrouper = ResolvedGrouper(
grouper, group, self, eagerly_compute_group=eagerly_compute_group
)

return DataArrayGroupBy(
self,
Expand Down
20 changes: 18 additions & 2 deletions xarray/core/dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -10378,6 +10378,7 @@ def groupby(
*,
squeeze: Literal[False] = False,
restore_coord_dims: bool = False,
eagerly_compute_group: bool = True,
**groupers: Grouper,
) -> DatasetGroupBy:
"""Returns a DatasetGroupBy object for performing grouped operations.
Expand All @@ -10393,6 +10394,11 @@ def groupby(
restore_coord_dims : bool, default: False
If True, also restore the dimension order of multi-dimensional
coordinates.
eagerly_compute_group: bool
Whether to eagerly compute ``group`` when it is a chunked array.
This option is to maintain backwards compatibility. Set to False
to opt-in to future behaviour, where ``group`` is not automatically loaded
into memory.
**groupers : Mapping of str to Grouper or Resampler
Mapping of variable name to group by to :py:class:`Grouper` or :py:class:`Resampler` object.
One of ``group`` or ``groupers`` must be provided.
Expand Down Expand Up @@ -10475,7 +10481,9 @@ def groupby(
)

_validate_groupby_squeeze(squeeze)
rgroupers = _parse_group_and_groupers(self, group, groupers)
rgroupers = _parse_group_and_groupers(
self, group, groupers, eagerly_compute_group=eagerly_compute_group
)

return DatasetGroupBy(self, rgroupers, restore_coord_dims=restore_coord_dims)

Expand All @@ -10491,6 +10499,7 @@ def groupby_bins(
squeeze: Literal[False] = False,
restore_coord_dims: bool = False,
duplicates: Literal["raise", "drop"] = "raise",
eagerly_compute_group: bool = True,
) -> DatasetGroupBy:
"""Returns a DatasetGroupBy object for performing grouped operations.

Expand Down Expand Up @@ -10527,6 +10536,11 @@ def groupby_bins(
coordinates.
duplicates : {"raise", "drop"}, default: "raise"
If bin edges are not unique, raise ValueError or drop non-uniques.
eagerly_compute_group: bool
Whether to eagerly compute ``group`` when it is a chunked array.
This option is to maintain backwards compatibility. Set to False
to opt-in to future behaviour, where ``group`` is not automatically loaded
into memory.

Returns
-------
Expand Down Expand Up @@ -10564,7 +10578,9 @@ def groupby_bins(
precision=precision,
include_lowest=include_lowest,
)
rgrouper = ResolvedGrouper(grouper, group, self)
rgrouper = ResolvedGrouper(
grouper, group, self, eagerly_compute_group=eagerly_compute_group
)

return DatasetGroupBy(
self,
Expand Down
Loading
Loading