Add GroupBy.shuffle() #9320

Draft
wants to merge 40 commits into
base: main

dcherian (Contributor) commented on Aug 7, 2024

This adds a new API for shuffling an Xarray object. Shuffling sorts the data so that all members of a group end up in the same chunk; a single chunk may still contain multiple groups.

gb = ds.groupby(..)
shuffled: DatasetGroupBy = gb.shuffle()
shuffled.quantile()
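
To make the chunking guarantee concrete, here is a minimal pure-Python sketch of the idea: positions are gathered per group, and whole groups are packed into chunks so that no group straddles a chunk boundary. The helper name `shuffle_into_chunks` and the `target_chunk_size` parameter are hypothetical and exist only for illustration; this is not the implementation in xarray, dask, or flox.

```python
from collections import defaultdict


def shuffle_into_chunks(labels, target_chunk_size):
    """Group positions by label, then pack whole groups into chunks.

    Hypothetical helper for illustration; not part of xarray, dask, or flox.
    """
    # Collect the original positions of each group's members, in order.
    members = defaultdict(list)
    for position, label in enumerate(labels):
        members[label].append(position)

    chunks = []
    current = []
    for label in sorted(members):
        group = members[label]
        # Start a new chunk rather than split a group across two chunks.
        if current and len(current) + len(group) > target_chunk_size:
            chunks.append(current)
            current = []
        current.extend(group)
    if current:
        chunks.append(current)
    return chunks


labels = ["b", "a", "c", "a", "b", "c", "a"]
chunks = shuffle_into_chunks(labels, target_chunk_size=5)
print(chunks)  # [[1, 3, 6, 0, 4], [2, 5]]
```

Groups "a" and "b" share the first chunk, while "c" gets its own. In this sketch a group larger than `target_chunk_size` simply produces an oversized chunk, which is an assumption of the example rather than a statement about the PR.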

I've also added shuffle_by to DataArray and Dataset. This generalizes sortby, and lets you persist a shuffled Xarray object to disk.

ds.shuffle_by(grouper)
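
One way to read "generalizes sortby" is that sorting orders elements by their own values, while shuffling orders them by a group key derived from those values. The sketch below illustrates that reading in plain Python. The functions `shuffle_by` and `season` here are hypothetical stand-ins, not the methods added in this PR, and keeping the original order within each group is an assumption of the sketch.

```python
def shuffle_by(values, grouper):
    """Return positions reordered so that equal group keys are contiguous.

    Hypothetical stand-in for illustration; not the method proposed in this PR.
    """
    keys = [grouper(v) for v in values]
    # sorted() is stable, so members keep their original relative order.
    return sorted(range(len(values)), key=lambda i: keys[i])


def season(month):
    # Map a month number (1-12) to a three-letter season label.
    return ["DJF", "MAM", "JJA", "SON"][(month % 12) // 3]


months = [1, 4, 7, 10, 2, 5, 12]
order = shuffle_by(months, season)
shuffled = [months[i] for i in order]
print(shuffled)  # [1, 2, 12, 7, 4, 5, 10]

# With an identity grouper the result is an ordinary sort.
plain_sort = [months[i] for i in shuffle_by(months, lambda m: m)]
print(plain_sort)  # [1, 2, 4, 5, 7, 10, 12]
```

The season groups come out in alphabetical key order (DJF, JJA, MAM, SON) because the sketch sorts on the label strings; a real grouper would more likely sort on integer codes.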
  • Closes #xxxx
  • Tests added
    • needs test for .shuffle_by
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
  • wait on dask 2024.08.1
  • upstream flox release
  • add chunks to signature.

cc @phofl

dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 14, 2024
* main:
  Revise (pydata#9366)
  Fix rechunking to a frequency with empty bins. (pydata#9364)
  whats-new entry for dropping python 3.9 (pydata#9359)
  drop support for `python=3.9` (pydata#8937)
  Revise (pydata#9357)
  try to fix scheduled hypothesis test (pydata#9358)
dcherian changed the title from "Add GroupBy.shuffle()" to "Add GroupBy.shuffle(), DataArray.shuffle_by, Dataset.shuffle_by" on Aug 15, 2024
* main:
  Improve error message for missing coordinate index (pydata#9370)
  Add flaky to TestNetCDF4ViaDaskData (pydata#9373)
  Make chunk manager an option in `set_options` (pydata#9362)
  Revise (pydata#9371)
  Remove duplicate word from docs (pydata#9367)
  Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243)
* main:
  Adds copy parameter to __array__ for numpy 2.0 (pydata#9393)
  `numpy 2` compatibility in the `pydap` backend (pydata#9391)
  pyarrow dependency added to doc environment (pydata#9394)
  Extend padding functionalities (pydata#9353)
  refactor GroupBy internals (pydata#9389)
  Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274)
  passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377)
  Fix tests on big-endian systems (pydata#9380)
  Improve error message on `ds['x', 'y']` (pydata#9375)
* main:
  Accessibility: Add keyboard handling for XArray HTML view (pydata#9412)
  [pre-commit.ci] pre-commit autoupdate (pydata#9316)
  [skip-ci] Speed up docs build by limiting toctrees (pydata#9395)
  fix the failing `pre-commit.ci` runs (pydata#9411)
  Update benchmarks.yml (pydata#9406)
  GroupBy(multiple groupers) (pydata#9372)
  Encode/decode property tests use variables() (pydata#9401)
dcherian changed the title from "Add GroupBy.shuffle(), DataArray.shuffle_by, Dataset.shuffle_by" to "Add GroupBy.shuffle()" on Aug 30, 2024
dcherian (Contributor, Author)

@aulemahal using shuffle should massively improve your "Padded DOY Grouper" thing. Can you let us know what API would work for your use case? Trying it out would also be quite valuable.

* main: (29 commits)
  Release notes for v2024.09.0 (pydata#9480)
  Fix `DataTree.coords.__setitem__` by adding `DataTreeCoordinates` class (pydata#9451)
  Rename DataTree's "ds" and "data" to "dataset" (pydata#9476)
  Update DataTree repr to indicate inheritance (pydata#9470)
  Bump pypa/gh-action-pypi-publish in the actions group (pydata#9460)
  Repo checker (pydata#9450)
  Add days_in_year and decimal_year to dt accessor (pydata#9105)
  remove parent argument from DataTree.__init__ (pydata#9465)
  Fix inheritance in DataTree.copy() (pydata#9457)
  Implement `DataTree.__delitem__` (pydata#9453)
  Add ASV for datatree.from_dict (pydata#9459)
  Make the first argument in DataTree.from_dict positional only (pydata#9446)
  Fix typos across the code, doc and comments (pydata#9443)
  DataTree should not be "Generic" (pydata#9445)
  Disallow passing a DataArray as data into the DataTree constructor (pydata#9444)
  Support additional dtypes in `resample` (pydata#9413)
  Shallow copy parent and children in DataTree constructor (pydata#9297)
  Bump minimum versions for dependencies (pydata#9434)
  Always include at least one category in random test data (pydata#9436)
  Avoid deep-copy when constructing groupby codes (pydata#9429)
  ...
* main:
  Opt out of floor division for float dtype time encoding (pydata#9497)
  fixed formatting for whats-new (pydata#9493)
  Forbid modifying names of DataTree objects with parents (pydata#9494)
  DAS-2155 - Merge datatree documentation into main docs. (pydata#9033)
  Make illegal path-like variable names when constructing a DataTree from a Dataset (pydata#9378)
  Ensure TreeNode doesn't copy in-place (pydata#9482)
  `open_groups` for zarr backends (pydata#9469)
  Update pyproject.toml (pydata#9484)
  New whatsnew section (pydata#9483)
* main:
  Turn off survey banner (pydata#9512)
  Stateful test: silence DeprecationWarning from drop_dims (pydata#9508)
Labels: topic-chunked-arrays (Managing different chunked backends, e.g. dask), topic-dask, topic-groupby
4 participants