Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grouper object design doc #8510
Grouper object design doc #8510
Changes from 3 commits
9ce8d9c
b543047
39c5d25
b394451
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How fundamentally different are space and time resampling? If the difference is small enough (units / step size), would it be possible to merge both into a single
Resampler
Grouper object? Then it might also be possible to apply this to coordinates with physical dimensions other than space / time.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this could be done, they're the same, its just the user interface that could be different. For time it's nice to write
freq="M"
and for spacespacing=2.5
(?). The other arguments carry over but fortime
they can betimedelta
ordatetime
objects. So it feels nice to separate them for easy validation, and makes the user code more readable.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main point was that we might want to resample using (numeric?) coordinates that are neither time nor space, but "spacing" might be appropriate in that case, as well? In which case we'd only need a general "Resampler" and a specialized "TimeResampler" (4 in total:
Resampler
/RollingResampler
andTimeResampler
/RollingTimeResampler
). Unless I'm missing something?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, agree! The
Resampler
is really just a convenient wrapper aroundBinGrouper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prototype Implemented here: https://github.com/dcherian/xarray/blob/grouper-public/xarray/core/groupers.py
cc @tomvothecoder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better name?
"encode"
?"to_codes"
?"label"
?"to_integer_labels"
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for something involving "labels", as IIUC this method is about grouping coordinate "labels" into common groups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.find_groups()
? IIUC The goal of aGrouper
is to find and extract the groups from a variable? Assuming that this method returns all the information needed to represent the groups, wouldn't it be better to encapsulate the returned items in an instance of a simple dataclassGroups
?And maybe rename the method argument from
group
tovar
? I find "group" a bit confusing here... It is more a variable that possibly contains groups of a certain kind.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of like
find_groups
. (As an aside this maybe does not depend so much, as this method will almost only be used internally?)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. In flox I use
by
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit late, but throughout the document you're using the term "factorize" when you talk about constructing an array of integer indices for the "split" step, so maybe that means that using
factorize
as the method name would be fine?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't like
factorize
because no one knows what it means :) AFAICT pandas chose that name, and I haven't seen it used anywhere else. I thinklabel_groups
would be best, and most easy to understand for a new user.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, that makes sense. However, since this is advanced API I think we can require people to learn the new meaning, which is not actually new since we're borrowing from
pandas
(we could takesplit_indices
,to_split_indices
or just.split
if we wanted to stay closer to "split-apply-combine", though).What does other "split-apply-combine" software call this? From what I can tell:
polars
:to_physical
(I don't get what the background there is)dataframes.jl
:compute_indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be part of the interface? I think it can be calculated from
codes
if necessary.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it can be calculated.
We'd lose an optimization where the current code uses slices for resampling, and unindexed coordinates. I think it's valuable to preserve the former at least. What do you think of this function returning a named tuple with an optional
indices
property. That way, Groupers can opt in to optimizing the indices, potentially reusing any intermediate step they've computed for generating thecodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think returning a dataclass with some optional fields would indeed be a nice way to handle this. It would also be forward compatible with any future interface changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to_weights"
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"compute_weights"? "find_weights"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weights
seems potentially useful, but I'm not sure we need it for composingweighted()
andgroupby()
. For example, you could implementx.weighted(w).groupby(k).mean()
with something like(x * w).groupby(k).sum() / w.groupby(k).sum()
. This would likely be more efficient, too, at least if the grouped over dimension is large.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your implementation point but the
weights
can depend on the parameters to the Grouper object. For example a time bounds-aware resampling would have non-uniform weights, (and the Grouper object is more of a Regridder object). Adding the weights method would allow us to encapsulate that functionality.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to allow these operations to be efficient, we should definitely allow and encourage returning weights as sparse arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite speculative but seems cool!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interested in what else is out there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sklearn's clustering?
https://scikit-learn.org/stable/modules/clustering.html#clustering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would be quite happy to have this directly in xarray! We have are own implementation at xclim (https://github.com/Ouranosinc/xclim/blob/b905e5529f2757a49f4485b7713b0ef01a45df2c/xclim/sdba/base.py#L103) but it's kind of a mess. Grouping per time period with a window is very useful and common in model output statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs input on what we expect custom
factorize
methods to implement.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's probably best to try implementing the most important use cases from the lists you've been assembling: by doing so I believe we can figure out the most important ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use some input here too, requires some internal cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on the namespace? And what other "groupers" would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems fine to me. We'd need to figure out which groupers to declare out-of-scope for inclusion in
xarray
itself, though.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this at a dev meeting a while ago, but would be good to rethink and commit to a path now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there cases where reusing Groupers would actually be useful? I think this is mostly just about a syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had one where I wanted to use the same bin specification for two variables. But I agree that its minor.
Nominally every property on the grouper is independent of the data variable it is applied to. I think that motivated my suggestion to not encapsulate the data variable in the grouper. But every method on the grouper relies on the data variable as input. So 🤷🏾♂️ .
Do you have any advice here on how to think about the coupling/decoupling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this is probably fine. The variables can also be coordinates (e.g., on a DataArray).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go with that, how would we support variables that don't align with the dataset? I was under the impression that that currently works (not sure, though). If we require a new dimension name for the grouper variable that would be fine with me, but it's also something we should explicitly state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't. We use an "exact" join today:
xarray/xarray/core/groupby.py
Lines 616 to 620 in 41d33f5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case sounds good to me,
.assign_coords({name: arr})
is not that much more to write.