Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add overloads to get_axis_num #8547

Merged
merged 4 commits into from
Jan 30, 2024
Merged

Conversation

Illviljan
Copy link
Contributor

Add overloads to .get_axis_num because you will get the same type out as you put in.

Seen in #8344.

@Illviljan
Copy link
Contributor Author

Illviljan commented Dec 12, 2023

I'm skeptical get_axis_num should be this public, especially in namedarray where there is no backwards compatibility to think about. Because one of xarray's selling points is to not have to deal with axis-likes. This is quite easy too:

a = xr.namedarray.core.NamedArray(data=np.array([[1, 2, 3], [4, 5, 6]]), dims=("x", "y"))
print(a.dims.index("y"))
1
print(a.get_axis_num("y"))
1

But that's a discussion for another PR.

@max-sixty
Copy link
Collaborator

print(a.dims.index("y"))
1

Good point! I guess the main advantage is the ability to pass tuples to get_axis_num. (Though we could alternatively inherit and override that). I like the idea of using more standard methods...

Though the python error message for .index is particularly bad:

da.dims.index('foo')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 1
----> 1 da.dims.index('foo')

ValueError: tuple.index(x): x not in tuple

@Illviljan
Copy link
Contributor Author

print(a.dims.index("y"))
1

Good point! I guess the main advantage is the ability to pass tuples to get_axis_num. (Though we could alternatively inherit and override that). I like the idea of using more standard methods...

When is it even required for a 3rd party user to use it? We of course have to use a helper like this internally, but when should a user need it?

Though the python error message for .index is particularly bad:

da.dims.index('foo')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 1
----> 1 da.dims.index('foo')

ValueError: tuple.index(x): x not in tuple

I think it's fine? The value you tried to search for in the tuple wasn't there. Makes sense to me.
If we are going to hide it within our own public method I agree that it might be too generic.

@dcherian
Copy link
Contributor

but when should a user need it?

Any time you need to get out of Xarray and pass an axis kwarg, but I agree that it's sugar for dims.index.

@Illviljan
Copy link
Contributor Author

but when should a user need it?

Any time you need to get out of Xarray and pass an axis kwarg, but I agree that it's sugar for dims.index.

Is that what we want to encourage?

It feels out of scope to me for xarray to also handle cases outside xarray-land. We have apply_ufunc (although, quite advanced) that can handle a lot of these cases where we haven't implemented a particular function.

I mainly care about namedarray here. I think the method goes against the spirit of the package and don't think it should be one of the first methods to go public.

Here's some funny little examples:

# A typical Hashable: 
a = xr.namedarray.core.NamedArray(data=np.array([[1, 2, 3], [4, 5, 6]]), dims=("x", "y"))
a.dims.index("y")
1
a.get_axis_num("y")
1

# Another Hashable:
b = xr.namedarray.core.NamedArray(data=np.array([[1, 2, 3], [4, 5, 6]]), dims=(2023, None))
b.dims.index(None)
1
b.get_axis_num(None)
1

# And another Hashable:
c = xr.namedarray.core.NamedArray(data=np.array([[1, 2, 3], [4, 5, 6]]), dims=(("x", "y"), ("r",)))
c.dims.index(("r",))
1
c.get_axis_num(("r",)) # ValueError: 'r' not found in array dimensions (('x', 'y'), ('r',))

# And another Hashable:
d = xr.namedarray.core.NamedArray(data=np.array([[1, 2, 3], [4, 5, 6]]), dims=(b"x", b"y"))
d.dims.index(b"y")
1
d.get_axis_num(b"y") # ValueError: 121 not found in array dimensions (b'x', b'y')

@Illviljan Illviljan added the plan to merge Final call for comments label Dec 13, 2023
@max-sixty
Copy link
Collaborator

ValueError: tuple.index(x): x not in tuple

I think it's fine? The value you tried to search for in the tuple wasn't there. Makes sense to me.
If we are going to hide it within our own public method I agree that it might be too generic.

I guess we disagree a bit there — the value it gives (x) isn't the input — it neither prints the actual input foo nor the possible values.


Maybe we make private? I generally agree we shouldn't encourage it. I do think allowing tuples + giving a better error message are useful, though.

Would also support inheriting from tuple and then just doing the da.dims.index('foo'). That would also let us do #8332 with DimsTuple(...) - ['foo', 'bar']

@TomNicholas
Copy link
Contributor

It feels out of scope to me for xarray to also handle cases outside xarray-land. We have apply_ufunc (although, quite advanced) that can handle a lot of these cases where we haven't implemented a particular function.

80% of the scientists who use xarray regularly don't even know that apply_ufunc exists, let alone are able to use it (even I struggle to use it sometimes). It's extremely common for users to use xarray for a while (especially to start off analysis by opening netCDF-like files) then if they get stuck drop back into numpy to apply some more specific custom computation or manipulation.

Obviously we want users to be able to express everything they want to do in xarray as intuitively as possible, and we shouldn't encourage them to leave xarray, but I don't think we should actively make it harder for them to do so either.

To give a concrete example - many operations in scipy etc. don't quite have the correct API to immediately work with apply_ufunc, and it's often easier for the user to just do:

weird_analysis_func_expecting_numpy_arrays(da.data, axis=da.get_axis_num('x'))

I think this is fine. It's a bit of a special/weird method, but I don't think we should get rid of it.

Here's some funny little examples:

These are interesting, but to me represent issues with either (a) our handling of dimensions as hashables or (b) the current implementation of get_axis_num.

@Illviljan
Copy link
Contributor Author

Illviljan commented Dec 14, 2023

I think those issues also show that it can be better to just rely on python's own tuple methods instead. Here's how they could have done it instead in less code:

weird_analysis_func_expecting_numpy_arrays(da.data, axis=da.dims.index('x'))

I think this is fine. It's a bit of a special/weird method, but I don't think we should get rid of it.

I'm not really pushing to deprecate and remove. I just don't think it should be added to NamedArray as one of the first available, public methods. da.dims.index works better and is shorter for most axis cases.

@TomNicholas
Copy link
Contributor

TomNicholas commented Dec 15, 2023

better to just rely on python's own tuple methods instead. Here's how they could have done it instead in less code:

Except that (as @max-sixty noted above) using .index won't work for applying the function over more than one dimension, i.e.

weird_analysis_func_expecting_numpy_arrays(da.data, axis=da.dims.index(['x', 'y']))
ValueError: tuple.index(x): x not in tuple

We could override .dims to return an object which implements a special version of .index, but at that point you still have a dedicated method for solving this problem, you're just debating the syntax for it.

@headtr1ck
Copy link
Collaborator

headtr1ck commented Dec 15, 2023

Should the input also be typed as str | Iterable[Hashable] to align with many other methods?

@Illviljan
Copy link
Contributor Author

Illviljan commented Dec 15, 2023

using .index won't work for applying the function over more than one dimension,

Yup, the (few?) cases you would need multiple axis it would look like:

axes = tuple(da.dims.index(v) for v in ('x', 'y'))
weird_analysis_func_expecting_numpy_arrays(da.data, axis=axes)

Should the input also be typed as str | Iterable[Hashable] to align with many other methods?

The issue is then the dim None or (2023, "b") wont pass mypy.

Things would have been so much more simple if dims was typed as tuple[str, ...] 😆

@max-sixty
Copy link
Collaborator

Possibly we have strayed too far from the original issue, but one quick comment

The issue is then the dim None or (2023, "b") wont pass mypy.

I think this is OK! For code that wants to be more robust, passing dim=[None] / dim=[(2023, "b")] can be required. At least, that's how I'm holding the move to str | Iterable[Hashable]

@Illviljan
Copy link
Contributor Author

Illviljan commented Dec 15, 2023

Possibly we have strayed too far from the original issue, but one quick comment

The issue is then the dim None or (2023, "b") wont pass mypy.

I think this is OK! For code that wants to be more robust, passing dim=[None] / dim=[(2023, "b")] can be required. At least, that's how I'm holding the move to str | Iterable[Hashable]

Hmm, I find it inconsistent and a constant source for typing difficulties.
For example, I don't like having to do a workaround for dim=var.dims[0] just to get mypy passing, it seem to be as canonical as you can get.
I'm warming up to something like Hashable | list[Hashable] to be easily able to discern between singular and plural dims,

@max-sixty
Copy link
Collaborator

I'm warming up to something like Hashable | list[Hashable] to be easily able to discern between singular and plural dims,

IIRC, the issue with this is that tuples don't work — ('x', 'y') will be a single dim — #6142 for discussion

@andersy005 andersy005 enabled auto-merge (squash) January 29, 2024 23:23
@andersy005 andersy005 merged commit 81f38f3 into pydata:main Jan 30, 2024
28 of 29 checks passed
@andersy005 andersy005 mentioned this pull request Jan 30, 2024
4 tasks
andersy005 added a commit to TomNicholas/xarray that referenced this pull request Jan 30, 2024
* main: (153 commits)
  Add overloads to get_axis_num (pydata#8547)
  Fix CI: temporary pin pytest version to 7.4.* (pydata#8682)
  Bump the actions group with 1 update (pydata#8678)
  [namedarray] split `.set_dims()` into `.expand_dims()` and `broadcast_to()` (pydata#8380)
  Add chunk-friendly code path to `encode_cf_datetime` and `encode_cf_timedelta` (pydata#8575)
  Fix NetCDF4 C version detection (pydata#8675)
  groupby: Don't set `method` by default on flox>=0.9 (pydata#8657)
  Fix automatic broadcasting when wrapping array api class (pydata#8669)
  Fix unstack method when wrapping array api class (pydata#8668)
  Fix `variables` arg typo in `Dataset.sortby()` docstring (pydata#8670)
  dt.weekday_name - removal of function (pydata#8664)
  Add `dev` dependencies to `pyproject.toml` (pydata#8661)
  CI: Pin scientific-python/upload-nightly-action to release sha (pydata#8662)
  Update HOW_TO_RELEASE.md by clarifying where RTD build can be found (pydata#8655)
  ruff: use extend-exclude (pydata#8649)
  new whats-new section (pydata#8652)
  xfail another test on windows (pydata#8648)
  use first element of residual in _nonpolyfit_1d (pydata#8647)
  whatsnew for v2024.01.1
  implement `isnull` using `full_like` instead of `zeros_like` (pydata#7395)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants