Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get xarray.testing.assert_identical to work on datasets containing ManifestArrays #161

Closed
TomNicholas opened this issue Jun 25, 2024 · 4 comments · Fixed by #188
Closed
Labels
bug Something isn't working testing xarray Requires changes to xarray upstream

Comments

@TomNicholas
Copy link
Collaborator

We should be able to get xarray.testing.assert_identical to work on datasets containing ManifestArrays. I'm not sure why it doesn't already work (at least for the case of no indexes).

AssertionError: Left and right Dataset objects are not identical

Differing coordinates:
L   ocean_time  (ocean_time) float64 16B ManifestArray<shape=(2,), dtype=floa...
R   ocean_time  (ocean_time) float64 16B ManifestArray<shape=(2,), dtype=floa...
L   s_rho       (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
R   s_rho       (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
Differing data variables:
L   lat_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R   lat_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L   Cs_r        (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
R   Cs_r        (s_rho) float64 240B ManifestArray<shape=(30,), dtype=float64...
L   h           (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R   h           (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L   lon_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
R   lon_rho     (eta_rho, xi_rho) float64 567kB ManifestArray<shape=(191, 371...
L   hc          float64 8B ManifestArray<shape=(), dtype=float64, chunks=()>
R   hc          float64 8B ManifestArray<shape=(), dtype=float64, chunks=()>
L   salt        (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB ManifestArr...
R   salt        (ocean_time, s_rho, eta_rho, xi_rho) float32 17MB ManifestArr...
L   zeta        (ocean_time, eta_rho, xi_rho) float32 567kB ManifestArray<sha...
R   zeta        (ocean_time, eta_rho, xi_rho) float32 567kB ManifestArray<sha...

Originally posted by @scottyhq in #143 (comment)

@scottyhq
Copy link
Contributor

I think this one should be reopened @TomNicholas, still some issues I'm seeing:

Using the example from the docs, this is the only incantation I've found that currently passes :) I'm looking at array_api.py tying to figure out what it does. I'm guessing assert_identical somehow loads each ManifestArray because it needs to check the values?

ds = xr.tutorial.open_dataset('air_temperature', decode_times=False)
ds.to_netcdf('air.nc')
vds = open_virtual_dataset('air.nc',
                           indexes={},
                           loadable_variables=['air','time','lon','lat'],
)

xr.testing.assert_identical(ds, vds)

@scottyhq
Copy link
Contributor

Tracebacks:

1

I think this Error is coming from 'air' having a 'scale_factor' attribute...(#68)

ds = xr.tutorial.open_dataset('air_temperature')
ds.to_netcdf('air.nc')
vds = open_virtual_dataset('air.nc')
xr.testing.assert_identical(ds, vds)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

2

(Doesn't work unless indexes={} passed to open_virtual_dataset)

xr.testing.assert_identical(ds.lat, vds.lat)
# TypeError: operand type(s) all returned NotImplemented from __array_ufunc__(<ufunc 'equal'>, '__call__', array([75. , 72.5, 70. , 67.5, 65. , 62.5, 60. , 57.5, 55. , 52.5, 50. ,
       47.5, 45. , 42.5, 40. , 37.5, 35. , 32.5, 30. , 27.5, 25. , 22.5,
       20. , 17.5, 15. ], dtype=float32), ManifestArray<shape=(25,), dtype=float32, chunks=(25,)>): 'ndarray', 'ManifestArray'

3

(Differing time values, not sure why)

vds = open_virtual_dataset('air.nc',
                           indexes={},
                           loadable_variables=['air','lon','lat','time'],
                           cftime_variables=['time'])

xr.testing.assert_identical(ds, vds)
AssertionError: Left and right Dataset objects are not identical

Differing coordinates:
L * time     (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
R * time     (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...

@TomNicholas
Copy link
Collaborator Author

@scottyhq you're misunderstanding - ds and vds are not supposed to be identical!

I'm guessing assert_identical somehow loads each ManifestArray because it needs to check the values?

There is currently no way to load the values of a ManifestArray directly. The only way to get at them is to save the information in the manifestarray to disk (as kerchunk references) then open and load a normal ds using them.

(For the general idea that it might be useful to load a ManifestArray, see #124 and pydata/xarray#9281)

What these assertions do at the moment is go through ManifestArray.__eq__, which has its own definition of equality. It is intended for comparing against other ManifestArray objects by checking that all entries in the chunk manifest are the same, i.e.

xr.testing.assert_identical(vds1, vds2)

It's supposed to return False when comparing against anything that is not a ManifestArray, but the diversity of errors you're getting here might be partly due to this TODO:

# TODO what should this do when comparing against numpy arrays?

(Note also that this __eq__ method is getting an upgrade in #31, though that's a feature orthogonal to this issue.)

@TomNicholas
Copy link
Collaborator Author

TomNicholas commented Aug 1, 2024

Your example with the loadable_variable (and the example in #165 (comment)) are different because loadable variables become just normal (lazy) numpy arrays, so a virtual dataset with all loadable variables is not virtual at all, it's just a normal xarray Dataset!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working testing xarray Requires changes to xarray upstream
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants