
Loading data from ManifestArrays without saving references to disk first #124

Open
ayushnag opened this issue May 23, 2024 · 3 comments

@ayushnag
Contributor

ayushnag commented May 23, 2024

I am working on a feature in virtualizarr to read DMR++ metadata files and create a virtual xr.Dataset containing ManifestArrays, which can then be virtualized. This is the current workflow:

```python
vdatasets = parser.parse(dmrs)
# vdatasets are xr.Datasets containing ManifestArrays
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
mds.virtualize.to_kerchunk(filepath=outfile, format=outformat)
ds = xr.open_dataset(outfile, engine="virtualizarr", ...)
ds.time.values
```

However, the chunk manifests, encoding, attrs, etc. are already in mds, so is it possible to read data directly from this dataset? My understanding is that once the "chunk manifest" ZEP is approved and the zarr-python reader in xarray is updated, this should be possible. The xarray reader for kerchunk can accept either a file or the reference JSON object produced directly by kerchunk's SingleHdf5ToZarr and MultiZarrToZarr. So, similarly, can we extract the refs from mds and pass them to xr.open_dataset() directly?

There would probably still need to be a function that extracts the refs, so that xarray can build a new Dataset object with all the indexes, cf_time handling, and open_dataset checks:

```python
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
refs = mds.virtualize()
ds = xr.open_dataset(refs, engine="virtualizarr", ...)
```
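For what it's worth, fsspec's reference filesystem already accepts kerchunk-style references as an in-memory dict, so no intermediate file is strictly required on that side. A minimal sketch of just that mechanism, using a toy inline (base64) reference rather than real chunk data:

```python
import fsspec

# A minimal kerchunk-style reference set passed as an in-memory dict.
# Inline data uses the "base64:" prefix instead of pointing at a file.
refs = {"version": 1, "refs": {"a/0": "base64:aGVsbG8="}}

fs = fsspec.filesystem("reference", fo=refs)
print(fs.cat("a/0"))  # b'hello'
```

In principle, refs extracted from mds could be fed straight to `xr.open_dataset("reference://", engine="zarr", backend_kwargs={"storage_options": {"fo": refs}, "consolidated": False})` in the same way.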

Reading directly from the ManifestArray dataset might even be possible, but I'm not sure how the new dataset object with numpy arrays and indexes would be kept separate from the original dataset:

```python
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
mds.time.values
```
@ayushnag ayushnag reopened this May 23, 2024
@TomNicholas TomNicholas changed the title Reading data from ManifestArray's Loading data from ManifestArrays without saving references to disk first May 23, 2024
@TomNicholas
Collaborator

TomNicholas commented Aug 7, 2024

Thinking about this more, once zarr-python Array objects support the manifest storage transformer, we should be able to write a new method on ManifestArray objects which constructs the zarr.Array directly, i.e.

```python
def to_zarr_array(self: ManifestArray) -> zarr.Array:
    ...
```

This opens up some interesting possibilities. Currently when you call .compute on a virtual dataset you get a NotImplementedError, but with this we could change the behaviour to instead:

  1. Turn the ManifestArray into a zarr.Array.
  2. Use xarray's zarr backend machinery to open that zarr array the same way it normally would during xr.open_zarr,
  3. which includes wrapping it in xarray's lazy indexing classes.
  4. Call the .compute behaviour that xarray would normally use.

The result would be that a user could treat a "virtual" xarray Dataset as a normal xarray Dataset, because if they try to .compute it, it should transform itself into a normal one under the hood!
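The four steps above might be sketched as follows. This is pseudocode, not a working implementation: to_zarr_array is the proposed (not yet existing) method, and ZarrArrayWrapper and LazilyIndexedArray are private xarray internals named here only to show where each step would plug in:

```python
# Pseudocode sketch of the proposed load path (names are illustrative only)
def load_manifest_variable(marr: ManifestArray) -> np.ndarray:
    zarr_array = marr.to_zarr_array()                   # 1. ManifestArray -> zarr.Array
    backend_array = ZarrArrayWrapper(zarr_array)        # 2. xarray's zarr backend adapter
    lazy = indexing.LazilyIndexedArray(backend_array)   # 3. lazy indexing wrapper
    return np.asarray(lazy)                             # 4. the eventual .compute / load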

Then you could open any data format that virtualizarr understands via vz.open_virtual_dataset (or maybe eventually xr.open_dataset(engine='virtualizarr')), and if you want to treat it like an in-memory xarray Dataset from that point on then you can, but if you prefer to manipulate it and save it out as a virtual zarr store on disk you can also do that!

I still need to think through some of the details, but this could potentially be a neat alternative approach to pydata/xarray#9281, and not actually require any upstream changes to xarray!

cc @d-v-b

@TomNicholas
Collaborator

(One subtlety I'm not sure about here would be around indexes. I think you would probably want to have a solution for loading indexes as laid out in #18, and then have the indexes understand how they can be loaded.)

@TomNicholas
Collaborator

Another subtlety to consider is when the CF decoding should happen. You would then effectively have done open_dataset in a very roundabout way, and we need to make sure not to forget the CF decoding step somewhere in there.
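For reference, the step in question is what xr.decode_cf applies on top of raw stored values; whichever path materializes the zarr.Array would need to route through it. A minimal self-contained illustration:

```python
import numpy as np
import xarray as xr

# Raw variable as it would come out of a store: plain ints plus CF attrs
raw = xr.Dataset(
    {"time": ("time", np.arange(3),
              {"units": "days since 2000-01-01", "calendar": "standard"})}
)

# decode_cf turns the encoded ints into datetime64 values
decoded = xr.decode_cf(raw)
print(decoded.time.values[0])
```

Skipping this step would hand the user raw integers where they expect timestamps, which is exactly the failure mode worth guarding against.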
