
Loading data from ManifestArrays without saving references to disk first #124

Open
ayushnag opened this issue May 23, 2024 · 3 comments

@ayushnag
Contributor

ayushnag commented May 23, 2024

I am working on a feature in virtualizarr to read DMR++ metadata files and create a virtual xr.Dataset containing ManifestArrays, which can then be virtualized. This is the current workflow:

```python
vdatasets = parser.parse(dmrs)
# vdatasets are xr.Datasets containing ManifestArrays
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
mds.virtualize.to_kerchunk(filepath=outfile, format=outformat)
ds = xr.open_dataset(outfile, engine="virtualizarr", ...)
ds.time.values
```

However, the chunk manifests, encoding, attrs, etc. are already in mds, so is it possible to read data directly from this dataset? My understanding is that once the "chunk manifest" ZEP is approved and the zarr-python reader in xarray is updated, this should be possible. The xarray reader for kerchunk can accept either a file or the reference JSON object produced directly by kerchunk's SingleHdf5ToZarr and MultiZarrToZarr. So, similarly, can we extract the refs from mds and pass them to xr.open_dataset() directly?

There would probably still need to be a function that extracts the refs, so that xarray can build a new Dataset object with all the indexes, cf_time handling, and open_dataset checks:

```python
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
refs = mds.virtualize()
ds = xr.open_dataset(refs, engine="virtualizarr", ...)
```
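For what it's worth, fsspec's reference filesystem already accepts kerchunk-style references as an in-memory dict, so no intermediate file is strictly required on that side. A minimal sketch of just that mechanism, using a toy inline (base64) reference rather than real chunk data:

```python
import fsspec

# A minimal kerchunk-style reference set passed as an in-memory dict.
# Inline data uses the "base64:" prefix instead of pointing at a file.
refs = {"version": 1, "refs": {"a/0": "base64:aGVsbG8="}}

fs = fsspec.filesystem("reference", fo=refs)
print(fs.cat("a/0"))  # b'hello'
```

In principle, refs extracted from mds could be fed straight to `xr.open_dataset("reference://", engine="zarr", backend_kwargs={"storage_options": {"fo": refs}, "consolidated": False})` in the same way.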

Reading directly from the ManifestArray dataset might even be possible, but I'm not sure how the new dataset object with numpy arrays and indexes would be kept separate from the original dataset:

```python
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
mds.time.values
```
@ayushnag ayushnag reopened this May 23, 2024
@TomNicholas TomNicholas changed the title Reading data from ManifestArray's Loading data from ManifestArrays without saving references to disk first May 23, 2024
@TomNicholas
Collaborator

TomNicholas commented Aug 7, 2024

Thinking about this more, once zarr-python Array objects support the manifest storage transformer, we should be able to write a new method on ManifestArray objects which constructs the zarr.Array directly, i.e.

```python
def to_zarr_array(self: ManifestArray) -> zarr.Array:
    ...
```

This opens up some interesting possibilities. Currently when you call .compute on a virtual dataset you get a NotImplementedError, but with this we could change the behaviour to instead:

  1. Turn the ManifestArray into a zarr.Array.
  2. Use xarray's zarr backend machinery to open that zarr array the same way it normally would during xr.open_zarr,
  3. which includes wrapping it in xarray's lazy indexing classes.
  4. Call the .compute behaviour that xarray would normally use.

The result would be that a user could treat a "virtual" xarray Dataset as a normal xarray Dataset, because if they try to .compute it, it should transform itself into a normal one under the hood!
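The four steps above might be sketched as follows. This is pseudocode, not a working implementation: to_zarr_array is the proposed (not yet existing) method, and ZarrArrayWrapper and LazilyIndexedArray are private xarray internals named here only to show where each step would plug in:

```python
# Pseudocode sketch of the proposed load path (names are illustrative only)
def load_manifest_variable(marr: ManifestArray) -> np.ndarray:
    zarr_array = marr.to_zarr_array()                   # 1. ManifestArray -> zarr.Array
    backend_array = ZarrArrayWrapper(zarr_array)        # 2. xarray's zarr backend adapter
    lazy = indexing.LazilyIndexedArray(backend_array)   # 3. lazy indexing wrapper
    return np.asarray(lazy)                             # 4. the eventual .compute / load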

Then you could open any data format that virtualizarr understands via vz.open_virtual_dataset (or maybe eventually xr.open_dataset(engine='virtualizarr')), and if you want to treat it like an in-memory xarray Dataset from that point on then you can, but if you prefer to manipulate it and save it out as a virtual zarr store on disk you can also do that!

I still need to think through some of the details, but this could potentially be a neat alternative approach to pydata/xarray#9281, and not actually require any upstream changes to xarray!

cc @d-v-b

@TomNicholas
Collaborator

(One subtlety I'm not sure about here would be around indexes. I think you would probably want to have a solution for loading indexes as laid out in #18, and then have the indexes understand how they can be loaded.)

@TomNicholas
Collaborator

Another subtlety to consider is when the CF decoding should happen. You would then effectively have done open_dataset in a very roundabout way, and we need to make sure not to forget the CF decoding step somewhere in there.
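For reference, the step in question is what xr.decode_cf applies on top of raw stored values; whichever path materializes the zarr.Array would need to route through it. A minimal self-contained illustration:

```python
import numpy as np
import xarray as xr

# Raw variable as it would come out of a store: plain ints plus CF attrs
raw = xr.Dataset(
    {"time": ("time", np.arange(3),
              {"units": "days since 2000-01-01", "calendar": "standard"})}
)

# decode_cf turns the encoded ints into datetime64 values
decoded = xr.decode_cf(raw)
print(decoded.time.values[0])
```

Skipping this step would hand the user raw integers where they expect timestamps, which is exactly the failure mode worth guarding against.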
