Splitting out lazy indexing layer and backends layer as zarr-python features #9281

Open
TomNicholas opened this issue Jul 26, 2024 · 3 comments
Labels
enhancement topic-arrays related to flexible array support topic-backends topic-chunked-arrays Managing different chunked backends, e.g. dask topic-indexing topic-internals topic-lazy array topic-zarr Related to zarr storage library

Comments

@TomNicholas
Contributor

TomNicholas commented Jul 26, 2024

What is your issue?

tl;dr: Could we factor out all of xarray's lazy indexing + backends as fully-featured virtual lazy zarr arrays?

When you do xr.open_dataset, a few main things happen:

  1. the data on disk is examined and a lazy representation built (which knows the data's shape and dtype)
  2. decoding steps (following CF conventions) are set up ready to happen upon materialization of bytes
  3. materialization of bytes is delayed by xarray's intermediate lazy indexing classes, which build a representation of successive slicing operations
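
To make step (3) concrete, here is a minimal sketch of the idea of lazy indexing. It is not xarray's actual implementation: `LazySlicedArray`, `read_from_disk`, and the 1-D, slice-only restriction are all assumptions made purely for illustration. Successive slices are composed into one `range`, and the (pretend) read from disk only happens on `load()`.

```python
class LazySlicedArray:
    """Hypothetical 1-D lazy array: records slices, reads data only on load()."""

    def __init__(self, loader, length, index_range=None):
        self._loader = loader  # callable that returns the full list of values
        self._length = length
        # A range stands in for "which elements of the source are selected"
        self._range = index_range if index_range is not None else range(length)

    @property
    def shape(self):
        # Shape is known without reading any data
        return (len(self._range),)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        # Slicing a range composes with earlier slices without touching data
        return LazySlicedArray(self._loader, self._length, self._range[key])

    def load(self):
        # Only now is the loader called and the composed selection applied
        data = self._loader()
        return [data[i] for i in self._range]


calls = []


def read_from_disk():
    # Stand-in for an expensive read; records that it was called
    calls.append("read")
    return list(range(100, 110))


lazy = LazySlicedArray(read_from_disk, 10)
view = lazy[2:8][::2]      # two successive slices, still no read
reads_before_load = len(calls)
values = view.load()       # the single read happens here
```

The point of the sketch is only that shape is available up front and that chained slicing costs nothing until materialization.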

When you do virtualizarr.open_virtual_dataset then also:

  4. a chunk-level metadata-only lazy representation of data on-disk is created (the "chunk Manifest" inside the ManifestArray), which also knows the shape and dtype.

In zarr-developers/zarr-specs#303 we've suggested that, instead of going through various xarray backends, (1) and (2) could be handled by zarr + chunk manifests + cf-specific zarr codecs.

For (3), note that currently we have lazy indexing in Xarray but not lazy concatenation, and in VirtualiZarr we kind of have lazy chunk-level concatenation without lazy indexing.

(4) is currently implemented separately from zarr-python in virtualizarr, but also notice that a virtualizarr.ManifestArray has all the information needed to actually go fetch data - in other words it could be converted directly to an actual zarr.Array (mentioned by @ayushnag in zarr-developers/VirtualiZarr#124).
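
As a rough illustration of what "lazy chunk-level concatenation" means, here is a sketch that treats a chunk manifest as a plain dict. This is not VirtualiZarr's `ManifestArray` code: the dict layout, the `concat_manifests` helper, and the file names are hypothetical, and the sketch assumes both inputs share the same chunk shape so that their chunk grids line up.

```python
def concat_manifests(first, second, axis=0):
    """Concatenate two chunk manifests along `axis` without reading chunk data.

    Each manifest maps a chunk-grid key (tuple of ints) to a
    (path, byte_offset, byte_length) reference. Hypothetical layout.
    """
    # Number of chunks `first` has along the concatenation axis
    offset = max(key[axis] for key in first) + 1
    combined = dict(first)
    for key, ref in second.items():
        # Shift the second manifest's chunk keys past the end of the first
        shifted = key[:axis] + (key[axis] + offset,) + key[axis + 1:]
        combined[shifted] = ref
    return combined


a = {(0, 0): ("a.nc", 0, 400), (1, 0): ("a.nc", 400, 400)}
b = {(0, 0): ("b.nc", 0, 400)}
combined = concat_manifests(a, b, axis=0)
```

Only chunk keys are rewritten; every entry still carries enough information (path plus byte range) to fetch the bytes later, which is why such an object could in principle back a real array.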


Imagine that we enabled the zarr.Array type (or some new VirtualZarrArray type) to do both indexing and concatenation lazily (proposed in zarr-developers/zarr-python#1603), and open netCDF / other files via the chunk manifest (see zarr-developers/zarr-specs#287). It could also write out just its metadata to disk via the chunk manifest ZEP. This would then:

The result would be that xarray users would basically open data (netCDF or zarr or otherwise) and see VirtualZarrArrays wrapped by Xarray. They could then do lazy operations as they do now, and either load actual values via .compute or save only the lazy metadata representation to disk as a virtual zarr store (i.e. what virtualizarr does right now). The latter could be created by special serialization functions that understand how to translate a chain of lazy Zarr array operations into a valid metadata-only zarr-compliant format on-disk, or you could even imagine ds.to_zarr having a boolean virtual kwarg to cover both cases.

The lazy layer could either be implemented inside zarr or live on top of it and be importable from other packages (i.e. #5081, see also data-apis/array-api#777).

All together this would give you:

  1. Zarr arrays that can open and decode netCDF directly (a la Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs zarr-developers/zarr-specs#303)

  2. Lazy Zarr arrays even without Xarray

  3. Ability to save virtual datasets without needing a dedicated ManifestArray type (i.e. the lazy concatenation functionality of VirtualiZarr in zarr-python itself)

  4. Separation of the metadata-reading logic of kerchunk/VirtualiZarr from the lazy concatenation stuff, so VirtualiZarr gets demoted to just being a repository for readers for specific file formats and codecs for them.

  5. Complete separation of:

The main subtlety I see here is selection in index-space vs chunk-space - xarray does the former but VirtualiZarr does the latter (see also zarr-developers/VirtualiZarr#183). This is what @d-v-b was getting at in zarr-developers/VirtualiZarr#71.
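
A small sketch of why the two spaces differ, under the assumption of a 1-D array with regular chunks (the function name and signature are hypothetical): an index-space slice can only be expressed in chunk-space when it lands on chunk boundaries.

```python
def index_slice_to_chunk_slice(start, stop, chunk_size, length):
    """Map the index-space slice [start, stop) to a chunk-space slice.

    Raises ValueError if the selection would cut through a chunk, since a
    metadata-only representation cannot describe part of a chunk.
    """
    if start % chunk_size != 0:
        raise ValueError("start is not on a chunk boundary")
    # The final chunk may be partial, so stopping at the array end is allowed
    if stop % chunk_size != 0 and stop != length:
        raise ValueError("stop is not on a chunk boundary")
    # Ceiling division so a trailing partial chunk is included
    return slice(start // chunk_size, -(-stop // chunk_size))


aligned = index_slice_to_chunk_slice(20, 50, chunk_size=10, length=100)
tail = index_slice_to_chunk_slice(90, 95, chunk_size=10, length=95)

try:
    index_slice_to_chunk_slice(25, 50, chunk_size=10, length=100)
    misaligned_ok = True
except ValueError:
    misaligned_ok = False
```

An index-space lazy layer would have to handle the misaligned case by remembering the partial selection and applying it after the chunk is decoded.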

Whilst this is a longer-term roadmap idea, now is the time to think about it because of the malleability of zarr-python right now (e.g. zarr-developers/zarr-python#2052).

cc @dcherian @jhamman @joshmoore @sharkinsspatial @abarciauskas-bgse

@TomNicholas
Contributor Author

TomNicholas commented Jul 26, 2024

To go even further - I think you could imagine replacing xarray's entire backends layer once you had this.

Right now there are effectively two totally distinct ways to teach xarray how to open a new file format:

  1. Write an xarray backend engine (that works by creating a subclass of xarray.backends.BackendArray)
  2. Add a kerchunk reader / virtualizarr reader (that works by reading metadata and determining byte ranges to data on-disk, leaving reading those data bytes for later)

I'm suggesting that with enough virtualization / lazy features upstreamed into zarr, xarray might be able to use (2) for everything.
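
For readers less familiar with approach (2), here is a toy sketch of the two phases: scanning for byte ranges first, reading bytes later. It is not kerchunk's or VirtualiZarr's API. The flat little-endian float64 file layout and the helper names `scan_references` and `read_chunk` are assumptions for illustration; a real reader would get the chunk layout by parsing the file's own metadata.

```python
import os
import struct
import tempfile

ITEMSIZE = struct.calcsize("<d")  # 8 bytes per float64


def scan_references(path, n_values, chunk_len):
    """Build {chunk_index: (path, offset, nbytes)} without reading any values."""
    refs = {}
    for i, first in enumerate(range(0, n_values, chunk_len)):
        count = min(chunk_len, n_values - first)
        refs[i] = (path, first * ITEMSIZE, count * ITEMSIZE)
    return refs


def read_chunk(ref):
    """Materialize one chunk by seeking to its byte range."""
    path, offset, nbytes = ref
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(nbytes)
    return list(struct.unpack(f"<{nbytes // ITEMSIZE}d", raw))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(struct.pack("<10d", *[float(i) for i in range(10)]))

    refs = scan_references(path, n_values=10, chunk_len=4)
    ref_offsets = [(offset, nbytes) for _, offset, nbytes in refs.values()]
    last_chunk = read_chunk(refs[2])  # bytes are only read at this point
```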

Are there any types of files for which this approach would not work?

cc @kmuehlbauer @shoyer

@TomNicholas
Contributor Author

Thinking about this more, I have an alternative shorter-term suggestion that I think could get a lot of these benefits with no extra changes required to xarray or zarr-python beyond what is already planned - see zarr-developers/VirtualiZarr#124 (comment)

@dcherian
Contributor

dcherian commented Aug 7, 2024

+10 for engine="virtualizarr" that supports a range of virtual reference formats.

Your more general suggestion has the drawback that we'd have to scan a file before ever reading it if the references didn't exist, and nothing guarantees these references are up-to-date.
