Splitting out lazy indexing layer and backends layer as zarr-python features #9281

TomNicholas · 2024-07-26T03:11:01Z

What is your issue?

tl;dr: Could we factor out all of xarray's lazy indexing + backends as fully-featured virtual lazy zarr arrays?

When you do xr.open_dataset, a few main things happen:

1. the data on disk is examined and a lazy representation built (which knows the data's shape and dtype)
2. decoding steps (following CF conventions) are set up ready to happen upon materialization of bytes
3. materialization of bytes is delayed by xarray's intermediate lazy indexing classes, which build a representation of successive slicing operations
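
To make the lazy indexing step concrete, here is a minimal sketch of how successive slices can be recorded without reading any data. `LazySlice1D` and `fake_read` are hypothetical names invented for illustration; this is not xarray's actual implementation, which handles N-dimensional and non-slice indexers.

```python
from __future__ import annotations

from typing import Callable, Sequence


class LazySlice1D:
    """Hypothetical 1-D lazy array: records slices, loads only on demand."""

    def __init__(
        self,
        loader: Callable[[range], Sequence[float]],
        length: int,
        index: range | None = None,
    ):
        self._loader = loader
        # A range composes slices for us: range(10)[2:8][::2] == range(2, 8, 2)
        self._index = range(length) if index is None else index

    @property
    def shape(self) -> tuple[int]:
        return (len(self._index),)

    def __getitem__(self, key: slice) -> LazySlice1D:
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        # No data is read here; only the index description is updated.
        return LazySlice1D(self._loader, 0, self._index[key])

    def compute(self) -> list[float]:
        # The loader is called exactly once, with the fully composed index.
        return list(self._loader(self._index))


calls = []


def fake_read(index: range) -> list[float]:
    """Stand-in for reading from disk; records what was requested."""
    calls.append(index)
    return [float(i) * 10 for i in index]


arr = LazySlice1D(fake_read, length=10)
view = arr[2:8][::2]      # still lazy, nothing read yet
values = view.compute()   # single read of elements 2, 4, 6
```

Note that the shape of `view` is known before any read happens, which is the property the lazy representation relies on.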

When you do virtualizarr.open_virtual_dataset then also:

4. a chunk-level metadata-only lazy representation of data on-disk is created (the "chunk Manifest" inside the ManifestArray), which also knows the shape and dtype.

In zarr-developers/zarr-specs#303 we've suggested that, instead of using various xarray backends, (1) and (2) could be handled by zarr + chunk manifests + CF-specific zarr codecs.

For (3), note that currently we have lazy indexing in Xarray but not lazy concatenation, and in VirtualiZarr we kind of have lazy chunk-level concatenation without lazy indexing.
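
As a rough illustration of what "lazy chunk-level concatenation" means, the sketch below joins two manifests by combining their chunk references, without touching any data bytes. `ChunkRef`, `Manifest1D` and `concat` are hypothetical names, and the 1-D tuple layout is an assumption made for brevity; VirtualiZarr's real ManifestArray is N-dimensional and structured differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Where one chunk's bytes live: file path, byte offset, byte length."""
    path: str
    offset: int
    length: int


@dataclass(frozen=True)
class Manifest1D:
    """Hypothetical 1-D chunk manifest with equally sized chunks."""
    chunk_len: int                  # number of elements per chunk
    chunks: tuple[ChunkRef, ...]    # ordered along the single axis

    @property
    def shape(self) -> tuple[int]:
        return (self.chunk_len * len(self.chunks),)


def concat(a: Manifest1D, b: Manifest1D) -> Manifest1D:
    """Concatenate in chunk space: only metadata is combined."""
    if a.chunk_len != b.chunk_len:
        raise ValueError("chunk sizes must match to concatenate in chunk space")
    return Manifest1D(a.chunk_len, a.chunks + b.chunks)


a = Manifest1D(2, (ChunkRef("a.nc", 100, 16), ChunkRef("a.nc", 116, 16)))
b = Manifest1D(2, (ChunkRef("b.nc", 100, 16), ChunkRef("b.nc", 116, 16)))
combined = concat(a, b)
```

The chunk-size check hints at why this kind of concatenation works naturally at chunk granularity but does not by itself give arbitrary lazy indexing.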

(4) is currently implemented separately from zarr-python in virtualizarr, but also notice that a virtualizarr.ManifestArray has all the information needed to actually go fetch data - in other words it could be converted directly to an actual zarr.Array (mentioned by @ayushnag in zarr-developers/VirtualiZarr#124).
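
To show why a manifest entry is enough to fetch data, here is a small sketch that reads one chunk from a byte range in an existing file. The file layout, the `read_chunk` helper and the dict-based manifest are all assumptions for illustration; a real implementation would also apply codecs and handle remote stores.

```python
import os
import struct
import tempfile


def read_chunk(path: str, offset: int, length: int, fmt: str = "<d") -> list:
    """Read `length` bytes at `offset` and decode them as little-endian float64."""
    with open(path, "rb") as f:
        f.seek(offset)
        raw = f.read(length)
    return [item[0] for item in struct.iter_unpack(fmt, raw)]


with tempfile.TemporaryDirectory() as tmp:
    # Fake "archival" file: an 8-byte header followed by four float64 values.
    path = os.path.join(tmp, "legacy.bin")
    with open(path, "wb") as f:
        f.write(b"HEADER--")
        f.write(struct.pack("<4d", 1.0, 2.0, 3.0, 4.0))

    # Hypothetical manifest: chunk key -> (path, byte offset, byte length).
    manifest = {"0": (path, 8, 16), "1": (path, 24, 16)}

    # Only the bytes of chunk "1" are read.
    second_chunk = read_chunk(*manifest["1"])
```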

Imagine that we enabled the zarr.Array type (or some new VirtualZarrArray type) to do both indexing and concatenation lazily (proposed in zarr-developers/zarr-python#1603), and open netCDF / other files via the chunk manifest (see zarr-developers/zarr-specs#287). It could also write out just its metadata to disk via the chunk manifest ZEP. This would then:

Basically replace the virtualizarr.ManifestArray,
Be wrapped by Xarray to provide both the "universal reader" of zarr-developers/zarr-specs#303 (Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs) and lazy slicing & concatenation operations (see #5081, "Lazy indexing arrays as a stand-alone package").

The result would be that xarray users would basically open data (netCDF or zarr or otherwise) and see VirtualZarrArrays wrapped by Xarray. They could then do lazy operations as they do now, and either load actual values via .compute or save only the lazy metadata representation to disk as a virtual zarr store (i.e. what virtualizarr does right now). The latter could be created by special serialization functions that understand how to translate a chain of lazy Zarr array operations into a valid metadata-only zarr-compliant format on-disk, or you could even imagine ds.to_zarr having a boolean virtual kwarg to cover both cases.

The lazy layer could either be implemented inside zarr or live on top of it and be importable from other packages (i.e. #5081, see also data-apis/array-api#777).

All together this would give you:

Zarr arrays that can open and decode netCDF directly (a la zarr-developers/zarr-specs#303)
Lazy Zarr arrays even without Xarray
Ability to save virtual datasets without needing a dedicated ManifestArray type (i.e. the lazy concatenation functionality of VirtualiZarr in zarr-python itself)
Separation of the metadata-reading logic of kerchunk/VirtualiZarr from the lazy concatenation stuff, so VirtualiZarr gets demoted to just being a repository for readers for specific file formats and codecs for them.
Complete separation of:

finding byte ranges from archival formats (VirtualiZarr / kerchunk readers for specific file formats),
reading bytes (zarr.Array),
decoding bytes following CF (new CF zarr codecs mentioned in zarr-developers/zarr-specs#303 and in "Expose a public interface for CF encoding/decoding functions", #155),
lazy operations (new lazy operations package),
handling of named variables / dimensions (Xarray),
serialization to metadata-only virtual Zarr store (ds.to_zarr(path, virtual=True) calling VirtualZarrArray).

The main subtlety I see here is selection in index-space vs chunk-space: xarray does the former but VirtualiZarr does the latter (see also zarr-developers/VirtualiZarr#183). This is what @d-v-b was getting at in zarr-developers/VirtualiZarr#71.
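
A small sketch of the index-space vs chunk-space distinction: an element-level slice can be expressed purely as a selection of whole chunks only when it lines up with chunk boundaries. The `to_chunk_slice` helper is hypothetical and 1-D only; it is meant to illustrate the constraint, not how either library implements selection.

```python
from __future__ import annotations


def to_chunk_slice(sel: slice, length: int, chunk_len: int) -> slice | None:
    """Translate an index-space slice to a chunk-space slice, if possible.

    Returns None when the selection would cut through a chunk, which is the
    case a manifest-only (chunk-space) representation cannot express.
    """
    start, stop, step = sel.indices(length)
    if step != 1:
        return None
    aligned_start = start % chunk_len == 0
    # The final chunk may be partial, so stopping at the array end is fine.
    aligned_stop = stop % chunk_len == 0 or stop == length
    if aligned_start and aligned_stop:
        # Ceiling division so a trailing partial chunk is included.
        return slice(start // chunk_len, -(-stop // chunk_len))
    return None


# Array of 10 elements stored as chunks of 4: [0:4], [4:8], [8:10]
aligned = to_chunk_slice(slice(4, 10), length=10, chunk_len=4)
misaligned = to_chunk_slice(slice(2, 6), length=10, chunk_len=4)
```

Here `aligned` selects whole chunks 1 and 2, while `misaligned` is `None` because the slice starts and stops mid-chunk; handling that case requires index-space laziness on top of the manifest.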

Whilst this is a longer-term roadmap idea, now is the time to think about it because of the malleability of zarr-python right now (e.g. zarr-developers/zarr-python#2052).

cc @dcherian @jhamman @joshmoore @sharkinsspatial @abarciauskas-bgse


TomNicholas · 2024-07-26T14:21:50Z

To go even further - I think you could imagine replacing xarray's entire backends layer once you had this.

Right now there are effectively two totally distinct ways to teach xarray how to open a new file format:

1. Write an xarray backend engine (that works by creating a subclass of xarray.BackendArray)
2. Add a kerchunk reader / virtualizarr reader (that works by reading metadata and determining byte ranges to data on-disk, leaving reading those data bytes for later)

I'm suggesting that with enough virtualization / lazy features upstreamed into zarr, xarray might be able to use (2) for everything.

Are there any types of files for which this approach would not work?

cc @kmuehlbauer @shoyer

TomNicholas · 2024-08-07T20:01:26Z

Thinking about this more, I have an alternative shorter-term suggestion that I think could get a lot of these benefits with no extra changes required to xarray or zarr-python beyond what is already planned - see zarr-developers/VirtualiZarr#124 (comment)

dcherian · 2024-08-07T20:29:05Z

+10 for engine="virtualizarr" that supports a range of virtual reference formats.

Your more general suggestion has the drawback that we'd have to scan a file before ever reading it if the references didn't exist, and nothing guarantees these references are up-to-date.

TomNicholas added the labels topic-backends, topic-internals, topic-indexing, enhancement, topic-zarr, topic-arrays, topic-lazy array, topic-chunked-arrays on Jul 26, 2024

TomNicholas mentioned this issue Jul 26, 2024

A basic default ChunkManager for arrays that report their own chunks #8733

Open

This was referenced Aug 1, 2024

Get xarray.testing.assert_identical to work on datasets containing ManifestArrays zarr-developers/VirtualiZarr#161

Closed

Loading data from ManifestArrays without saving references to disk first zarr-developers/VirtualiZarr#124

Open
