diff --git a/README.md b/README.md
index b3901f6..c481d54 100644
--- a/README.md
+++ b/README.md
@@ -8,11 +8,19 @@ _Please see the [documentation](https://virtualizarr.readthedocs.io/en/latest/)_
 
 ### Development Status and Roadmap
 
-VirtualiZarr is ready to use for many of the tasks that we are used to using kerchunk for, but the most general and powerful vision of this library can only be implemented once certain changes upstream in Zarr have occurred.
+VirtualiZarr version 1 (mostly) achieves [feature parity](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) with kerchunk's logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.
 
-VirtualiZarr is therefore evolving in tandem with developments in the Zarr Specification, which then need to be implemented in specific Zarr reader implementations (especially the Zarr-Python V3 implementation). There is an [overall roadmap for this integration with Zarr](https://hackmd.io/t9Myqt0HR7O0nq6wiHWCDA), whose final completion requires acceptance of at least two new Zarr Enhancement Proposals (the ["Chunk Manifest"](https://github.com/zarr-developers/zarr-specs/issues/287) and ["Virtual Concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) ZEPs).
+Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages.
 
-Whilst we wait for these upstream changes, in the meantime VirtualiZarr aims to provide utility in a significant subset of cases, for example by enabling writing virtualized zarr stores out to the existing kerchunk references format, so that they can be read by fsspec today.
+We have a lot of ideas, including:
+- [Zarr v3 support](https://github.com/zarr-developers/VirtualiZarr/issues/17)
+- [Zarr-native on-disk chunk manifest format](https://github.com/zarr-developers/zarr-specs/issues/287)
+- ["Virtual concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) of separate Zarr arrays
+- ManifestArrays as an [intermediate layer in-memory](https://github.com/zarr-developers/VirtualiZarr/issues/71) in Zarr-Python
+- [Separating CF-related Codecs from xarray](https://github.com/zarr-developers/VirtualiZarr/issues/68#issuecomment-2197682388)
+- [Generating references without kerchunk](https://github.com/zarr-developers/VirtualiZarr/issues/78)
+
+If you see other opportunities then we would love to hear your ideas!
 
 ### Credits
 
diff --git a/docs/faq.md b/docs/faq.md
index 3e45cb1..df4af74 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -53,8 +53,16 @@ The reasons why VirtualiZarr has been developed as separate project rather than
 
 ## What is the Development Status and Roadmap?
 
-VirtualiZarr is ready to use for many of the tasks that we are used to using kerchunk for, but the most general and powerful vision of this library can only be implemented once certain changes upstream in Zarr have occurred.
+VirtualiZarr version 1 (mostly) achieves [feature parity](#how-do-virtualizarr-and-kerchunk-compare) with kerchunk's logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.
 
-VirtualiZarr is therefore evolving in tandem with developments in the Zarr Specification, which then need to be implemented in specific Zarr reader implementations (especially the Zarr-Python V3 implementation). There is an [overall roadmap for this integration with Zarr](https://hackmd.io/t9Myqt0HR7O0nq6wiHWCDA), whose final completion requires acceptance of at least two new Zarr Enhancement Proposals (the ["Chunk Manifest"](https://github.com/zarr-developers/zarr-specs/issues/287) and ["Virtual Concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) ZEPs).
+Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages.
 
-Whilst we wait for these upstream changes, in the meantime VirtualiZarr aims to provide utility in a significant subset of cases, for example by enabling writing virtualized zarr stores out to the existing kerchunk references format, so that they can be read by fsspec today.
+We have a lot of ideas, including:
+- [Zarr v3 support](https://github.com/zarr-developers/VirtualiZarr/issues/17)
+- [Zarr-native on-disk chunk manifest format](https://github.com/zarr-developers/zarr-specs/issues/287)
+- ["Virtual concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) of separate Zarr arrays
+- ManifestArrays as an [intermediate layer in-memory](https://github.com/zarr-developers/VirtualiZarr/issues/71) in Zarr-Python
+- [Separating CF-related Codecs from xarray](https://github.com/zarr-developers/VirtualiZarr/issues/68#issuecomment-2197682388)
+- [Generating references without kerchunk](https://github.com/zarr-developers/VirtualiZarr/issues/78)
+
+If you see other opportunities then we would love to hear your ideas!
diff --git a/docs/index.md b/docs/index.md
index 98eaa40..d1beb29 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,16 +18,13 @@ VirtualiZarr aims to build on the excellent ideas of kerchunk whilst solving the
 
 ## Aim
 
-**NOTE: This package is in development. The usage examples in this section are currently aspirational.
-See the [Usage docs page](#usage) to see what API works today. Progress towards making all of these examples work is tracked in [issue #2](https://github.com/TomNicholas/VirtualiZarr/issues/2).**
-
-Let's say you have a bunch of legacy files (e.g. netCDF) which together tile to form a large dataset. Let's imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don't then read the [xarray docs page on combining data](https://docs.xarray.dev/en/stable/user-guide/combining.html).)
+Let's say you have a bunch of legacy files (e.g. netCDF) which together tile along a dimension to form a large dataset. Let's imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don't then read the [xarray docs page on combining data](https://docs.xarray.dev/en/stable/user-guide/combining.html).)
 
 ```python
 ds = xr.open_mfdataset(
     '/my/files*.nc',
     engine='h5netcdf',
-    combine='by_coords', # 'by_coords' requires reading coord data to determine concatenation order
+    combine='nested',
 )
 ds # the complete lazy xarray dataset
 ```
@@ -38,18 +38,20 @@ However, you don't want to run this set of xarray operations every time you open
 
 What's being cached here, you ask? We're effectively caching the result of performing all the various consistency checks that xarray performs when it combines newly-encountered datasets together. Once you have the new virtual Zarr store xarray is able to assume that this checking has already been done, and trusts your Zarr store enough to just open it instantly.
 
+### Usage
+
 Creating the virtual store looks very similar to how we normally open data with xarray:
 
 ```python
-import virtualizarr # required for the xarray backend and accessor to be present
+from virtualizarr import open_virtual_dataset
 
-virtual_ds = xr.open_mfdataset(
-    '/my/files*.nc',
-    engine='virtualizarr', # virtualizarr registers an xarray IO backend that returns ManifestArray objects
-    combine='by_coords', # 'by_coords' stills requires actually reading coordinate data
-)
+virtual_datasets = [
+    open_virtual_dataset(filepath)
+    for filepath in glob.glob('/my/files*.nc')
+]
 
-virtual_ds # now wraps a bunch of virtual ManifestArray objects directly
+# this Dataset wraps a bunch of virtual ManifestArray objects directly
+virtual_ds = xr.combine_nested(virtual_datasets, concat_dim=['time'])
 
 # cache the combined dataset pattern to disk, in this case using the existing kerchunk specification for reference files
 virtual_ds.virtualize.to_kerchunk('combined.json', format='json')
@@ -68,6 +67,8 @@ ds = xr.open_dataset(m, engine='kerchunk', chunks={}) # normal xarray.Dataset o
 
 No data has been loaded or copied in this process, we have merely created an on-disk lookup table that points xarray into the specific parts of the original netCDF files when it needs to read each chunk.
 
+See the [Usage docs page](#usage) for more details.
+
 ## Licence
 
 Apache 2.0
diff --git a/docs/releases.rst b/docs/releases.rst
index 6a9ae28..4ba7912 100644
--- a/docs/releases.rst
+++ b/docs/releases.rst
@@ -39,6 +39,7 @@ Bug fixes
 Documentation
 ~~~~~~~~~~~~~
 
+- Updated the development roadmap in preparation for v1.0. (:pull:`164`)
 - Warn if user passes `indexes=None` to `open_virtual_dataset` to indicate that this is not yet fully supported. (:pull:`170`)
   By `Tom Nicholas `_.
 - Clarify that virtual datasets cannot be treated like normal xarray datasets. (:issue:`173`)