Commit

Update roadmap for v1.0 (#164)
* update roadmap

* change example to one that's closer to working

* use combine_nested in the example to fully avoid false advertising

* versions->development

* release notes

* Streamline language about future work

Co-authored-by: Aimee Barciauskas <[email protected]>

* same streamlining in text on faq page

* clarify that virtualizarr v1 achieves feature parity with kerchunk's combining logic, not all of kerchunk

* links to feature comparison

---------

Co-authored-by: Aimee Barciauskas <[email protected]>
TomNicholas and abarciauskas-bgse committed Jul 2, 2024
1 parent 2822961 commit 91ebefe
Showing 4 changed files with 36 additions and 18 deletions.
14 changes: 11 additions & 3 deletions README.md
@@ -8,11 +8,19 @@ _Please see the [documentation](https://virtualizarr.readthedocs.io/en/latest/)_

### Development Status and Roadmap

VirtualiZarr is ready to use for many of the tasks that we are used to using kerchunk for, but the most general and powerful vision of this library can only be implemented once certain changes upstream in Zarr have occurred.
VirtualiZarr version 1 (mostly) achieves [feature parity](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) with kerchunk's logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.

VirtualiZarr is therefore evolving in tandem with developments in the Zarr Specification, which then need to be implemented in specific Zarr reader implementations (especially the Zarr-Python V3 implementation). There is an [overall roadmap for this integration with Zarr](https://hackmd.io/t9Myqt0HR7O0nq6wiHWCDA), whose final completion requires acceptance of at least two new Zarr Enhancement Proposals (the ["Chunk Manifest"](https://github.com/zarr-developers/zarr-specs/issues/287) and ["Virtual Concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) ZEPs).
Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages.

Whilst we wait for these upstream changes, in the meantime VirtualiZarr aims to provide utility in a significant subset of cases, for example by enabling writing virtualized zarr stores out to the existing kerchunk references format, so that they can be read by fsspec today.
We have a lot of ideas, including:
- [Zarr v3 support](https://github.com/zarr-developers/VirtualiZarr/issues/17)
- [Zarr-native on-disk chunk manifest format](https://github.com/zarr-developers/zarr-specs/issues/287)
- ["Virtual concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) of separate Zarr arrays
- ManifestArrays as an [intermediate layer in-memory](https://github.com/zarr-developers/VirtualiZarr/issues/71) in Zarr-Python
- [Separating CF-related Codecs from xarray](https://github.com/zarr-developers/VirtualiZarr/issues/68#issuecomment-2197682388)
- [Generating references without kerchunk](https://github.com/zarr-developers/VirtualiZarr/issues/78)

If you see other opportunities then we would love to hear your ideas!

### Credits

14 changes: 11 additions & 3 deletions docs/faq.md
@@ -53,8 +53,16 @@ The reasons why VirtualiZarr has been developed as separate project rather than

## What is the Development Status and Roadmap?

VirtualiZarr is ready to use for many of the tasks that we are used to using kerchunk for, but the most general and powerful vision of this library can only be implemented once certain changes upstream in Zarr have occurred.
VirtualiZarr version 1 (mostly) achieves [feature parity](#how-do-virtualizarr-and-kerchunk-compare) with kerchunk's logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.

VirtualiZarr is therefore evolving in tandem with developments in the Zarr Specification, which then need to be implemented in specific Zarr reader implementations (especially the Zarr-Python V3 implementation). There is an [overall roadmap for this integration with Zarr](https://hackmd.io/t9Myqt0HR7O0nq6wiHWCDA), whose final completion requires acceptance of at least two new Zarr Enhancement Proposals (the ["Chunk Manifest"](https://github.com/zarr-developers/zarr-specs/issues/287) and ["Virtual Concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) ZEPs).
Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages.

Whilst we wait for these upstream changes, in the meantime VirtualiZarr aims to provide utility in a significant subset of cases, for example by enabling writing virtualized zarr stores out to the existing kerchunk references format, so that they can be read by fsspec today.
We have a lot of ideas, including:
- [Zarr v3 support](https://github.com/zarr-developers/VirtualiZarr/issues/17)
- [Zarr-native on-disk chunk manifest format](https://github.com/zarr-developers/zarr-specs/issues/287)
- ["Virtual concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) of separate Zarr arrays
- ManifestArrays as an [intermediate layer in-memory](https://github.com/zarr-developers/VirtualiZarr/issues/71) in Zarr-Python
- [Separating CF-related Codecs from xarray](https://github.com/zarr-developers/VirtualiZarr/issues/68#issuecomment-2197682388)
- [Generating references without kerchunk](https://github.com/zarr-developers/VirtualiZarr/issues/78)

If you see other opportunities then we would love to hear your ideas!
25 changes: 13 additions & 12 deletions docs/index.md
@@ -18,16 +18,13 @@ VirtualiZarr aims to build on the excellent ideas of kerchunk whilst solving the

## Aim

**NOTE: This package is in development. The usage examples in this section are currently aspirational.
See the [Usage docs page](#usage) to see what API works today. Progress towards making all of these examples work is tracked in [issue #2](https://github.com/TomNicholas/VirtualiZarr/issues/2).**

Let's say you have a bunch of legacy files (e.g. netCDF) which together tile to form a large dataset. Let's imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don't then read the [xarray docs page on combining data](https://docs.xarray.dev/en/stable/user-guide/combining.html).)
Let's say you have a bunch of legacy files (e.g. netCDF) which together tile along a dimension to form a large dataset. Let's imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don't then read the [xarray docs page on combining data](https://docs.xarray.dev/en/stable/user-guide/combining.html).)

```python
ds = xr.open_mfdataset(
'/my/files*.nc',
engine='h5netcdf',
combine='by_coords', # 'by_coords' requires reading coord data to determine concatenation order
combine='nested',
)
ds # the complete lazy xarray dataset
```
@@ -38,18 +35,20 @@ However, you don't want to run this set of xarray operations every time you open

What's being cached here, you ask? We're effectively caching the result of performing all the various consistency checks that xarray performs when it combines newly-encountered datasets together. Once you have the new virtual Zarr store xarray is able to assume that this checking has already been done, and trusts your Zarr store enough to just open it instantly.

### Usage

Creating the virtual store looks very similar to how we normally open data with xarray:

```python
import virtualizarr # required for the xarray backend and accessor to be present
from virtualizarr import open_virtual_dataset

virtual_ds = xr.open_mfdataset(
'/my/files*.nc',
engine='virtualizarr', # virtualizarr registers an xarray IO backend that returns ManifestArray objects
combine='by_coords', # 'by_coords' stills requires actually reading coordinate data
)
virtual_datasets = [
open_virtual_dataset(filepath)
for filepath in glob.glob('/my/files*.nc')
]

virtual_ds # now wraps a bunch of virtual ManifestArray objects directly
# this Dataset wraps a bunch of virtual ManifestArray objects directly
virtual_ds = xr.combine_nested(virtual_datasets, concat_dim=['time'])

# cache the combined dataset pattern to disk, in this case using the existing kerchunk specification for reference files
virtual_ds.virtualize.to_kerchunk('combined.json', format='json')
@@ -68,6 +67,8 @@ ds = xr.open_dataset(m, engine='kerchunk', chunks={}) # normal xarray.Dataset o

No data has been loaded or copied in this process, we have merely created an on-disk lookup table that points xarray into the specific parts of the original netCDF files when it needs to read each chunk.

See the [Usage docs page](#usage) for more details.
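
The "on-disk lookup table" idea can be illustrated with a toy sketch that uses only the Python standard library. This is not VirtualiZarr's implementation or API: the helpers `build_manifest`, `concat_manifests`, and `read_chunk` are hypothetical names invented for this illustration, and the `[path, offset, length]` entry layout is only loosely modelled on kerchunk-style reference entries. The sketch assumes one-dimensional data split into fixed-size byte chunks.

```python
import json
import tempfile
from pathlib import Path


def build_manifest(path: Path, chunk_size: int) -> dict:
    """Map chunk keys "0", "1", ... to [path, offset, length] entries for one file."""
    size = path.stat().st_size
    return {
        str(i): [str(path), offset, min(chunk_size, size - offset)]
        for i, offset in enumerate(range(0, size, chunk_size))
    }


def concat_manifests(manifests: list[dict]) -> dict:
    """Concatenate 1D manifests by renumbering chunk keys; no file data is read."""
    combined = {}
    for manifest in manifests:
        start = len(combined)  # number of chunks already placed
        for key, entry in manifest.items():
            combined[str(start + int(key))] = entry
    return combined


def read_chunk(refs: dict, key: str) -> bytes:
    """Follow one reference entry back into the original file."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Two small "legacy" files standing in for the netCDF files (hypothetical data).
workdir = Path(tempfile.mkdtemp())
file_a = workdir / "a.bin"
file_b = workdir / "b.bin"
file_a.write_bytes(b"AAAABBBB")
file_b.write_bytes(b"CCCCDDDD")

# Combine the per-file manifests, then cache the combined pattern to disk.
combined = concat_manifests([build_manifest(p, chunk_size=4) for p in (file_a, file_b)])
refs_path = workdir / "combined.json"
refs_path.write_text(json.dumps(combined))

# Later: load the cached references and read one chunk from its source file.
refs = json.loads(refs_path.read_text())
third_chunk = read_chunk(refs, "2")
```

Only the small JSON file is written; the bytes stay in `a.bin` and `b.bin` and are fetched on demand, which is the property the paragraph above describes.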

## Licence

Apache 2.0
1 change: 1 addition & 0 deletions docs/releases.rst
@@ -39,6 +39,7 @@ Bug fixes
Documentation
~~~~~~~~~~~~~

- Updated the development roadmap in preparation for v1.0. (:pull:`164`)
- Warn if user passes `indexes=None` to `open_virtual_dataset` to indicate that this is not yet fully supported.
(:pull:`170`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Clarify that virtual datasets cannot be treated like normal xarray datasets. (:issue:`173`)
