
Commit

Merge branch 'guhidalgo/replaceinternalzarray' of https://github.com/ghidalgo3/VirtualiZarr into guhidalgo/replaceinternalzarray
ghidalgo3 committed Jul 10, 2024
2 parents 5fa1dea + c8c9020 commit cea2214
Showing 5 changed files with 90 additions and 19 deletions.
36 changes: 33 additions & 3 deletions docs/releases.rst
@@ -1,10 +1,37 @@
Release notes
=============

-.. _v0.2:
+.. _v1.0.1:

-v0.2 (unreleased)
------------------
+v1.0.1 (unreleased)
+-------------------

New Features
~~~~~~~~~~~~

Breaking changes
~~~~~~~~~~~~~~~~

Deprecations
~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Documentation
~~~~~~~~~~~~~

Internal Changes
~~~~~~~~~~~~~~~~

.. _v1.0.0:

v1.0.0 (9th July 2024)
----------------------

This release marks VirtualiZarr as mostly feature-complete, in the sense of achieving feature parity with kerchunk's logic for combining datasets, and providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.

Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages. See the roadmap in the documentation for details.

New Features
~~~~~~~~~~~~
@@ -39,7 +66,10 @@ Bug fixes
Documentation
~~~~~~~~~~~~~

+- Added example of using cftime_variables to usage docs. (:issue:`169`, :pull:`174`)
+  By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Updated the development roadmap in preparation for v1.0. (:pull:`164`)
  By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Warn if user passes `indexes=None` to `open_virtual_dataset` to indicate that this is not yet fully supported.
  (:pull:`170`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Clarify that virtual datasets cannot be treated like normal xarray datasets. (:issue:`173`)
37 changes: 34 additions & 3 deletions docs/usage.md
@@ -298,15 +298,15 @@ TODO: Use preprocess to create a new index from the metadata
Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:

```python
-vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'])
+vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'], indexes={})
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
-* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
+* time (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
@@ -316,13 +316,44 @@ Attributes:
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```
-You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects, and loadable variables backed by (lazy) numpy arrays.
+You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).

Loading variables can be useful in a few scenarios:
1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next (see the sketch after this list),
2. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
3. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.
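
As an illustration of the first scenario, a variable passed via `loadable_variables` behaves like any lazily-loaded xarray variable, so you can pull its values into memory and inspect them. A minimal sketch reusing the `vds` from above:

```python
# 'air' was included in loadable_variables, so its values can be read into
# memory directly, e.g. to sanity-check magnitudes before deciding what to do.
first_timestep = vds['air'].isel(time=0)
print(first_timestep.values.mean())
```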

### CF-encoded time variables

Notice that the `time` variable that was loaded above does not have the expected dtype. To correctly decode time variables according to the CF conventions (like `xr.open_dataset` does by default), you need to include them in an additional keyword argument `cftime_variables`:

```python
vds = open_virtual_dataset(
'air.nc',
loadable_variables=['air', 'time'],
cftime_variables=['time'],
indexes={},
)
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```

Now the loaded time variable has a `datetime64[ns]` dtype. Any variables listed as `cftime_variables` must also be listed as `loadable_variables`.
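
If a variable appears in `cftime_variables` but not in `loadable_variables`, `open_virtual_dataset` cannot decode it, since decoding requires the variable's values in memory. A sketch of the failure mode (hedged: the exact exception type and message may differ):

```python
# Invalid: 'time' is requested for CF decoding but was never loaded.
vds = open_virtual_dataset(
    'air.nc',
    loadable_variables=['air'],
    cftime_variables=['time'],
    indexes={},
)
# -> expected to raise an error to the effect of:
#    "cftime_variables must be a subset of loadable_variables"
```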

## Writing virtual stores to disk

Once we've combined references to all the chunks of all our legacy files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
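
A minimal sketch of that final step, using the kerchunk JSON reference format (the output filename here is illustrative):

```python
# Persist the in-memory chunk references as a kerchunk reference file,
# which analysis code can later open without touching the legacy files.
vds.virtualize.to_kerchunk('combined.json', format='json')
```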
4 changes: 2 additions & 2 deletions virtualizarr/kerchunk.py
@@ -166,8 +166,8 @@ def find_var_names(ds_reference_dict: KerchunkStoreRefs) -> list[str]:
"""Find the names of zarr variables in this store/group."""

refs = ds_reference_dict["refs"]
found_var_names = [key.split("/")[0] for key in refs.keys() if "/" in key]
return found_var_names
found_var_names = {key.split("/")[0] for key in refs.keys() if "/" in key}
return list(found_var_names)


def extract_array_refs(
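The switch from a list comprehension to a set comprehension above deduplicates variable names, which otherwise appear once per chunk key. A quick sketch of the behaviour (the refs keys here are hypothetical):

```python
refs = {"x/0.0": "...", "x/1.0": "...", "y/.zarray": "..."}

# Old behaviour: one entry per matching key, so "x" is repeated.
assert [key.split("/")[0] for key in refs if "/" in key] == ["x", "x", "y"]

# New behaviour: collect into a set first, so each name appears exactly once.
assert sorted({key.split("/")[0] for key in refs if "/" in key}) == ["x", "y"]
```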
20 changes: 10 additions & 10 deletions virtualizarr/manifests/manifest.py
@@ -71,8 +71,8 @@ class ChunkManifest:
"""

_paths: np.ndarray[Any, np.dtypes.StringDType] # type: ignore[name-defined]
_offsets: np.ndarray[Any, np.dtype[np.int32]]
_lengths: np.ndarray[Any, np.dtype[np.int32]]
_offsets: np.ndarray[Any, np.dtype[np.uint64]]
_lengths: np.ndarray[Any, np.dtype[np.uint64]]

    def __init__(self, entries: dict) -> None:
        """
@@ -100,8 +100,8 @@ def __init__(self, entries: dict) -> None:

        # Initializing to empty implies that entries with path='' are treated as missing chunks
        paths = np.empty(shape=shape, dtype=np.dtypes.StringDType())  # type: ignore[attr-defined]
-        offsets = np.empty(shape=shape, dtype=np.dtype("int32"))
-        lengths = np.empty(shape=shape, dtype=np.dtype("int32"))
+        offsets = np.empty(shape=shape, dtype=np.dtype("uint64"))
+        lengths = np.empty(shape=shape, dtype=np.dtype("uint64"))

        # populate the arrays
        for key, entry in entries.items():
@@ -128,8 +128,8 @@ def __init__(self, entries: dict) -> None:
    def from_arrays(
        cls,
        paths: np.ndarray[Any, np.dtype[np.dtypes.StringDType]],  # type: ignore[name-defined]
-        offsets: np.ndarray[Any, np.dtype[np.int32]],
-        lengths: np.ndarray[Any, np.dtype[np.int32]],
+        offsets: np.ndarray[Any, np.dtype[np.uint64]],
+        lengths: np.ndarray[Any, np.dtype[np.uint64]],
    ) -> "ChunkManifest":
        """
        Create manifest directly from numpy arrays containing the path and byte range information.
@@ -161,13 +161,13 @@ def from_arrays(
            raise ValueError(
                f"paths array must have a numpy variable-length string dtype, but got dtype {paths.dtype}"
            )
-        if offsets.dtype != np.dtype("int32"):
+        if offsets.dtype != np.dtype("uint64"):
            raise ValueError(
-                f"offsets array must have 32-bit integer dtype, but got dtype {offsets.dtype}"
+                f"offsets array must have 64-bit unsigned integer dtype, but got dtype {offsets.dtype}"
            )
-        if lengths.dtype != np.dtype("int32"):
+        if lengths.dtype != np.dtype("uint64"):
            raise ValueError(
-                f"lengths array must have 32-bit integer dtype, but got dtype {lengths.dtype}"
+                f"lengths array must have 64-bit unsigned integer dtype, but got dtype {lengths.dtype}"
            )

        # check shapes
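A sketch of what the widened dtype contract means for callers of `from_arrays` (the values are illustrative; `StringDType` assumes numpy >= 2.0; the motivation of supporting byte ranges beyond the `int32` limit is inferred rather than stated in the diff):

```python
import numpy as np

from virtualizarr.manifests import ChunkManifest

# Offsets and lengths must now be uint64. An offset like this one would have
# overflowed the previous int32 dtype (max 2_147_483_647).
paths = np.array(["s3://bucket/file.nc"], dtype=np.dtypes.StringDType())
offsets = np.array([4_000_000_000], dtype=np.uint64)
lengths = np.array([1_000_000], dtype=np.uint64)

manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)
```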
12 changes: 11 additions & 1 deletion virtualizarr/tests/test_kerchunk.py
@@ -5,7 +5,11 @@
import xarray as xr
import xarray.testing as xrt

-from virtualizarr.kerchunk import FileType, _automatically_determine_filetype
+from virtualizarr.kerchunk import (
+    FileType,
+    _automatically_determine_filetype,
+    find_var_names,
+)
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.xarray import dataset_from_kerchunk_refs

@@ -266,3 +270,9 @@ def test_FileType():
assert "zarr" == FileType("zarr").name
with pytest.raises(ValueError):
FileType(None)


def test_no_duplicates_find_var_names():
"""Verify that we get a deduplicated list of var names"""
ref_dict = {"refs": {"x/something": {}, "x/otherthing": {}}}
assert len(find_var_names(ref_dict)) == 1
