Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conformant ZarrV3 codecs and fill values #193

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
79 commits
Select commit Hold shift + click to select a range
6b7abe2
Generate chunk manifest backed variable from HDF5 dataset.
sharkinsspatial Apr 19, 2024
bca0aab
Transfer dataset attrs to variable.
sharkinsspatial Apr 19, 2024
384ff6b
Get virtual variables dict from HDF5 file.
sharkinsspatial Apr 19, 2024
4c5f9bd
Update virtual_vars_from_hdf to use fsspec and drop_variables arg.
sharkinsspatial Apr 22, 2024
1dd3370
mypy fix to use ChunkKey and empty dimensions list.
sharkinsspatial Apr 22, 2024
d92c75c
Extract attributes from hdf5 root group.
sharkinsspatial Apr 22, 2024
0ed8362
Use hdf reader for netcdf4 files.
sharkinsspatial Apr 22, 2024
f4485fa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 22, 2024
3cc1254
Merge branch 'main' into hdf5_reader
sharkinsspatial May 8, 2024
0123df7
Fix ruff complaints.
sharkinsspatial May 9, 2024
332bcaa
First steps for handling HDF5 filters.
sharkinsspatial May 10, 2024
c51e615
Initial step for hdf5plugin supported codecs.
sharkinsspatial May 13, 2024
0083f77
Small commit to check compression support in CI environment.
sharkinsspatial May 16, 2024
3c00071
Merge branch 'main' into hdf5_reader
sharkinsspatial May 18, 2024
207c4b5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 19, 2024
c573800
Fix mypy complaints for hdf_filters.
sharkinsspatial May 19, 2024
ef0d7a8
Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali…
sharkinsspatial May 19, 2024
588e06b
Local pre-commit fix for hdf_filters.
sharkinsspatial May 19, 2024
725333e
Use fsspec reader_options introduced in #37.
sharkinsspatial May 21, 2024
72df108
Fix incorrect zarr_v3 if block position from merge commit ef0d7a8.
sharkinsspatial May 21, 2024
d1e85cb
Fix early return from hdf _extract_attrs.
sharkinsspatial May 21, 2024
1e2b343
Test that _extract_attrs correctly handles multiple attributes.
sharkinsspatial May 21, 2024
7f1c189
Initial attempt at scale and offset via numcodecs.
sharkinsspatial May 22, 2024
908e332
Tests for cfcodec_from_dataset.
sharkinsspatial May 23, 2024
0df332d
Temporarily relax integration tests to assert_allclose.
sharkinsspatial May 24, 2024
ca6b236
Add blosc_lz4 fixture parameterization to confirm libnetcdf environment.
sharkinsspatial May 24, 2024
b7426c5
Check for compatability with netcdf4 engine.
sharkinsspatial May 24, 2024
dac21dd
Use separate fixtures for h5netcdf and netcdf4 compression styles.
sharkinsspatial May 27, 2024
e968772
Print libhdf5 and libnetcdf4 versions to confirm compiled environment.
sharkinsspatial May 27, 2024
9a98e57
Skip netcdf4 style compression tests when libhdf5 < 1.14.
sharkinsspatial May 27, 2024
7590b87
Include imagecodecs.numcodecs to support HDF5 lzf filters.
sharkinsspatial Jun 11, 2024
e9fbc8a
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 11, 2024
14bd709
Remove test that verifies call to read_kerchunk_references_from_file.
sharkinsspatial Jun 11, 2024
acdf0d7
Add additional codec support structures for imagecodecs and numcodecs.
sharkinsspatial Jun 12, 2024
4ba323a
Add codec config test for Zstd.
sharkinsspatial Jun 12, 2024
e14e53b
Include initial cf decoding tests.
sharkinsspatial Jun 21, 2024
b808ded
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 21, 2024
b052f8c
Revert typo for scale_factor retrieval.
sharkinsspatial Jun 21, 2024
01a3980
Update reader to use new numpy manifest representation.
sharkinsspatial Jun 21, 2024
c37d9e5
Temporarily skip test until blosc netcdf4 issue is solved.
sharkinsspatial Jun 22, 2024
17b30d4
Fix Pydantic 2 migration warnings.
sharkinsspatial Jun 22, 2024
f6b596a
Include hdf5plugin and imagecodecs-numcodecs in mamba test environment.
sharkinsspatial Jun 22, 2024
eb6e24d
Mamba attempt with imagecodecs rather than imagecodecs-numcodecs.
sharkinsspatial Jun 22, 2024
c85bd16
Mamba attempt with latest imagecodecs release.
sharkinsspatial Jun 22, 2024
ca435da
Use correct iter_chunks callback function signtature.
sharkinsspatial Jun 26, 2024
3017951
Include pip based imagecodecs-numcodecs until conda-forge availability.
sharkinsspatial Jun 26, 2024
ccf0b73
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 26, 2024
32ba135
Handle non-coordinate dims which are serialized to hdf as empty dataset.
sharkinsspatial Jun 27, 2024
64f446c
Use reader_options for filetype check and update failing kerchunk call.
sharkinsspatial Jun 27, 2024
1c590bb
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 27, 2024
9797346
Fix chunkmanifest shaping for chunked datasets.
sharkinsspatial Jun 30, 2024
c833e19
Handle scale_factor attribute serialization for compressed files.
sharkinsspatial Jun 30, 2024
701bcfa
Include chunked roundtrip fixture.
sharkinsspatial Jun 30, 2024
08c988e
Standardize xarray integration tests for hdf filters.
sharkinsspatial Jun 30, 2024
e6076bd
Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali…
sharkinsspatial Jun 30, 2024
d684a84
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 30, 2024
4cb4bac
Update reader selection logic for new filetype determination.
sharkinsspatial Jun 30, 2024
d352104
Use decode_times for integration test.
sharkinsspatial Jun 30, 2024
3d89ea4
Standardize fixture names for hdf5 vs netcdf4 file types.
sharkinsspatial Jun 30, 2024
c9dd0d9
Handle array add_offset property for compressed data.
sharkinsspatial Jul 1, 2024
db5b421
Include h5py shuffle filter.
sharkinsspatial Jul 1, 2024
9a1da32
Make ScaleAndOffset codec last in filters list.
sharkinsspatial Jul 1, 2024
9b2b0f8
Apply ScaleAndOffset codec to _FillValue since it's value is now down…
sharkinsspatial Jul 2, 2024
9ef1362
Coerce scale and add_offset values to native float for JSON serializa…
sharkinsspatial Jul 2, 2024
eb16bc1
Conformant ZarrV3 codecs
ghidalgo3 Jul 17, 2024
5f1b7f9
Update docs
ghidalgo3 Jul 17, 2024
519d45d
Update virtualizarr/zarr.py
ghidalgo3 Jul 18, 2024
76e9c8e
Update virtualizarr/zarr.py
ghidalgo3 Jul 18, 2024
000c520
Change default_fill to 0s
ghidalgo3 Jul 18, 2024
25d04b9
Merge branch 'guhidalgo/fixmetadatacodecs' of https://github.com/ghid…
ghidalgo3 Jul 18, 2024
c2e7279
Generate permutation
ghidalgo3 Jul 18, 2024
145960a
Pythonic isinstance check
ghidalgo3 Jul 18, 2024
c051f04
Add return type to isconfigurable
ghidalgo3 Jul 18, 2024
7a65fbd
Merge remote-tracking branch 'upstream/hdf5_reader' into codecs
Jul 18, 2024
7b09324
Changes from pair programming for zarrv3 to kerchunk file reading
Jul 19, 2024
2c59256
Revert "Merge remote-tracking branch 'upstream/hdf5_reader' into codecs"
Jul 19, 2024
50c3dcd
Fix unit tests
ghidalgo3 Jul 19, 2024
ab97e63
PR comments
ghidalgo3 Jul 22, 2024
0be0728
Remove kwarg in dict default
ghidalgo3 Jul 22, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions docs/releases.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,9 @@ New Features
Breaking changes
~~~~~~~~~~~~~~~~

- Serialize valid ZarrV3 metadata and require full compressor numcodec config (for :pull:`193`)
By `Gustavo Hidalgo <https://github.com/ghidalgo3>`_.

Deprecations
~~~~~~~~~~~~

Expand Down
2 changes: 1 addition & 1 deletion virtualizarr/kerchunk.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,7 +266,7 @@ def variable_to_kerchunk_arr_refs(var: xr.Variable, var_name: str) -> KerchunkAr
for chunk_key, entry in marr.manifest.dict().items()
}

zarray = marr.zarray
zarray = marr.zarray.replace(zarr_format=2)

else:
try:
Expand Down
4 changes: 2 additions & 2 deletions virtualizarr/tests/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,9 +48,9 @@ def create_manifestarray(

zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "blosc", "clevel": 5, "cname": "lz4", "shuffle": 1},
dtype=np.dtype("float32"),
fill_value=0.0, # TODO change this to NaN?
fill_value=0.0,
filters=None,
order="C",
shape=shape,
Expand Down
2 changes: 1 addition & 1 deletion virtualizarr/tests/test_integration.py
Original file line number Diff line number Diff line change
Expand Up @@ -138,7 +138,7 @@ def test_non_dimension_coordinates(self, tmpdir, format):
# regression test for GH issue #105

# set up example xarray dataset containing non-dimension coordinate variables
ds = xr.Dataset(coords={"lat": (["x", "y"], np.arange(6).reshape(2, 3))})
ds = xr.Dataset(coords={"lat": (["x", "y"], np.arange(6.0).reshape(2, 3))})

# save it to disk as netCDF (in temporary directory)
ds.to_netcdf(f"{tmpdir}/non_dim_coords.nc")
Expand Down
12 changes: 6 additions & 6 deletions virtualizarr/tests/test_manifests/test_array.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ def test_create_manifestarray(self):
shape = (5, 2, 20)
zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand Down Expand Up @@ -74,7 +74,7 @@ def test_equals(self):
shape = (5, 2, 20)
zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand All @@ -95,7 +95,7 @@ def test_not_equal_chunk_entries(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(5, 1, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand Down Expand Up @@ -209,7 +209,7 @@ def test_concat(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(5, 1, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand Down Expand Up @@ -254,7 +254,7 @@ def test_stack(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(5, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand Down Expand Up @@ -299,7 +299,7 @@ def test_refuse_combine():

zarray_common = {
"chunks": (5, 1, 10),
"compressor": "zlib",
"compressor": {"id": "zlib", "level": 1},
"dtype": np.dtype("int32"),
"fill_value": 0.0,
"filters": None,
Expand Down
10 changes: 5 additions & 5 deletions virtualizarr/tests/test_xarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ def test_wrapping():
dtype = np.dtype("int32")
zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=dtype,
fill_value=0.0,
filters=None,
Expand Down Expand Up @@ -49,7 +49,7 @@ def test_equals(self):
shape = (5, 20)
zarray = ZArray(
chunks=chunks,
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand Down Expand Up @@ -86,7 +86,7 @@ def test_concat_along_existing_dim(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(1, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand Down Expand Up @@ -133,7 +133,7 @@ def test_concat_along_new_dim(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(5, 10),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand Down Expand Up @@ -183,7 +183,7 @@ def test_concat_dim_coords_along_existing_dim(self):
# both manifest arrays in this example have the same zarray properties
zarray = ZArray(
chunks=(10,),
compressor="zlib",
compressor={"id": "zlib", "level": 1},
dtype=np.dtype("int32"),
fill_value=0.0,
filters=None,
Expand Down
60 changes: 54 additions & 6 deletions virtualizarr/tests/test_zarr.py
Original file line number Diff line number Diff line change
@@ -1,12 +1,17 @@
import json

import numpy as np
import pytest
import xarray as xr
import xarray.testing as xrt

from virtualizarr import ManifestArray, open_virtual_dataset
from virtualizarr.manifests.manifest import ChunkManifest
from virtualizarr.zarr import dataset_to_zarr, metadata_from_zarr_json


def test_zarr_v3_roundtrip(tmpdir):
@pytest.fixture
def vds_with_manifest_arrays() -> xr.Dataset:
arr = ManifestArray(
chunkmanifest=ChunkManifest(
entries={"0.0": dict(path="test.nc", offset=6144, length=48)}
Expand All @@ -15,18 +20,61 @@ def test_zarr_v3_roundtrip(tmpdir):
shape=(2, 3),
dtype=np.dtype("<i8"),
chunks=(2, 3),
compressor=None,
compressor={"id": "zlib", "level": 1},
filters=None,
fill_value=np.nan,
fill_value=0,
order="C",
zarr_format=3,
),
)
original = xr.Dataset({"a": (["x", "y"], arr)}, attrs={"something": 0})
return xr.Dataset({"a": (["x", "y"], arr)}, attrs={"something": 0})


def isconfigurable(value: dict) -> bool:
"""
Several metadata attributes in ZarrV3 use a dictionary with keys "name" : str and "configuration" : dict
"""
return "name" in value and "configuration" in value

original.virtualize.to_zarr(tmpdir / "store.zarr")

def test_zarr_v3_roundtrip(tmpdir, vds_with_manifest_arrays: xr.Dataset):
vds_with_manifest_arrays.virtualize.to_zarr(tmpdir / "store.zarr")
roundtrip = open_virtual_dataset(
tmpdir / "store.zarr", filetype="zarr_v3", indexes={}
)

xrt.assert_identical(roundtrip, original)
xrt.assert_identical(roundtrip, vds_with_manifest_arrays)


def test_metadata_roundtrip(tmpdir, vds_with_manifest_arrays: xr.Dataset):
dataset_to_zarr(vds_with_manifest_arrays, tmpdir / "store.zarr")
zarray, _, _ = metadata_from_zarr_json(tmpdir / "store.zarr/a/zarr.json")
assert zarray == vds_with_manifest_arrays.a.data.zarray


def test_zarr_v3_metadata_conformance(tmpdir, vds_with_manifest_arrays: xr.Dataset):
"""
Checks that the output metadata of an array variable conforms to this spec
for the required attributes:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#metadata
"""
dataset_to_zarr(vds_with_manifest_arrays, tmpdir / "store.zarr")
# read the a variable's metadata
with open(tmpdir / "store.zarr/a/zarr.json", mode="r") as f:
metadata = json.loads(f.read())
assert metadata["zarr_format"] == 3
assert metadata["node_type"] == "array"
assert isinstance(metadata["shape"], list) and all(
isinstance(dim, int) for dim in metadata["shape"]
)
assert isinstance(metadata["data_type"], str) or isconfigurable(
metadata["data_type"]
)
assert isconfigurable(metadata["chunk_grid"])
assert isconfigurable(metadata["chunk_key_encoding"])
assert isinstance(metadata["fill_value"], (bool, int, float, str, list))
assert (
isinstance(metadata["codecs"], list)
and len(metadata["codecs"]) > 1
and all(isconfigurable(codec) for codec in metadata["codecs"])
)
Loading
Loading