
Handle scalar dataset variables #205

Merged: 4 commits into zarr-developers:main on Aug 2, 2024

Conversation

@ghidalgo3 (Contributor) commented Jul 30, 2024

Some real-world NetCDF files have scalar variables that kerchunk parses to a ChunkDict of {}. Previously, these variables had to be explicitly loaded or dropped to make an open_virtual_dataset call work. With this change, each such variable is loaded into an xr.Variable of dimension 0 and its attributes are passed up to the xarray Dataset.
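A minimal sketch of the intended behavior (illustrative only; the file name is a placeholder and the `crs` variable name comes from the repro later in this thread):

import virtualizarr

# Hypothetical NetCDF file with a dimensionless `crs` variable that only
# carries attributes (kerchunk reports an empty ChunkDict for it).
vds = virtualizarr.open_virtual_dataset("example.nc")

crs = vds["crs"]
assert crs.dims == ()  # loaded as a 0-dimensional variable
print(crs.attrs)       # attributes are passed up to the xarray Dataset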

@ghidalgo3 marked this pull request as ready for review on July 30, 2024 at 00:56
@@ -419,6 +419,32 @@ def test_open_virtual_dataset_passes_expected_args(
}
mock_read_kerchunk.assert_called_once_with(**args)

@patch("virtualizarr.kerchunk.parse_array_refs")
Collaborator:

Wait, what exactly is this mocking doing?

@ghidalgo3 (Contributor Author):

I couldn't get a repro of the issue by just creating an xarray dataset with a 0-dim variable, so I had to debug the problem I was getting on a real NetCDF file and mock out the return value of parse_array_refs to match what I was seeing. Here's a repro if you'd like to give it a shot:

from urllib.parse import urlparse
import xarray as xr
import planetary_computer
import virtualizarr

def _href_to_fsspec(href: str) -> str:
    url = urlparse(href)
    return f"az://{url.path.lstrip('/')}"

def open_planetary_computer_netcdf(href):
    account_name = href.split("/")[2].split(".")[0]
    container_name = href.split("/")[3]
    fs = planetary_computer.get_adlfs_filesystem(account_name, container_name)
    return fs.open(_href_to_fsspec(href))

def _get_storage_options(href: str, use_sas_as_credential: bool = False) -> dict:
    url = urlparse(href)
    if use_sas_as_credential:
        return {
            "account_name": url.hostname.split(".")[0],
            "account_key": url.query,
            "anon": False,
        }
    else:
        return {
            "account_name": url.hostname.split(".")[0],
            "anon": False,
        }


href = "https://landcoverdata.blob.core.windows.net/esa-cci-lc/netcdf/C3S-LC-L4-LCCS-Map-300m-P1Y-2016-v2.1.1.nc"
# Next line works, you can inspect the `crs` variable
ds = xr.open_dataset(open_planetary_computer_netcdf(href))
# The next line raises a `ValueError` from unpacking an empty list while handling the `crs` variable's kerchunk attrs
vds = virtualizarr.open_virtual_dataset(
    _href_to_fsspec(href),
    reader_options={
        "storage_options": _get_storage_options(planetary_computer.sign(href), use_sas_as_credential=True)
    })

Collaborator:

@ghidalgo3 I just met up with some folks from the HDF group at a conference last week so I can ask them the proper incantation to use so we can create an hdf fixture with a scalar 👍

Collaborator:

Thank you for the explanation! I'm happy to merge this as-is or wait if you wanted to test it some other way.

@sharkinsspatial (Collaborator) commented Aug 2, 2024

@ghidalgo3 Here is an HDF5 fixture you could use for direct testing of this behavior without mocking.

import h5py
import pytest

@pytest.fixture
def hdf5_empty(tmpdir):
    filepath = f"{tmpdir}/empty.nc"
    f = h5py.File(filepath, "w")
    f.create_dataset("empty", shape=(), dtype="float32")  # shape (), no data written
    f.close()
    return filepath

Note that I used "empty" terminology here rather than "scalar".
I'm feeling a little under the weather so my thinking might not be totally clear 😆, but in the context of this PR and the original issue we may want to be pedantic and differentiate these two concepts. An empty or "dummy" HDF dataset that is used only to store attribute information is different from a scalar HDF dataset that actually stores a single non-array value, like

@pytest.fixture
def hdf5_scalar(tmpdir):
    filepath = f"{tmpdir}/scalar.nc"
    f = h5py.File(filepath, "w")
    f.create_dataset("scalar", data=0.1, dtype="float32")  # shape (), one stored value
    f.close()
    return filepath

The scalar dataset still has shape == (), but its data property would be a valid ManifestArray.
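For illustration only (not part of the PR), the difference is easy to see with h5py, assuming the two fixture files above have been written out:

import h5py

with h5py.File("empty.nc", "r") as f:
    print(f["empty"].shape)   # () -- created with shape=(), no value ever written
with h5py.File("scalar.nc", "r") as f:
    print(f["scalar"].shape)  # () -- also dimensionless
    print(f["scalar"][()])    # 0.1, an actual stored value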

@ghidalgo3 (Contributor Author) commented Aug 2, 2024

Thanks @sharkinsspatial for the fixtures! I added test cases for both without that ugly mock. @TomNicholas if it looks good to you, it's ready to merge.

UPDATE: in case you see this in the next 30 minutes, let me just check that the data value is preserved for the scalar case.
Hmm, this broke writing the array metadata to disk; let me look at it tomorrow with fresh eyes.

@ghidalgo3 (Contributor Author):

I'm running into the problem described by #62, but I don't think I want to solve it in this PR.

You can open files with scalar variables and empty variables, but if you call vds.virtualize.to_zarr("store.zarr") it will error because one of the variables is a numpy array and not a ManifestArray. That's the behavior today anyway, and it should be resolved once #62 is implemented. If you truly must write the vds to disk, just skip loading those variables with drop_variables.
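A hedged sketch of that workaround, assuming the ESA CCI file from the repro above, where `crs` is the variable that gets loaded as a numpy array:

import virtualizarr

vds = virtualizarr.open_virtual_dataset(
    "C3S-LC-L4-LCCS-Map-300m-P1Y-2016-v2.1.1.nc",
    drop_variables=["crs"],  # skip the variable that is not a ManifestArray
)
vds.virtualize.to_zarr("store.zarr")  # now every variable is a ManifestArray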

Given that, this PR is good to merge.

@@ -1,3 +1,4 @@
import h5py
Collaborator:

I think h5py should be added to the test requirements?

@ghidalgo3 (Contributor Author):

Yes, good catch. Is there a way to construct the CI conda install from the test extras plus additional conda packages?

Collaborator:

Probably, but let's do that in a follow-up PR.

@TomNicholas merged commit b34a1ee into zarr-developers:main on Aug 2, 2024
7 checks passed

Successfully merging this pull request may close these issues.

NetCDF files with many empty variables require large drop_variables declaration