
Handle scalar dataset variables #205

Merged: 4 commits into zarr-developers:main on Aug 2, 2024

Conversation

@ghidalgo3 (Contributor) commented Jul 30, 2024

Some real-world NetCDF files have scalar variables that kerchunk parses to a ChunkDict of {}. Previously, these variables had to be explicitly loaded or dropped to make an open_virtual_dataset call work. With this change, each such variable is loaded into an xr.Variable of dimension 0 and its attributes are passed up to the xarray Dataset.
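A minimal sketch of the intended behavior (illustrative only; the file name is a placeholder and the `crs` variable name comes from the repro later in this thread):

import virtualizarr

# Hypothetical NetCDF file with a dimensionless `crs` variable that only
# carries attributes (kerchunk reports an empty ChunkDict for it).
vds = virtualizarr.open_virtual_dataset("example.nc")

crs = vds["crs"]
assert crs.dims == ()  # loaded as a 0-dimensional variable
print(crs.attrs)       # attributes are passed up to the xarray Dataset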

@ghidalgo3 marked this pull request as ready for review on July 30, 2024 at 00:56
@@ -419,6 +419,32 @@ def test_open_virtual_dataset_passes_expected_args(
}
mock_read_kerchunk.assert_called_once_with(**args)

@patch("virtualizarr.kerchunk.parse_array_refs")
Collaborator:

Wait, what exactly is this mocking doing?

@ghidalgo3 (Contributor Author):

I couldn't get a repro of the issue by just creating an xarray dataset with a 0-dim variable, so I had to debug the problem I was getting on a real NetCDF file and mock out the return value of parse_array_refs to match what I was seeing. Here's a repro if you'd like to give it a shot:

from urllib.parse import urlparse
import xarray as xr
import planetary_computer
import virtualizarr

def _href_to_fsspec(href: str) -> str:
    url = urlparse(href)
    return f"az://{url.path.lstrip('/')}"

def open_planetary_computer_netcdf(href):
    account_name = href.split("/")[2].split(".")[0]
    container_name = href.split("/")[3]
    fs = planetary_computer.get_adlfs_filesystem(account_name, container_name)
    return fs.open(_href_to_fsspec(href))

def _get_storage_options(href: str, use_sas_as_credential: bool = False) -> dict:
    url = urlparse(href)
    if use_sas_as_credential:
        return {
            "account_name": url.hostname.split(".")[0],
            "account_key": url.query,
            "anon": False,
        }
    else:
        return {
            "account_name": url.hostname.split(".")[0],
            "anon": False,
        }


href = "https://landcoverdata.blob.core.windows.net/esa-cci-lc/netcdf/C3S-LC-L4-LCCS-Map-300m-P1Y-2016-v2.1.1.nc"
# Next line works, you can inspect the `crs` variable
ds = xr.open_dataset(open_planetary_computer_netcdf(href))
# The next line raises a `ValueError` from unpacking an empty list while handling the `crs` variable's kerchunk attrs
vds = virtualizarr.open_virtual_dataset(
    _href_to_fsspec(href),
    reader_options={
        "storage_options": _get_storage_options(planetary_computer.sign(href), use_sas_as_credential=True)
    })

Collaborator:

@ghidalgo3 I just met up with some folks from the HDF group at a conference last week so I can ask them the proper incantation to use so we can create an hdf fixture with a scalar 👍

Collaborator:

Thank you for the explanation! I'm happy to merge this as-is or wait if you wanted to test it some other way.

@sharkinsspatial (Collaborator) commented Aug 2, 2024

@ghidalgo3 Here is an HDF5 fixture you could use for direct testing of this behavior without mocking.

import h5py
import pytest

@pytest.fixture
def hdf5_empty(tmpdir):
    filepath = f"{tmpdir}/empty.nc"
    f = h5py.File(filepath, "w")
    f.create_dataset("empty", shape=(), dtype="float32")  # shape (), no data written
    f.close()
    return filepath

Note that I used "empty" terminology here rather than "scalar".
I'm feeling a little under the weather so my thinking might not be totally clear 😆, but in the context of this PR and the original issue we may want to be pedantic and differentiate these two concepts. An empty or "dummy" HDF dataset that is used only to store attribute information is different from a scalar HDF dataset that actually stores a single non-array value, like

@pytest.fixture
def hdf5_scalar(tmpdir):
    filepath = f"{tmpdir}/scalar.nc"
    f = h5py.File(filepath, "w")
    f.create_dataset("scalar", data=0.1, dtype="float32")  # shape (), one stored value
    f.close()
    return filepath

The scalar dataset still has shape == (), but its data property would be a valid ManifestArray.
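For illustration only (not part of the PR), the difference is easy to see with h5py, assuming the two fixture files above have been written out:

import h5py

with h5py.File("empty.nc", "r") as f:
    print(f["empty"].shape)   # () -- created with shape=(), no value ever written
with h5py.File("scalar.nc", "r") as f:
    print(f["scalar"].shape)  # () -- also dimensionless
    print(f["scalar"][()])    # 0.1, an actual stored value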

@ghidalgo3 (Contributor Author) commented Aug 2, 2024

Thanks @sharkinsspatial for the fixtures! I added test cases for both without that ugly mock. @TomNicholas if it looks good to you, it's ready to merge.

UPDATE: in case you see this in the next 30 minutes, let me just check that the data value is preserved for the scalar case.
Hmm, this broke writing the array metadata to disk; let me look at it tomorrow with fresh eyes.

@ghidalgo3 (Contributor Author):

I'm running into the problem described by #62, but I don't think I want to solve it in this PR.

You can open files with scalar variables and empty variables, but if you call vds.virtualize.to_zarr("store.zarr") it will error because one of the variables is a numpy array and not a ManifestArray. That's the behavior today anyway, and it should be resolved once #62 is implemented. If you truly must write the vds to disk, just skip loading those variables with drop_variables.
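A hedged sketch of that workaround, assuming the ESA CCI file from the repro above, where `crs` is the variable that gets loaded as a numpy array:

import virtualizarr

vds = virtualizarr.open_virtual_dataset(
    "C3S-LC-L4-LCCS-Map-300m-P1Y-2016-v2.1.1.nc",
    drop_variables=["crs"],  # skip the variable that is not a ManifestArray
)
vds.virtualize.to_zarr("store.zarr")  # now every variable is a ManifestArray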

Given that, this PR is good to merge.

@@ -1,3 +1,4 @@
import h5py
Collaborator:

I think h5py should be added to the test requirements?

@ghidalgo3 (Contributor Author):

Yes, good catch. Is there a way to construct the CI conda install from the test extras plus additional conda packages?

Collaborator:

Probably, but let's do that in a follow-up PR.

@TomNicholas merged commit b34a1ee into zarr-developers:main on Aug 2, 2024
7 checks passed

Successfully merging this pull request may close these issues.

NetCDF files with many empty variables require large drop_variables declaration