Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect error raised for chunked Zarr region write #9557

Closed
5 tasks done
shoyer opened this issue Sep 30, 2024 · 3 comments
Closed
5 tasks done

Incorrect error raised for chunked Zarr region write #9557

shoyer opened this issue Sep 30, 2024 · 3 comments
Labels
bug topic-zarr Related to zarr storage library

Comments

@shoyer
Copy link
Member

shoyer commented Sep 30, 2024

What happened?

Writing a chunk with to_zarr() on a specific region incorrectly fails if a variable is chunked with Dask, even if the variable's chunks are compatible with the Zarr store.

What did you expect to happen?

This code path is used by Xarray-Beam. In particular, this test in Xarray-Beam fails with the latest development version of Xarray.

Minimal Complete Verifiable Example

import xarray
import numpy as np

data = np.random.RandomState(0).randn(2920, 25, 53)
ds = xarray.Dataset({'temperature': (('time', 'lat', 'lon'), data)})
chunks = {'time': 1000, 'lat': 25, 'lon': 53}

store = 'testing.zarr'
ds.chunk(chunks).to_zarr(store, compute=False)

region = {'time': slice(1000, 2000, 1)}
chunk = ds.isel(region)
chunk = chunk.chunk()  # triggers error
chunk.chunk().to_zarr(store, region=region)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

ValueError                                Traceback (most recent call last)
<ipython-input-18-ce32e8ce49e1> in <cell line: 1>()
----> 1 chunk.chunk().to_zarr(store, region=region)

/usr/local/lib/python3.10/dist-packages/xarray/core/dataset.py in to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version, write_empty_chunks, chunkmanager_store_kwargs)
   2570         from xarray.backends.api import to_zarr
   2571 
-> 2572         return to_zarr(  # type: ignore[call-overload,misc]
   2573             self,
   2574             store=store,

/usr/local/lib/python3.10/dist-packages/xarray/backends/api.py in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options, zarr_version, write_empty_chunks, chunkmanager_store_kwargs)
   1782     writer = ArrayWriter()
   1783     # TODO: figure out how to properly handle unlimited_dims
-> 1784     dump_to_store(dataset, zstore, writer, encoding=encoding)
   1785     writes = writer.sync(
   1786         compute=compute, chunkmanager_store_kwargs=chunkmanager_store_kwargs

/usr/local/lib/python3.10/dist-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1465         variables, attrs = encoder(variables, attrs)
   1466 
-> 1467     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1468 
   1469 

/usr/local/lib/python3.10/dist-packages/xarray/backends/zarr.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    784             variables_to_set = variables_encoded
    785 
--> 786         self.set_variables(
    787             variables_to_set, check_encoding_set, writer, unlimited_dims=unlimited_dims
    788         )

/usr/local/lib/python3.10/dist-packages/xarray/backends/zarr.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    882             # Note: Ideally there should be two functions, one for validating the chunks and
    883             # another one for extracting the encoding.
--> 884             encoding = extract_zarr_variable_encoding(
    885                 v,
    886                 raise_on_invalid=vn in check_encoding_set,

/usr/local/lib/python3.10/dist-packages/xarray/backends/zarr.py in extract_zarr_variable_encoding(variable, raise_on_invalid, name, safe_chunks, region, mode, shape)
    341                 del encoding[k]
    342 
--> 343     chunks = _determine_zarr_chunks(
    344         enc_chunks=encoding.get("chunks"),
    345         var_chunks=variable.chunks,

/usr/local/lib/python3.10/dist-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name, safe_chunks, region, mode, shape)
    243                     # is equal to the size of the last chunk
    244                     if dchunks[-1] % zchunk != size % zchunk:
--> 245                         raise ValueError(base_error)
    246                 elif dchunks[-1] % zchunk:
    247                     raise ValueError(base_error)

ValueError: Specified zarr chunks encoding['chunks']=(1000, 25, 53) for variable named 'temperature' would overlap multiple dask chunks ((1000,), (25,), (53,)) on the region (slice(1000, 2000, 1), slice(None, None, None), slice(None, None, None)). Writing this array in parallel with dask could lead to corrupted data.Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

Anything else we need to know?

These error messages were introduced by #9527

Environment

@shoyer shoyer added bug needs triage Issue that has not been reviewed by xarray team member labels Sep 30, 2024
@shoyer shoyer added topic-zarr Related to zarr storage library and removed needs triage Issue that has not been reviewed by xarray team member labels Sep 30, 2024
@max-sixty
Copy link
Collaborator

Right, sorry, I guess the logic for checking chunks isn't quite right. Thanks a lot for spotting.

This passes if we're only writing to the first chunk rather than the second:

import xarray
import numpy as np

data = np.random.RandomState(0).randn(2920, 25, 53)
ds = xarray.Dataset({'temperature': (('time', 'lat', 'lon'), data)})
chunks = {'time': 1000, 'lat': 25, 'lon': 53}

store = 'testing.zarr'
ds.chunk(chunks).to_zarr(store, compute=False)

# region = {'time': slice(1000, 2000, 1)}
region = {'time': slice(0, 1000, 1)}
chunk = ds.isel(region)
chunk = chunk.chunk()  # triggers error
chunk.chunk().to_zarr(store, region=region)

...so somehow there's a small logic error there

@josephnowak
Copy link
Contributor

Hi, I will take a look at this case, as a workaround you can use the mode = "a", but I think that the problem is that the algorithm is detecting that the region is covering the last chunk which is incorrect for this case I think.

@josephnowak
Copy link
Contributor

Indeed, it was an error in the validation of the last chunk. I'm sorry. I sent a new PR to fix the bug #9559

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug topic-zarr Related to zarr storage library
Projects
None yet
Development

No branches or pull requests

3 participants