Slow performance when open zarr file with numpy>2.0.0 #9545

renaudjester · 2024-09-25T10:03:15Z

What happened?

Hi!

I want to open a zarr dataset lazily.
On my computer:
With numpy==1.26.4 it takes around 1.5sec
With numpy==2.1.1 it takes around 5sec

It's also slow on an Ubuntu machine.

Unfortunately, I don't have the time to dig into the issue and pinpoint exactly which piece of code takes so much longer than before. From the little testing I did, it doesn't seem to come from the HTTP calls.

What did you expect to happen?

I expect the time to lazily open the dataset to be the same regardless of the numpy version.

Minimal Complete Verifiable Example

import xarray
import time

top = time.time()
dataset = xarray.open_dataset(
    "https://s3.waw3-1.cloudferro.com/mdl-arco-time-035/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-h_202012/timeChunked.zarr",
    engine="zarr",
)
print(f"Took: {time.time() - top}s")
# with numpy==1.26.4: ~1s
# with numpy==2.1.1: ~5s

MVCE confirmation

Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.
Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

# numpy==2.1.1
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.3 (main, Sep 23 2024, 17:37:36) [Clang 15.0.0 (clang-1500.3.9.4)]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.1
scipy: None
netCDF4: 1.7.1.post2
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.9.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.0
conda: None
pytest: 8.3.3
mypy: None
IPython: 8.27.0
sphinx: None
None

# numpy==1.26.4
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.3 (main, Sep 23 2024, 17:37:36) [Clang 15.0.0 (clang-1500.3.9.4)]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 1.26.4
scipy: None
netCDF4: 1.7.1.post2
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.9.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.0
conda: None
pytest: 8.3.3
mypy: None
IPython: 8.27.0
sphinx: None
None

keewis · 2024-09-25T10:31:43Z

This looks like an issue with time decoding: with decode_times=False, both numpy<2 and numpy>=2 open the dataset in about 0.5 seconds for me.
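A comparison like the one described here can be reproduced with a small timing helper. The sketch below uses only the standard library; the helper name `timed` is made up for illustration, and the xarray call is left commented out because it needs xarray, zarr, and network access. `time.perf_counter` is used instead of `time.time` because it is a monotonic clock meant for measuring intervals.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage sketch (commented out: requires xarray, zarr and network access).
# The URL is the one from the report above.
#
# import xarray as xr
#
# URL = (
#     "https://s3.waw3-1.cloudferro.com/mdl-arco-time-035/arco/"
#     "MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-h_202012/timeChunked.zarr"
# )
# for decode in (True, False):
#     _, seconds = timed(xr.open_dataset, URL, engine="zarr", decode_times=decode)
#     print(f"decode_times={decode}: {seconds:.2f}s")
```

If the two timings differ a lot under one numpy version but not the other, that points at time decoding rather than at the store or the network.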

It appears that with numpy>=2 we try to use cftime (which I didn't have installed in my test environment), and if I then pass use_cftime=False I get:

OutOfBoundsTimedelta: Cannot cast 45757440 from m to 'ns' without overflow.

(For reference, the time values are encoded as minutes since 1900-01-01 in the "standard" calendar with dtype int32; they range from 45757440 to 64471620.)

I wonder if this has anything to do with the changed dtype casting rules in numpy>=2?
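To make those numbers concrete, here is a small standard-library sketch that converts the two quoted endpoint values to calendar dates and checks their magnitude. It is not xarray's decoding path, and the helper name `decode_minutes` is made up for illustration. One observation, offered as a hint rather than a diagnosis: the values fit comfortably in int64 nanoseconds, but multiplying the first one by 60 already exceeds the int32 range, which would be consistent with the int32 dtype playing a role.

```python
from datetime import datetime, timedelta

# Reference date from the encoding "minutes since 1900-01-01".
# Python's datetime uses the proleptic Gregorian calendar, which agrees with
# the "standard" calendar for dates this recent.
EPOCH = datetime(1900, 1, 1)

INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1
NS_PER_MINUTE = 60 * 10**9


def decode_minutes(minutes: int) -> datetime:
    """Convert 'minutes since 1900-01-01' to a datetime (illustrative helper)."""
    return EPOCH + timedelta(minutes=minutes)


first, last = 45757440, 64471620  # endpoint values quoted in the comment above

print(decode_minutes(first))  # 1987-01-01 00:00:00
print(decode_minutes(last))   # 2022-07-31 23:00:00

# Both endpoints fit in int64 when expressed in nanoseconds ...
print(first * NS_PER_MINUTE <= INT64_MAX)  # True
print(last * NS_PER_MINUTE <= INT64_MAX)   # True

# ... but converting the first value from minutes to seconds already
# exceeds the int32 range.
print(first * 60 > INT32_MAX)  # True
```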

cc @spencerkclark, in case you have any insight here

spencerkclark · 2024-09-25T10:48:49Z

Thanks @keewis for taking a look. Was this with xarray main? This seems reminiscent of #9498 (review), which we fixed in #9518. In other words, due to an upstream pandas issue, we inadvertently needed to fall back to cftime to decode times encoded with small integer dtypes with NumPy >= 2.0; #9518 should have restored the old behavior with a workaround.

keewis · 2024-09-25T11:50:33Z

Actually, no. Let me retry with xarray main and get back to you.

keewis · 2024-09-25T11:59:12Z

I can confirm that xarray main does not have that issue, which means the next release will fix this (or at least work around it).

Thanks for the well-written report, @renaudjester.

As an additional comment, the link you've been using is actually on S3, which means that by using that protocol you can access the data a bit more efficiently:

import xarray as xr

# Open the same store through the S3 protocol instead of plain HTTPS.
xr.open_dataset(
    "s3://mdl-arco-time-035/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-h_202012/timeChunked.zarr",
    engine="zarr",
    # Anonymous access against the custom (non-AWS) endpoint.
    storage_options={"endpoint_url": "https://s3.waw3-1.cloudferro.com", "anon": True},
)

spencerkclark · 2024-09-25T12:11:17Z

Great, thanks for confirming @keewis.

renaudjester · 2024-09-25T12:44:33Z

Thanks a lot :D

Super, I will wait for the next release then!

@keewis Thanks for the tip! Could you explain why this is a bit more efficient?

keewis · 2024-09-25T12:58:10Z

As far as I can tell (and I'm by no means an expert on this), the S3 protocol is a REST API. This means that while it is possible to talk to it using just HTTP vocabulary, plain HTTP doesn't allow you to be as precise when requesting data, so you'll have some overhead.

renaudjester added the "bug" and "needs triage" labels Sep 25, 2024

keewis removed the "needs triage" label Sep 25, 2024

keewis closed this as completed Sep 25, 2024
