Slow performance when open zarr file with numpy>2.0.0 #9545

Closed
5 tasks done
renaudjester opened this issue Sep 25, 2024 · 7 comments

@renaudjester

renaudjester commented Sep 25, 2024

What happened?

Hi!

I want to open a zarr dataset lazily.
On my computer:
With numpy==1.26.4 it takes around 1.5 s
With numpy==2.1.1 it takes around 5 s

It's also slow on an Ubuntu machine.

Unfortunately, I don't have the time to dig into the issue and pinpoint exactly which piece of code takes so much longer than before. From the little testing I did, the slowdown doesn't seem to come from the HTTP calls.

What did you expect to happen?

I expect the time to lazily open the dataset to be the same regardless of the numpy version.

Minimal Complete Verifiable Example

import xarray
import time

top = time.time()
dataset = xarray.open_dataset(
    "https://s3.waw3-1.cloudferro.com/mdl-arco-time-035/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-h_202012/timeChunked.zarr",
    engine="zarr",
)
print(f"Took: {time.time() - top}s")
# with numpy==1.26.4: ~1s
# with numpy==2.1.1: ~5s

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

# numpy==2.1.1
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.3 (main, Sep 23 2024, 17:37:36) [Clang 15.0.0 (clang-1500.3.9.4)]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.1
scipy: None
netCDF4: 1.7.1.post2
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.9.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.0
conda: None
pytest: 8.3.3
mypy: None
IPython: 8.27.0
sphinx: None
None

# numpy==1.26.4
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.3 (main, Sep 23 2024, 17:37:36) [Clang 15.0.0 (clang-1500.3.9.4)]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 1.26.4
scipy: None
netCDF4: 1.7.1.post2
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.9.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.0
conda: None
pytest: 8.3.3
mypy: None
IPython: 8.27.0
sphinx: None
None

@renaudjester renaudjester added the "bug" and "needs triage" labels Sep 25, 2024
@keewis keewis removed the "needs triage" label Sep 25, 2024
@keewis
Collaborator

keewis commented Sep 25, 2024

This looks like an issue with time decoding: with decode_times=False, the dataset opens in about 0.5 seconds for me under both numpy<2 and numpy>=2.
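A minimal sketch of that comparison, assuming a tiny in-memory dataset as a hypothetical stand-in for the remote store (the coordinate name, the second time value, and the use of `xr.decode_cf` instead of `open_dataset` are choices made for illustration so that no network access is needed):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the remote store: an int32 time coordinate
# with CF-style units, built in memory.
encoded = xr.Dataset(
    coords={
        "time": (
            "time",
            np.array([45757440, 45757500], dtype="int32"),
            {"units": "minutes since 1900-01-01", "calendar": "standard"},
        )
    }
)

# decode_times=False leaves the raw integers untouched.
undecoded = xr.decode_cf(encoded, decode_times=False)

# The default decodes the integers into datetimes; this is the step
# suspected in this thread.
decoded = xr.decode_cf(encoded)

print(undecoded["time"].dtype)
print(decoded["time"].dtype)
```

For the real dataset, the same flag can be passed straight to the call from the report: `xarray.open_dataset(url, engine="zarr", decode_times=False)`.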

It appears that with numpy>=2 we try to use cftime (which I didn't have installed in my test environment), and if I then pass use_cftime=False I get:

OutOfBoundsTimedelta: Cannot cast 45757440 from m to 'ns' without overflow.

(For reference, time values are encoded as minutes since 1900-01-01 in the "standard" calendar with dtype int32; the values are between 45757440 and 64471620.)
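A quick arithmetic check of those numbers, as a sketch only: it shows that the values converted to nanoseconds fit comfortably in int64 but not in int32, which makes a narrow integer dtype a plausible culprit. It does not reproduce the code path that xarray or pandas actually takes, and the variable names are made up for the example.

```python
import numpy as np

first, last = 45757440, 64471620   # encoded range quoted above, in minutes
NS_PER_MINUTE = 60 * 10**9

# Plain Python integers never overflow, so this is the exact result.
as_ns = [value * NS_PER_MINUTE for value in (first, last)]

fits_int32 = all(v <= np.iinfo(np.int32).max for v in as_ns)
fits_int64 = all(v <= np.iinfo(np.int64).max for v in as_ns)
print(fits_int32, fits_int64)

# Decoding by hand at minute resolution, after widening to int64.
epoch = np.datetime64("1900-01-01T00:00", "m")
offsets = np.array([first, last], dtype="int64").astype("timedelta64[m]")
decoded = epoch + offsets
print(decoded)
```

So the dates themselves are well within the range that nanosecond-precision datetimes can represent; the overflow only appears if the intermediate arithmetic stays in a small integer type.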

I wonder if this has anything to do with the changed dtype casting rules in numpy>=2?

cc @spencerkclark, in case you have any insight here

@spencerkclark
Member

spencerkclark commented Sep 25, 2024

Thanks @keewis for taking a look. Was this with xarray main? This seems reminiscent of #9498 (review), which we fixed in #9518. In other words, due to an upstream pandas issue, we inadvertently needed to fall back to cftime to decode times encoded with small integer dtypes with NumPy >= 2.0; #9518 should have restored the old behavior with a workaround.

@keewis
Collaborator

keewis commented Sep 25, 2024

Actually, no. Let me retry with xarray main and get back to you.

@keewis
Collaborator

keewis commented Sep 25, 2024

I can confirm that xarray main does not have that issue, which means the next release will fix this (or at least work around it).

Thanks for the well-written report, @renaudjester.

As an additional comment, the link you've been using is actually on S3, which means that by using that protocol you can access the data a bit more efficiently:

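import xarray as xr

# Same dataset as in the report, addressed through the S3 protocol (needs s3fs installed).
xr.open_dataset(
    "s3://mdl-arco-time-035/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-h_202012/timeChunked.zarr",
    engine="zarr",
    storage_options={"endpoint_url": "https://s3.waw3-1.cloudferro.com", "anon": True},
)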
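The rewrite from the HTTPS link to the `s3://` form can be sketched as a small helper. The function name `path_style_https_to_s3` is hypothetical, and the sketch assumes path-style addressing (the bucket is the first path segment after the host) and anonymous access, as in the snippet above; it only builds the arguments and does not contact the server.

```python
from urllib.parse import urlsplit


def path_style_https_to_s3(url: str) -> tuple[str, dict]:
    """Split a path-style S3 HTTPS URL into an s3:// URL and storage options."""
    parts = urlsplit(url)
    # Assumption: the first path segment is the bucket, the rest is the key.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError(f"no bucket found in {url!r}")
    storage_options = {
        "endpoint_url": f"{parts.scheme}://{parts.netloc}",
        "anon": True,  # assumption: the bucket allows anonymous reads
    }
    return f"s3://{bucket}/{key}", storage_options


s3_url, storage_options = path_style_https_to_s3(
    "https://s3.waw3-1.cloudferro.com/mdl-arco-time-035/arco/"
    "MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-h_202012/timeChunked.zarr"
)
print(s3_url)
print(storage_options)
```

The two return values can then be passed to `open_dataset` as the path and as `storage_options`.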

@keewis keewis closed this as completed Sep 25, 2024
@spencerkclark
Member

Great, thanks for confirming @keewis.

@renaudjester
Author

Thanks a lot :D

Super, I will wait for the next release then!

@keewis Thanks for the tip! Could you explain why this is a bit more efficient?

@keewis
Collaborator

keewis commented Sep 25, 2024

As far as I can tell (and I'm by no means an expert on this), the S3 protocol is a REST API. This means that while it is possible to talk to it using plain HTTP, doing so doesn't let you be as precise when requesting data, so you'll have some overhead.
