Refactor datetime and timedelta encoding for increased robustness #9498

spencerkclark · 2024-09-15T14:47:50Z

This PR implements the updates proposed in #9488 (comment), removing the guard against overflow in dtype casting. In doing so I have added units checking (and, where needed, units modification) to the cftime code path to bring it in line with the NumPy code path, and added a regression test for #9134.

We now more robustly raise an error for chunked times when an integer dtype encoding is specified but the units do not allow for an accurate roundtrip. This was the original intent of the code, though we can also discuss whether it is better to warn instead of raise in this instance. Either way, this would also make it safer to switch the default units with which we encode chunked times (#9154).
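As a rough illustration of the check described above, the sketch below encodes times as integer counts of a unit and raises when either the units or the requested dtype cannot represent them exactly. It is not xarray's implementation: the helper name `encode_exact`, the unit table, and the error messages are assumptions made for this example, and it ignores missing values (NaT), empty inputs, and chunked arrays.

```python
import numpy as np

# Nanoseconds per unit; a hypothetical lookup table for this sketch only.
_NS_PER_UNIT = {
    "days": 86_400 * 10**9,
    "hours": 3_600 * 10**9,
    "minutes": 60 * 10**9,
    "seconds": 10**9,
    "milliseconds": 10**6,
    "microseconds": 10**3,
    "nanoseconds": 1,
}


def encode_exact(times, reference, units, dtype):
    """Encode times as integer counts of `units` since `reference`.

    Raises instead of silently rounding or wrapping, so that decoding the
    result reproduces the input exactly. Assumes no NaT values.
    """
    times = np.asarray(times, dtype="datetime64[ns]")
    delta_ns = (times - np.datetime64(reference, "ns")).astype(np.int64)

    # A nonzero remainder means the units are too coarse for these times.
    quotient, remainder = np.divmod(delta_ns, _NS_PER_UNIT[units])
    if np.any(remainder != 0):
        raise ValueError(
            f"times are not whole multiples of {units!r}; "
            "an integer encoding would not roundtrip"
        )

    # The counts must also fit in the requested integer dtype.
    info = np.iinfo(dtype)
    if quotient.min() < info.min or quotient.max() > info.max:
        raise OverflowError(f"encoded values do not fit in {np.dtype(dtype)}")

    return quotient.astype(dtype)
```

Under these assumptions, a time twelve hours after the reference encodes cleanly with units of "hours" but raises with units of "days", which is the situation the PR turns into an error rather than a lossy write.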

As @kmuehlbauer notes, this would be another way to fix #9488; in some ways it is complementary to #9497.

* Closes #9134 (Can not save timedelta data arrays with small integer dtypes and _FillValue)
* Tests added
* User visible changes (including notable bug fixes) are documented in whats-new.rst

langmore

After line 332, we could have an int32 value for num_dates (e.g. if the user provided an int32 value to begin with). This will be passed down to _decode_datetime_with_pandas, which, upon calling pd.to_timedelta(flat_num_dates.min(), time_units), raises OutOfBoundsTimedelta: Cannot cast 24 from h to 'ns' without overflow, because pandas does not handle the int32 first argument. This can be prevented by casting num_dates to int64.

I believe this is related to pandas-dev/pandas#56996
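The arithmetic behind the error message can be checked directly. The sketch below assumes the failure arises because the conversion to nanoseconds is evaluated at the input's 32-bit width; that reading is inferred from the message and is not confirmed here. It only compares the nanosecond value of 24 hours against the integer limits.

```python
import numpy as np

NS_PER_HOUR = 3_600 * 10**9

# 24 hours expressed in nanoseconds, computed with Python's unbounded ints.
value_ns = 24 * NS_PER_HOUR

# The value is far beyond int32 but comfortably within int64.
fits_int32 = value_ns <= np.iinfo(np.int32).max
fits_int64 = value_ns <= np.iinfo(np.int64).max
```

Here `value_ns` is 86,400,000,000,000, while the largest int32 is 2,147,483,647, so `fits_int32` is false and `fits_int64` is true. Widening the input to int64 before the conversion leaves enough room for the result.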

spencerkclark · 2024-09-18T02:06:27Z

Thanks for pointing this out @langmore. Indeed, I think we already have tests in our suite that would fail because of this, but we don't (yet) have a CI environment with NumPy >= 2 and without cftime. In other words, the fact that we can fall back to decoding times with cftime masks the problem.

Your suggested workaround makes sense to me; we may just want to make sure that we cast signed integers to int64 and unsigned integers to uint64 to avoid overflow (and avoid doing any casting for floats). If you'd like, I think we'd be happy to take a PR; I can also take care of it if you prefer. Unfortunately my attempt to fix this upstream stalled.
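The casting rule suggested above can be sketched as a small helper. The name `widen_integers` is hypothetical and this is not necessarily how the eventual fix is written; it only shows the dtype-kind dispatch (signed to int64, unsigned to uint64, everything else untouched).

```python
import numpy as np


def widen_integers(num_dates):
    """Widen integer inputs to 64 bits; leave floats and other dtypes as-is."""
    num_dates = np.asarray(num_dates)
    if num_dates.dtype.kind == "i":  # signed integers
        return num_dates.astype(np.int64)
    if num_dates.dtype.kind == "u":  # unsigned integers
        return num_dates.astype(np.uint64)
    return num_dates  # floats are deliberately not cast
```

Calling this on the values before they reach the pandas decoding path would give every integer input the same 64-bit width, while float inputs pass through unchanged.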

langmore · 2024-09-18T16:49:49Z

@spencerkclark I won't have time to fix this. We're also falling back on cftime, so it works for us for now.

spencerkclark · 2024-09-19T00:04:45Z

No worries @langmore—I went ahead with something along the lines of your suggested fix in #9518.

spencerkclark added 3 commits September 15, 2024 09:32

Increase robustness of encode_cf_datetime

bab549b

* Implement robust units checking for lazy encoding
* Remove dtype casting, which interferes with missing value encoding

Take the same approach for timedelta encoding

b31f75e

Fix typing

6c83b68

spencerkclark mentioned this pull request Sep 15, 2024

writing datetime64 in netCDF may produce badly formatted or unreadable files #9488

Closed

langmore reviewed Sep 16, 2024

dcherian mentioned this pull request Sep 17, 2024

Add datetime encode/decode property test #9507

Draft

spencerkclark mentioned this pull request Sep 19, 2024

Fix pandas datetime decoding with NumPy >= 2.0 for small integer dtypes #9518

Merged

TomNicholas added the topic-cftime label Sep 20, 2024

spencerkclark mentioned this pull request Sep 25, 2024

Slow performance when open zarr file with numpy>2.0.0 #9545

Closed
