Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor datetime and timedelta encoding for increased robustness #9498

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

spencerkclark
Copy link
Member

This PR makes the proposed updates in #9488 (comment), removing the guard against overflow in dtype casting. In so doing I have added units checking / potential modification to the cftime code path to bring things up to speed with the NumPy code path, and also added a regression test for #9134.

Now we more robustly raise an error in the case of chunked times where an integer dtype encoding is specified, but the units do not allow for an accurate roundtrip. This was the original intent of the code, though we can also discuss whether it is better to warn instead of raise in this instance—one way or another this would also make it safer to switch the default units with which we encode chunked times (#9154).

As @kmuehlbauer notes, this would be another way to fix #9488; in some ways it is complementary to #9497.

* Implement robust units checking for lazy encoding
* Remove dtype casting, which interferes with missing value encoding
Copy link

@langmore langmore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After line 332, we could have an int32 value for num_dates (e.g. if the user provided an int32 value to begin with). This will be passed down to _decode_datetime_with_pandas, which, upon calling pd.to_timedelta(flat_num_dates.min(), time_units) will raise OutOfBoundsTimedelta: Cannot cast 24 from h to 'ns' without overflow due to pandas not handling the int32 first argument. This is preventable by casting num_dates to int64.

I believe this is related to pandas-dev/pandas#56996

@spencerkclark
Copy link
Member Author

Thanks for pointing this out @langmore. Indeed I think we actually have tests that would fail because of this in our suite, but we don't (yet) have a CI environment that has NumPy >= 2, but no cftime. In other words the fact that we can fall back to decoding times with cftime masks this.

Your suggested workaround makes sense to me; we may just want to make sure that we cast signed integers to int64 and unsigned integers to uint64 to avoid overflow (and avoid doing any casting for floats). If you'd like, I think we'd be happy to take a PR; I can also take care of it if you prefer. Unfortunately my attempt to fix this upstream stalled.

@langmore
Copy link

Thanks for pointing this out @langmore. Indeed I think we actually have tests that would fail because of this in our suite, but we don't (yet) have a CI environment that has NumPy >= 2, but no cftime. In other words the fact that we can fall back to decoding times with cftime masks this.

Your suggested workaround makes sense to me; we may just want to make sure that we cast signed integers to int64 and unsigned integers to uint64 to avoid overflow (and avoid doing any casting for floats). If you'd like, I think we'd be happy to take a PR; I can also take care of it if you prefer. Unfortunately my attempt to fix this upstream stalled.

@spencerkclark I won't have time to fix this, since we're also falling back on cftime so it works for us, for now.

@spencerkclark
Copy link
Member Author

No worries @langmore—I went ahead with something along the lines of your suggested fix in #9518.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants