Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for custom seasons spanning calendar years #423

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

tomvothecoder
Copy link
Collaborator

@tomvothecoder tomvothecoder commented Mar 3, 2023

Description

TODO:

  • Support custom seasons that span calendar years (_shift_spanning_months())
    • Requires detecting order of the months in a season. Currently, order does not matter.
    • For example, for custom_season = ["Nov", "Dec", "Jan", "Feb", "Mar"]:
      • ["Nov", "Dec"] are from the previous year since they are listed before "Jan"
      • ["Jan", "Feb", "Mar"] are from the current year
      • Therefore, ["Nov", "Dec"] need to be shifted a year forward for correct
        grouping.
  • Detect and drop incomplete seasons (_drop_incomplete_seasons())
    • Right now xCDAT only detects incomplete "DJF" seasons with _drop_incomplete_djf()
    • Replace boolean config drop_incomplete_djf with drop_incomplete_season
    • A possible solution for detecting incomplete seasons is to check if a season has all of the required months. If the count of months for that season does not match the expected count, then drop that season.
  • Remove requirement for all 12 months to be included in a custom season
  • Refactor logic for shifting months to use Xarray instead of Pandas
  • The current logic maps the custom season to its middle month, represented as cftime time coordinates. Does it make sense to also keep the custom seasons with the time coordinates, similar to what Xarray does?
      <xarray.DataArray 'season' (season: 2)> Size: 32B
      array(['DJFM', 'AMJ'], dtype='<U4')
      Coordinates:
        * season   (season) <U4 32B 'DJFM' 'AMJ'

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

Additional Context

@tomvothecoder tomvothecoder added the type: enhancement New enhancement request label Mar 3, 2023
@tomvothecoder tomvothecoder self-assigned this Mar 3, 2023
@tomvothecoder tomvothecoder force-pushed the feature/416-custom-season-span branch from 96c5eca to fa087b7 Compare March 6, 2023 20:31
@tomvothecoder
Copy link
Collaborator Author

Example result of _drop_incomplete_seasons():

# Before dropping
# -----------------
# 2000-1, 2000-2, and 2001-12 months in incomplete "DJF" seasons" so they are dropped
ds.time
<xarray.DataArray 'time' (time: 15)>
array([cftime.DatetimeGregorian(2000, 1, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 2, 15, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 3, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 4, 16, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 5, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 6, 16, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 7, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 8, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 9, 16, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 10, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 11, 16, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 12, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2001, 1, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2001, 2, 15, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2001, 12, 16, 12, 0, 0, 0, has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 2000-01-16 12:00:00 ... 2001-12-16 12:00:00
Attributes:
    axis:           T
    long_name:      time
    standard_name:  time
    bounds:         time_bnds

# After dropping
# -----------------
ds_new.time
<xarray.DataArray 'time' (time: 12)>
array([cftime.DatetimeGregorian(2000, 3, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 4, 16, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 5, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 6, 16, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 7, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 8, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 9, 16, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 10, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 11, 16, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2000, 12, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2001, 1, 16, 12, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2001, 2, 15, 0, 0, 0, 0, has_year_zero=False)],
      dtype=object)
Coordinates:
  * time     (time) object 2000-03-16 12:00:00 ... 2001-02-15 00:00:00
Attributes:
    axis:           T
    long_name:      time
    standard_name:  time
    bounds:         time_bnds

@tomvothecoder
Copy link
Collaborator Author

Hey @lee1043, this PR seemed to be mostly done when I stopped working on it last year. I just had to fix a few things and update the tests.

Would you like to check out this branch to test it out on real data? Also a code review would be appreciated.

@tomvothecoder tomvothecoder marked this pull request as ready for review April 4, 2024 21:50
Copy link

codecov bot commented Apr 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (1cfd369) to head (8d156c2).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #423   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           15        15           
  Lines         1555      1614   +59     
=========================================
+ Hits          1555      1614   +59     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lee1043
Copy link
Collaborator

lee1043 commented Apr 4, 2024

@tomvothecoder sure, I will test it out and review. Thank you for the update!

@lee1043
Copy link
Collaborator

lee1043 commented Apr 4, 2024

@tomvothecoder Can this be considered for v0.7.0?

Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR self-review

Comment on lines 1012 to 991
warnings.warn(
"The `season_config` argument 'drop_incomplete_djf' is being "
"deprecated. Please use 'drop_incomplete_seasons' instead.",
DeprecationWarning,
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Need to a specify a specific version that we will deprecate drop_incomplete_djf. Probably v0.8.0 or v0.9.0.

xcdat/temporal.py Outdated Show resolved Hide resolved
Comment on lines -1025 to -1003
if len(input_months) != len(predefined_months):
raise ValueError(
"Exactly 12 months were not passed in the list of custom seasons."
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed requirements for all 12 months to be included in a custom season

@tomvothecoder
Copy link
Collaborator Author

tomvothecoder commented Apr 4, 2024

@tomvothecoder Can this be considered for v0.7.0?

This PR still needs thorough review before I'm confident in merging it. I'll probably tag Steve at some point.

The release after v0.7.0 is more realistic and reasonable. We can always initiate a new release for this feature whenever it is merged.

@lee1043
Copy link
Collaborator

lee1043 commented Apr 4, 2024

@tomvothecoder Can this be considered for v0.7.0?

This PR still needs thorough review before I'm confident in merging it. I'll probably tag Steve at some point.

The release after v0.7.0 is more realistic and reasonable. We can always initiate a new release for this feature whenever it is merged.

@tomvothecoder no problem. Thank you for consideration.

@lee1043
Copy link
Collaborator

lee1043 commented Apr 4, 2024

@tomvothecoder it looks like when custom season go beyond calendar year (Nov, Dec, Jan) there is error as follows.

import os
import xcdat as xc

input_data = os.path.join(
    "/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/psl/gn/v20181218/",
    "psl_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")

ds = xc.open_mfdataset(input_data)

# Example of custom seasons in a three month format:
custom_seasons = [
    ['Dec', 'Jan'],
]

season_config = {'custom_seasons': custom_seasons, 'dec_mode': 'DJF', 'drop_incomplete_djf': True}

ds.temporal.group_average("psl", "season", season_config=season_config)
CPU times: user 448 ms, sys: 32.8 ms, total: 481 ms
Wall time: 471 ms
xarray.Dataset
Dimensions:
lat: 192bnds: 2lon: 384time: 4
Coordinates:
lat
(lat)
float64
-89.28 -88.36 ... 88.36 89.28
lon
(lon)
float64
0.0 0.9375 1.875 ... 358.1 359.1
time
(time)
object
2013-02-01 00:00:00 ... 2013-11-...
Data variables:
lat_bnds
(lat, bnds)
float64
dask.array<chunksize=(192, 2), meta=np.ndarray>
lon_bnds
(lon, bnds)
float64
dask.array<chunksize=(384, 2), meta=np.ndarray>
psl
(time, lat, lon)
float64
dask.array<chunksize=(1, 192, 384), meta=np.ndarray>
Indexes: (3)
Attributes: (44)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xarray/core/dataarray.py:862, in DataArray._getitem_coord(self, key)
    861 try:
--> 862     var = self._coords[key]
    863 except KeyError:

KeyError: 'time.year'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<timed exec> in ?()

~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xcdat/temporal.py in ?(self, data_var, freq, weighted, keep_weights, season_config)
    400         }
    401         """
    402         self._set_data_var_attrs(data_var)
    403 
--> 404         return self._averager(
    405             data_var,
    406             "group_average",
    407             freq,

~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xcdat/temporal.py in ?(self, data_var, mode, freq, weighted, keep_weights, reference_period, season_config)
    870 
    871         if self._mode == "average":
    872             dv_avg = self._average(ds, data_var)
    873         elif self._mode in ["group_average", "climatology", "departures"]:
--> 874             dv_avg = self._group_average(ds, data_var)
    875 
    876         # The original time dimension is dropped from the dataset because
    877         # it becomes obsolete after the data variable is averaged. When the

~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xcdat/temporal.py in ?(self, ds, data_var)
   1295         dv = _get_data_var(ds, data_var)
   1296 
   1297         # Label the time coordinates for grouping weights and the data variable
   1298         # values.
-> 1299         self._labeled_time = self._label_time_coords(dv[self.dim])
   1300 
   1301         if self._weighted:
   1302             time_bounds = ds.bounds.get_bounds("T", var_key=data_var)

~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xcdat/temporal.py in ?(self, time_coords)
   1470         >>>       dtype='datetime64[ns]')
   1471         >>> Coordinates:
   1472         >>> * time     (time) datetime64[ns] 2000-01-01T00:00:00 ... 2000-04-01T00:00:00
   1473         """
-> 1474         df_dt_components: pd.DataFrame = self._get_df_dt_components(
   1475             time_coords, drop_obsolete_cols=True
   1476         )
   1477         dt_objects = self._convert_df_to_dt(df_dt_components)

~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xcdat/temporal.py in ?(self, time_coords, drop_obsolete_cols)
   1534 
   1535         # Use the TIME_GROUPS dictionary to determine which components
   1536         # are needed to form the labeled time coordinates.
   1537         for component in TIME_GROUPS[self._mode][self._freq]:
-> 1538             df[component] = time_coords[f"{self.dim}.{component}"].values
   1539 
   1540         # The season frequency requires additional datetime components for
   1541         # processing, which are later removed before time coordinates are

~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xarray/core/dataarray.py in ?(self, key)
    869     def __getitem__(self, key: Any) -> Self:
    870         if isinstance(key, str):
--> 871             return self._getitem_coord(key)
    872         else:
    873             # xarray-style array indexing
    874             return self.isel(indexers=self._item_key_to_dict(key))

~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xarray/core/dataarray.py in ?(self, key)
    861         try:
    862             var = self._coords[key]
    863         except KeyError:
    864             dim_sizes = dict(zip(self.dims, self.shape))
--> 865             _, key, var = _get_virtual_variable(self._coords, key, dim_sizes)
    866 
    867         return self._replace_maybe_drop_dims(var, name=key)

~/.conda/envs/xcdat_dev_20240401/lib/python3.11/site-packages/xarray/core/dataset.py in ?(variables, key, dim_sizes)
    212     if _contains_datetime_like_objects(ref_var):
    213         ref_var = DataArray(ref_var)
    214         data = getattr(ref_var.dt, var_name).data
    215     else:
--> 216         data = getattr(ref_var, var_name).data
    217     virtual_var = Variable(ref_var.dims, data)
    218 
    219     return ref_name, var_name, virtual_var

AttributeError: 'IndexVariable' object has no attribute 'year'
CPU times: user 240 ms, sys: 13.2 ms, total: 253 ms
Wall time: 249 ms
xarray.Dataset
Dimensions:
lat: 192bnds: 2lon: 384time: 1
Coordinates:
lat
(lat)
float64
-89.28 -88.36 ... 88.36 89.28
lon
(lon)
float64
0.0 0.9375 1.875 ... 358.1 359.1
time
(time)
object
2013-02-01 00:00:00
Data variables:
lat_bnds
(lat, bnds)
float64
dask.array<chunksize=(192, 2), meta=np.ndarray>
lon_bnds
(lon, bnds)
float64
dask.array<chunksize=(384, 2), meta=np.ndarray>
psl
(time, lat, lon)
float64
dask.array<chunksize=(1, 192, 384), meta=np.ndarray>
Indexes: (3)
Attributes: (44)
CPU times: user 106 ms, sys: 6.96 ms, total: 113 ms
Wall time: 110 ms
xarray.Dataset
Dimensions:
lat: 192bnds: 2lon: 384time: 1
Coordinates:
lat
(lat)
float64
-89.28 -88.36 ... 88.36 89.28
lon
(lon)
float64
0.0 0.9375 1.875 ... 358.1 359.1
time
(time)
object
2013-02-01 00:00:00
Data variables:
lat_bnds
(lat, bnds)
float64
dask.array<chunksize=(192, 2), meta=np.ndarray>
lon_bnds
(lon, bnds)
float64
dask.array<chunksize=(384, 2), meta=np.ndarray>
psl
(time, lat, lon)
float64
dask.array<chunksize=(1, 192, 384), meta=np.ndarray>
Indexes: (3)
Attributes: (44)

@tomvothecoder
Copy link
Collaborator Author

@lee1043 Thanks for trying to this out and providing an example script! I'll debug the stack trace.

@tomvothecoder tomvothecoder added the priority: soon Should be addressed soon. label Oct 10, 2024
@lee1043
Copy link
Collaborator

lee1043 commented Oct 10, 2024

@tomvothecoder I will give another round of test for this PR in this week. Sorry, it slipped from my list and thank you for the reminder!

@lee1043
Copy link
Collaborator

lee1043 commented Oct 11, 2024

I used 2 years of monthly time series data to test this PR. From the test results below, I believe the code is working as expected.

1. First sanity check: compare xcdat default vs custom season

import os
import xarray as xr
import xcdat as xc
import matplotlib.pyplot as plt

# input data is from 
# "/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/psl/gn/v20181218/"

input_data = [
    "psl_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201201-201212.nc",
    "psl_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc"
]

ds = xc.open_mfdataset(input_data)

# Sanity check

def compare_xcdat_default_vs_custom(custom_season, custom_drop_incomplete_seasons=True, ax=None, title=None):
    
    if custom_season == "DJF":
        custom_seasons = [['Dec', 'Jan', 'Feb']]
        index = 0
    elif custom_season == "MAM":
        custom_seasons = [['Mar', 'Apr', 'May']]
        index = 1
    elif custom_season == "JJA":
        custom_seasons = [['Jun', 'Jul', 'Aug']]
        index = 2
    elif custom_season == "SON":
        custom_seasons = [['Sep', 'Oct', 'Nov']]
        index = 3
        
    # season from default method
    ds_default_season = ds.temporal.climatology(
        "psl",
        freq="season",
        weighted=True,
        season_config={"dec_mode": "DJF", "drop_incomplete_djf": True},
    )

    da_default_season = ds_default_season['psl'][index]
    #print(custom_season, 'default', ds_default_season.sizes, da_default_season.shape)
    
    # season from custom season
    season_config = {'custom_seasons': custom_seasons, 'dec_mode': 'DJF', 'drop_incomplete_seasons': custom_drop_incomplete_seasons}
    ds_custom_season = ds.temporal.group_average("psl", "season", season_config=season_config)
    # get climatology
    if len(ds_custom_season.time) > 1:
        ds_custom_season = ds_custom_season.bounds.add_missing_bounds()
        ds_custom_season = ds_custom_season.temporal.average("psl")
    da_custom_season = ds_custom_season["psl"].squeeze()
    
    #print(custom_season, 'custom', ds_custom_season.sizes, da_custom_season.shape)
    
    # plot
    (da_default_season - da_custom_season).plot(ax=ax)
    
    if ax is not None and title is not None:
        ax.set_title(title)

fig, ax = plt.subplots(2, 4, figsize=(12, 5))

seasons = ["DJF", "MAM", "JJA", "SON"]

for i, season in enumerate(seasons):
    compare_xcdat_default_vs_custom(season, custom_drop_incomplete_seasons=True, ax=ax[0,i], title=season)
    compare_xcdat_default_vs_custom(season, custom_drop_incomplete_seasons=False, ax=ax[1,i], title=season)

fig.suptitle("xcdat default - custom \n upper: custom_drop_incomplete_seasons=True \n upper: custom_drop_incomplete_seasons=False")

fig.tight_layout()

compare_xcdat_default_vs_custom_season

The result is consistent to that is expected. DJF is one time step with drop_incomplete_seasons=True so no difference between default and custom approach (upper left). It became different when incomplete season is not dropped from custom, while it was dropped from default (lower left). Other seasons are not very much influenced by the parameter. so noise like differences are shown.

2. Compare to other tools

custom_season = "JJAS"
custom_seasons = [['Jun', 'Jul', 'Aug', 'Sep']]

#
# xcdat
#
ds = xc.open_mfdataset(input_data)

season_config = {'custom_seasons': custom_seasons, 'dec_mode': 'DJF', 'drop_incomplete_seasons': True}

ds_cs_xcdat = ds.temporal.group_average("psl", "season", season_config=season_config)
# get climatology
if len(ds_cs_xcdat.time) > 1:
    ds_cs_xcdat = ds_cs_xcdat.bounds.add_missing_bounds()
    ds_cs_xcdat = ds_cs_xcdat.temporal.average("psl")
da_cs_xcdat = ds_cs_xcdat["psl"].squeeze()

#
# CDAT
# 

# "cdscan -x test.xml *.nc" conducted as pre-process

import cdms2
import cdutil

f = cdms2.open('test.xml')
d = f('psl')
custom_season_class = cdutil.times.Seasons(custom_season)
d_cs = custom_season_class.climatology(d)

# 
# PMP (xcdat and xarray method)
# 
from pcmdi_metrics.utils import custom_season_average

# xcdat method
ds_cs_pmp_xcdat_yearly = custom_season_average(ds, "psl", season=custom_season, method="xcdat")
ds_cs_pmp_xcdat_yearly = ds_cs_pmp_xcdat_yearly.bounds.add_missing_bounds()
ds_cs_pmp_xcdat = ds_cs_pmp_xcdat_yearly.temporal.average("psl")
da_cs_pmp_xcdat = ds_cs_pmp_xcdat['psl']

# xarray method
ds_cs_pmp_xarray_yearly = custom_season_average(ds, "psl", season=custom_season, method="xarray")
ds_cs_pmp_xarray = ds_cs_pmp_xarray_yearly.mean(dim=["year"])
da_cs_pmp_xarray = ds_cs_pmp_xarray['psl']

# 
# Compare results
# 
fig, ax = plt.subplots(2, 4, figsize=(12, 5))

da_cs_xcdat.plot(ax=ax[0,0])
da_cs_cdat.plot(ax=ax[0,1])
da_cs_pmp_xcdat.plot(ax=ax[0,2])
da_cs_pmp_xarray.plot(ax=ax[0,3])


(da_cs_xcdat.to_numpy() - da_cs_cdat).plot(ax=ax[1,0])
(da_cs_cdat - da_cs_pmp_xcdat.to_numpy()).plot(ax=ax[1,1])
(da_cs_xcdat - da_cs_pmp_xcdat).plot(ax=ax[1,2])
(da_cs_pmp_xcdat - da_cs_pmp_xarray).plot(ax=ax[1,3])

ax[0,0].set_title("xcdat")
ax[0,1].set_title("cdat")
ax[0,2].set_title("pmp (xc)")
ax[0,3].set_title("pmp (xr)")

ax[1,0].set_title("xcdat - cdat")
ax[1,1].set_title("cdat - pmp (xc)")
ax[1,2].set_title("xcdat - pmp (xc)")
ax[1,3].set_title("pmp (xc) - pmp (xr)")

fig.suptitle(custom_season)

fig.tight_layout()

compare_xcdat_cdat_pmp_JJAS

No difference between xcdat, cdat, and pmp (xcdat) results, while pmp (xarray) is inconsistent maybe because it does not consider propor temporal weighting.

@lee1043 lee1043 self-requested a review October 11, 2024 01:18
@lee1043
Copy link
Collaborator

lee1043 commented Oct 11, 2024

@tomvothecoder if you plan to switch drop_incomplete_djf to drop_incomplete_seasons in this PR, do you plan to apply that to other functions as well? That would makes sense if there is a time series starting from July ending in June after years, and want to get JJA temporal average. But this can be a breaking change if drop_incomplete_djf is being used in one's code and they update xcdat version. I wonder if the following could be a reasonable transition plan: mapping drop_incomplete_djf to drop_incomplete_seasons during the transition, and print a deprecation warning sign if drop_incomplete_djf is being used.

@tomvothecoder
Copy link
Collaborator Author

tomvothecoder commented Oct 11, 2024

@tomvothecoder if you plan to switch drop_incomplete_djf to drop_incomplete_seasons in this PR, do you plan to apply that to other functions as well? That would makes sense if there is a time series starting from July ending in June after years, and want to get JJA temporal average. But this can be a breaking change if drop_incomplete_djf is being used in one's code and they update xcdat version.

The drop_incomplete_seasons functionality applies to all of the group averaging APIs (group_average, climatology, and departures).

I wonder if the following could be a reasonable transition plan: mapping drop_incomplete_djf to drop_incomplete_seasons during the transition, and print a deprecation warning sign if drop_incomplete_djf is being used.

I commented in this thread above that users will who use drop_incomplete_djf will receive a DeprecationWarning to use drop_incomplete_seasons instead.

In this PR right now, drop_incomplete_djf is mapped to drop_incomplete_seasons. However, I don't think we should map these configs because they behave differently. drop_incomplete_djf removes the leading and trailing incomplete DJF seasons, while drop_incomplete_seasons will drop all incomplete seasons. It is better to keep the behaviors of these configs separate to not break any existing code that depends on the current drop_incomplete_djf behavior. Users will have time to transition over to drop_incomplete_seasons before drop_incomplete_djf is deprecated.

I will update this PR to restore the original drop_incomplete_djf functionality to have it alongside drop_incomplete_seasons.

@tomvothecoder
Copy link
Collaborator Author

From our meeting on 10/23:

  • Near-term solution: move forward with this PR and merge it when ready
  • Long-term solution: consider refactoring the underlying logic with Xarray #9524 in the future
    • Scope out effort in refactoring xCDAT temporal backend to move away from Pandas to Xarray (need multi variable grouping)
    • Experiment with this Xarray PR now, see if it fits our requirements for custom seasons

@tomvothecoder
Copy link
Collaborator Author

tomvothecoder commented Oct 24, 2024

@xCDAT/core-developers For the final time coordinate output, the current logic maps the custom season to its middle month. For example, year 2000 and "NDJFM" will be represented by cftime(2000, 1, 1, 0...) after seasonal group averaging.

How useful would it be to keep the custom seasons as auxiliary coords? For example, (2000, NDJFM). I think it would be helpful to have since it explicitly shows what the custom seasons are.

In Xarray, grouping by season will result in "season" coords. I'm not sure if these auxiliary coords are CF-compliant, although probably not a big deal. For example,

  <xarray.DataArray 'season' (season: 2)> Size: 32B
  array(['DJFM', 'AMJ'], dtype='<U4')
  Coordinates:
    * season   (season) <U4 32B 'DJFM' 'AMJ'

Alternatively, the user can just reference the attributes of the data variable to see the custom seasons.

{'operation': 'temporal_avg',
 'mode': 'group_average',
 'freq': 'season',
 'weighted': 'True',
 'drop_incomplete_seasons': 'True',
 'custom_seasons': ['NovDecJanFebMar']}

Let me know your thoughts. This can also be a future enhancement or something to include if we decide to refactor later on.

@tomvothecoder
Copy link
Collaborator Author

In commit 0b6852f (#423), I refactored the logic for shifting months to use Xarray/NumPy instead of Pandas. We'll gradually shift away from the Pandas back-end for storing and manipulating Datetime components (used to group time coordinates) towards Xarray/Numpy in #217 since xarray >=2024.09.0 now supports grouping by multiple variables.

I will do a final walk through at the next xCDAT meeting (11/20) before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: soon Should be addressed soon. project: seats-fy24 type: enhancement New enhancement request
Projects
Status: In Review
5 participants