
[Enhancement]: Add weight threshold option for averaging operations #531

Open
pochedls opened this issue Aug 15, 2023 · 11 comments · May be fixed by #672
Labels: type: enhancement, work: complex

Comments

@pochedls
Collaborator

Is your feature request related to a problem?

When xCDAT performs an averaging operation (spatial or temporal) with missing data, it assigns missing values a weight of zero (which is correct). But this can be misleading if some or most of the data is missing. Imagine data were missing in spring, summer, and fall (leaving only winter data). xCDAT would take the annual average and report a very cold annual average temperature. A user is usually aware of this kind of thing, but may miss it if only a small number of grid cells are missing part of the dataset (or are missing anomaly values, which would be harder to recognize as "weird").

See the example below:

import xcdat as xc
import numpy as np
# load example dataset
fn = '/p/css03/esgf_publish/CMIP6/CMIP/NCAR/CESM2/amip/r1i1p1f1/Amon/tas/gn/v20190218/tas_Amon_CESM2_amip_r1i1p1f1_gn_195001-201412.nc'
ds = xc.open_dataset(fn)
# select 12 months of data (for some midlatitude grid cell over Asia)
ds = ds.isel(time=np.arange(0, 12), lat=[150], lon=[50]).squeeze()
print('range of seasonal cycle: ' + str(np.min(ds.tas.values)) + ' - ' + str(np.max(ds.tas.values)))
print('annual average: ' + str(ds.temporal.group_average('tas', freq='year').tas.values[0]))
# NaN out data over Jan - Nov
ds['tas'][0:11] = np.nan
print('annual average (with NaNs Jan - Nov): ' + str(ds.temporal.group_average('tas', freq='year').tas.values[0]))

range of seasonal cycle: 260.47635 - 297.10477
annual average: 278.48889160156256
annual average (with NaNs Jan - Nov): 265.1025406006844

A similar situation can arise if a timestep of observations is missing part of the field during spatial averaging operations (e.g., missing the tropics, leading to too cold global temperatures).

Describe the solution you'd like

This would need to be mapped out more, but it might be useful to have an optional weight_threshold argument that allows the user to specify the minimum temporal/spatial weight needed to produce a value. For example, weight_threshold=0.9 would require 90% of the data within the spatial or temporal averaging window to be present.

CDAT had similar functionality for temporal averaging calculations (see Specifying Data Coverage Criteria). I'm not sure if there was any similar functionality for spatial averaging.
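
For illustration, a rough sketch (not an existing xcdat feature) of how a user could enforce such a threshold manually for a spatial average today, assuming a dataset ds with a tas variable on lat/lon dimensions:

import numpy as np
import xarray as xr

# full spatial weights from the coordinate bounds (assumes no missing data)
weights = ds.spatial.get_weights(axis=['X', 'Y'])
weights, tas = xr.broadcast(weights, ds['tas'])

# fraction of the total weight with valid (non-NaN) data at each time step
valid_weight = xr.where(~np.isnan(tas), weights, 0).sum(dim=['lat', 'lon'])
frac = valid_weight / weights.sum(dim=['lat', 'lon'])

# spatial mean, masked wherever coverage falls below the 90% threshold
tas_mean = ds.spatial.average('tas', axis=['X', 'Y'])['tas']
tas_mean = tas_mean.where(frac >= 0.9)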

Describe alternatives you've considered

No response

Additional context

No response

@pochedls pochedls added the type: enhancement New enhancement request label Aug 15, 2023
@lee1043
Collaborator

lee1043 commented Aug 15, 2023

@pochedls thank you for posting this. I think this would be very important, especially when there is time-varying missing data in different locations, which is not uncommon in observational datasets. @gleckler1 handles observation datasets using xcdat for obs4mips, and this subject is going to be very relevant to his work. Datasets from obs4mips are used as reference datasets in PMP, thus this issue is also related to PMP.

@gleckler1

gleckler1 commented Aug 16, 2023

@pochedls @lee1043 Thanks for bringing this up! It would be great to have some sort of weight_threshold ... is the thinking that if the threshold was not met, a NaN would be returned or an error would be raised? It would be great to have this for both time and space, but my first choice would be time. Thanks again for thinking of this.

@taylor13

Here's an exchange I had with Chris Golaz in 2018 that relates to this and addresses the question of an annual mean. The approach is to first form monthly means; then, if at least one month of data is available during each season, an annual mean is calculated. If all 3 months were missing in one or more seasons, the annual mean would be considered missing.

I'll also try to find my notes on using a centroid criterion for means of cyclical data (like the annual cycle). It gave the user more control over how much data could be missing in a situation like that.


Chris,

I think the seasonal climatology assumes the first month in the time-series is January, so that part needs to be generalized to handle different starting months.

happy coding,
Karl

On 5/23/18 3:05 PM, Chris Golaz wrote:
Karl,

Thanks for that. This algorithm makes perfect sense to me with reasonable choices for propagating missing values from monthly to seasonal and annual. Maybe I should code it up in Fortran to see how much faster than Python it can be...

As far as I can tell, the extra complication in cdutil comes from the added flexibility on how missing values propagate up in the average (with the option of specifying a threshold and centroid).

-Chris

On 05/23/2018 02:43 PM, Karl Taylor wrote:
what, you aren't at NOAA?
Here's what I sent:

-------- Forwarded Message --------
Subject: Fwd: climatologies
Date: Wed, 23 May 2018 11:27:25 -0700
From: Karl Taylor [email protected]
To: Chris Golaz [email protected]

Hi Chris,

I can't find notes from 2001, but below I've copied a suggestion I made for computing climatologies for use with the metric package. The pseudo-code can handle missing months. It computes climatological monthly means first, then from those it computes the seasonal means and the annual means.

If this has been implemented, it was probably implemented on top of the fundamental CDMS functions.

Hope this helps a little,
Karl

-------- Forwarded Message --------
Subject: climatologies
Date: Thu, 9 Jul 2015 13:51:31 -0700
From: Karl Taylor [email protected]
To: Doutriaux, Charles [email protected], Gleckler, Peter [email protected], Durack, Paul J. [email protected]

Hi Charles and all,

Here's a proposal for how to compute climatological means, starting with
monthly data. Does anyone want to suggest a different approach?

Let's consider a seasonal mean missing if the climatological monthly
mean for one or more of the 3 months in that season is/are missing.
Let's consider an annual mean missing if the climatological monthly mean
for one or more of the 12 months of the year is/are missing.

Then I suggest the following algorithm:

Let f(x, m, y) be the value at grid-cell x, year y, and month m.
Let w(m, y) be the length of month m of year y.

*** First compute monthly climatologies for each grid cell [C(x, m)]

loop over grid cells (x)

    loop over months of year (m)

        C(x, m) = 0.
        A(x, m) = 0.
        n = 0

          loop over years (y)

                If (f(x, m, y) .not. missing)
                    n = n + 1
                    C(x, m) = C(x, m) + f(x, m, y) * w(m, y)
                    A(x, m) = A(x, m) + w(m, y)

            end y loop

            if (n = 0)
                C(x, m) = missing
            else
                C(x, m) = C(x, m) / A(x,m)
                A(x, m) = A(x, m) / n
            endif

    end m loop

end x loop

*** Now compute seasonal mean climatologies [Cs(x, s)]:

loop over grid cells (x)

    loop over 4 seasons (s)

        Cs(x, s) = 0.

        m1 = 3*s
        m2 = m1 + 1
        m3 = m1 + 2
        if m1 = 0 then m1 = 12

                If   A(x, m1) * A(x, m2) * A(x, m3)  = 0.  then

                    Cs(x, s) = missing

                else

                    Cs(x, s) = (C(x, m1)*A(x, m1) + C(x, m2)*A(x, m2) + C(x, m3)*A(x, m3)) / (A(x, m1) + A(x, m2) + A(x, m3))

        end s loop

end x loop

*** Now compute annual mean climatology [Ca(x)]:

loop over grid cells (x)

    Ca(x) = 0.
    Aa(x) = 0.

    loop over months (m)

            If A(x,m) = 0. then

                Ca(x) = missing
                exit loop over months

            else

                Ca(x) = Ca(x) + C(x, m)*A(x, m)
                Aa(x) = Aa(x) + A(x, m)

            endif

    end m loop

    If Aa(x) > 0. then

        Ca(x) = Ca(x) / Aa(x)

    endif

end x loop

Notes:

  1. if there are no missing data and an integral number of years is
    considered, then Ca(x) will be equivalent to a straight-forward average
    of all the monthly values (each weighted by month length); otherwise the
    two means can differ.

  2. If we want to invariably compute all three climatologies (monthly,
    seasonal and annual), then the x loop can be placed around all three
    (instead of having 3 separate x loops)

that's all,
Karl

@taylor13

I can't find my notes on this, but there is some description of how the centroid method I came up with is applied in https://cdat.llnl.gov/documentation/utilities/utilities-1.html under "temporal averaging". In general two criteria are set: a minimum coverage of the time period (threshold), and a constraint requiring data to be near the centroid of times sampled.

For a simple time-series (assumed not to be quasi-cyclic), the centroid is calculated as a simple mean of all times with data. This may be useful in deciding whether you can calculate a meaningful mean over an interval that includes a trend. If the mean of the sampling times is too close to one end of the interval, then you'll get a non-representative time-mean.

For quasi-cyclic data like the diurnal cycle or the annual cycle, the centroid is calculated as for a two-dimensional field. For an annual mean to be calculated from monthly values, for example, you can specify that the centroid lie near the point calculated when all months are available. You basically treat each month as a point on an analog clock face and, leaving out the missing months, calculate the centroid of the remaining months. Assume you've centered the clock on a polar coordinate system. If the radial distance to the centroid is less than some threshold, then the mean of the monthly values will give a reasonable annual mean. You might also, of course, set a minimum number of months as a second criterion.
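
As a rough numerical sketch of the clock-face idea (purely illustrative, not CDAT's implementation): place each month on a unit circle, compute the centroid of the non-missing months, and require its radial distance from the center to stay below some threshold.

import numpy as np

def annual_mean_ok(monthly_values, max_centroid_dist=0.1):
    # each month is a point on the unit circle (an analog clock face)
    angles = 2 * np.pi * np.arange(12) / 12.0
    valid = ~np.isnan(np.asarray(monthly_values, dtype=float))
    if not valid.any():
        return False
    # centroid of the positions of the months that have data
    x, y = np.cos(angles[valid]).mean(), np.sin(angles[valid]).mean()
    return np.hypot(x, y) <= max_centroid_dist

annual_mean_ok([np.nan, np.nan] + [280.0] * 9 + [np.nan])  # False: DJF missing, centroid is offset
annual_mean_ok([280.0, np.nan] * 6)                        # True: gaps evenly spaced, centroid at center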

@tomvothecoder
Collaborator

Hi @taylor13, thank you for your input! We are planning to carefully review your notes about weight thresholds and the "centroid" function.

Here are more related comments:

You also sent us a helpful email about this on 7/14/21:

I designed the “missing data” treatment of that function, which I thought was useful and not trivial to include.

It’s not that the missing data can’t be easily handled, but that a user may want to set criteria for excluding locations entirely from a climatology if the amount of missing data is excessive or unrepresentative. For example suppose in some location, all of the winter data are missing; you might want to exclude that location if you’re computing an annual mean climatology because the annual mean would be skewed warm. Even in transition seasons (e.g., spring) if data are missing for the first month of the season, but are available for the last month, the seasonal mean will be skewed and not accurately represent the seasonal mean.

As I recall, CDAT allowed the user to set two criteria for excluding a location (i.e., setting it to “missing”) to mask out locations with insufficient data. The user could require a certain percentage of the data be present and the user could require the “centroid” of the contributing data to be within a certain distance of the location of the centroid of a dataset without missing data. For data that is cyclical (like the annual cycle) the centroid for climatologies is computed by considering each month to be located at the 12 divisions of a clock face. Without missing data the centroid would lie at the center of the circle. If data for the winter months were missing (months 12, 1, and 2), then the centroid would be offset toward the 7. However, if months 1, 3, 5, 7, 9, and 11 were missing, the centroid would lie again at the center of the circle (and the annual mean computed from these values would more likely be accurate than one in which the winter months were missing). By specifying how close to the center of the circle the centroid must be, a user can control when to include a given location in the annual mean climatology. Centroids for individual seasons are similarly computed.

I’m not sure where in the code this all is done, but there is some discussion under https://cdat.llnl.gov/documentation/utilities/utilities-1.html (I meant to send that link earlier but must have not.) Look at the paragraphs under the heading “Specifying Data Coverage Criteria”.

@taylor13

thanks for linking in the earlier input. I think the xcdat strategy should probably be to implement something rather simple like the "seasons" approach suggested in #531 (comment) . Two cases might be commonly encountered. Monthly mean data covering multiple years. Here, if there are no big trends, the multi-year calendar months could be averaged to form a single annual cycle. The other is to compute a time-series of annual means. For this case an annual mean could be calculated when at least one month of data was available for each of the 4 seasons. Seasonal means would be calculated from the months available within each season and then the annual mean would be calculated from all 4 seasons. (for seasonal means, the months should be weighted by the length of each month, and for annual means the seasons should be weighted by the length of each season.)

There are a number of other simple options one might adopt, so perhaps others can weigh in.
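
A compact sketch of the "at least one month per season" rule for a single year of monthly means (purely illustrative; assumes calendar-year DJF/MAM/JJA/SON seasons and month-length weights):

import numpy as np

def annual_mean(monthly, month_len):
    # monthly, month_len: length-12 arrays (Jan..Dec); NaN marks missing months
    monthly = np.asarray(monthly, dtype=float)
    month_len = np.asarray(month_len, dtype=float)
    seasons = [(11, 0, 1), (2, 3, 4), (5, 6, 7), (8, 9, 10)]  # DJF, MAM, JJA, SON

    season_means, season_weights = [], []
    for idx in seasons:
        vals, w = monthly[list(idx)], month_len[list(idx)]
        valid = ~np.isnan(vals)
        if not valid.any():
            return np.nan  # an entire season is missing -> no annual mean
        season_means.append(np.average(vals[valid], weights=w[valid]))
        season_weights.append(w.sum())  # each season weighted by its full length
    return np.average(season_means, weights=season_weights)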

@pochedls
Collaborator Author

pochedls commented Jun 7, 2024

Our last xcdat meeting made me think about this issue a little bit. I think something along the lines of the following could work for the xcdat internal spatial _averager engine (required_weight would need to be exposed to .average(), too), though it will require a lot of thought and testing to ensure robustness:

def _averager(self, data_var: xr.DataArray, axis: List[SpatialAxis], required_weight: Optional[float] = 0.):
    """Perform a weighted average of a data variable.

    This method assumes all specified keys in ``axis`` exist in the data
    variable. Validation of this criterion is performed in
    ``_validate_weights()``.

    Operations include:

    - Masked (missing) data receives zero weight.
    - Perform weighted average over user-specified axes/axis.

    Parameters
    ----------
    data_var : xr.DataArray
        Data variable inside a Dataset.
    axis : List[SpatialAxis]
        List of axis dimensions to average over.
    required_weight : float, optional
        Fraction of data coverage (i.e., weight) needed to return a
        spatial average value.

    Returns
    -------
    xr.DataArray
        Variable that has been reduced via a weighted average.

    Notes
    -----
    ``weights`` must be a DataArray and cannot contain missing values.
    Missing values are replaced with 0 using ``weights.fillna(0)``.
    """

    # note - I don't know if .fillna() does anything here. I think it might have
    #        been something missed in translation from my original averager
    #        function which was much slower
    weights = self._weights.fillna(0)

    # need weights to match data_var dimensionality
    if required_weight > 0.:
        weights, data_var = xr.broadcast(weights, data_var)

    # original code to get averaging dimensions
    dim = []
    for key in axis:
        dim.append(get_dim_keys(data_var, key))

    if required_weight > 0.:
        # sum all weights (assuming no missing values exist)
        weight_sum_all = weights.sum(dim=dim)
        # zero out cells with missing values in data_var
        weights = xr.where(~np.isnan(data_var), weights, 0)
        # sum all weights (including zero for missing values)
        weight_sum_masked = weights.sum(dim=dim)
        # get fraction of weight available
        frac = weight_sum_masked / weight_sum_all

    # original code to get weighted mean
    with xr.set_options(keep_attrs=True):
        weighted_mean = data_var.cf.weighted(weights).mean(dim=dim)

    if required_weight > 0.:
        # nan out values that don't meet specified weight threshold
        weighted_mean = xr.where(frac > required_weight, weighted_mean, np.nan)

    return weighted_mean
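
Under this sketch, a call might look something like the following (hypothetical, since required_weight is not currently part of the public .average() signature):

ds_avg = ds.spatial.average('tas', axis=['X', 'Y'], required_weight=0.9)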

Other things to think about:

  • How does this affect performance (this is more similar to our initial averager, which was slower)?
  • How does this affect regional averaging or non-weighted averaging?
  • Another performance-related enhancement would be to sub-select data before performing regional averages

@taylor13

taylor13 commented Jun 7, 2024

You say the weights must be a data array. Are there other requirements? For time, is it assumed they have been set proportional to the length of each month? If you compute a monthly mean from daily means, with some daily means missing, do the weights for the monthly means get set proportional to the number of days in the month with valid data?

Conceptually, I find it easiest to keep track of masking by enabling the analysis package to compute the full weights (assuming no missing values), based on the coordinate bounds (but with an option for the user to provide these in difficult cases like weird grids). Then, in addition, I would like to make it easy for the user to associate a data array containing "fractional weights", which would indicate when a data value should be down-weighted from its full value. The actual weights are then the product of the full weights array and the fractional weights array. For simple binary masks, the fractions would either be 0 or 1; for unmasked data the fractions would all be 1.
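
A tiny illustration of that bookkeeping (the arrays and values here are made up; the full weights are approximated with cos(lat) for brevity rather than computed from bounds):

import numpy as np
import xarray as xr

lat = [-45.0, 0.0, 45.0]
full_weights = xr.DataArray(np.cos(np.deg2rad(lat)), dims='lat', coords={'lat': lat})
frac_weights = xr.DataArray([1.0, 0.0, 0.5], dims='lat', coords={'lat': lat})  # 0 = masked, 1 = fully valid

actual_weights = full_weights * frac_weights  # the weights actually used in the average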

As an example, consider a latxlon array with data only available over land. Suppose we want to regrid it. On the original grid the analysis package would compute areas for each grid cell based on the lat and lon coordinate bounds. The user would indicate which grid cells had valid data by, for example, setting the fractional weights to 0 for ocean cells and 1 for the land cells. The regridder could use this information and conservatively map the data to some target grid. The user would expect to have returned (along with the regridded field) the fraction of each target cell with contributions from the source cells (i.e., the fractional weights), and as for the original grid, the full weights could be inferred from the lat and lon bounds of the target grid (and the regridder would presumably have calculated these).

The above "averager" function would for calculate weight_sum_all directly from the array containing full weights and would calculate weight_sum_masked from the product of the full weights array and the fractional weights array.

Maybe in effect that is what is already being done (or something essentially equivalent), in which case this comment can be ignored.

@pochedls
Collaborator Author

pochedls commented Jun 8, 2024

@taylor13 - thanks for looking this over and for the comments. This might be more efficient as an in-person discussion (please send me an email if you want to chat more). You have a lot of nice suggestions in this thread. I think some comments might be better placed in separate discussions and, given limited resources, it would be helpful to differentiate ideas across a spectrum from a) nice to have, but not frequently used, to b) very important and commonly used.

You say the weights must be a data array. Are there other requirements? For time, is it assumed they have been set proportional to the length of each month? If you compute a monthly mean from daily means, with some daily means missing, do the weights for the monthly means get set proportional to the number of days in the month with valid data?

My comment was regarding spatial averaging operations (I updated the comment to clarify this) - not temporal averaging. The spatial weight matrix is calculated internally (using the coordinate bounds) or it can be supplied via the public averaging interface [note that the function above is an internal helper function]. The comment was meant to serve as a technical pointer to where we might add code to address this issue (adding a weight threshold for averaging operations). I think we should focus this thread on the issue of applying weight thresholds. Discussion / questions / suggestions on how to generate weights might be better placed in a new discussion or a separate issue. The readthedocs page provides information on how these features work in the existing xcdat package.

Conceptually, I find it easiest to keep track of masking by enabling the analysis package to compute the full weights (assuming no missing values), based on the coordinate bounds (but with an option for the user to provide these in difficult cases like weird grids). Then, in addition, I would like to make it easy for the user to associate a data array containing "fractional weights", which would indicate when a data value should be down-weighted from its full value. The actual weights are then the product of the full weights array and the fractional weights array. For simple binary masks, the fractions would either be 0 or 1; for unmasked data the fractions would all be 1.

The user can supply their own weights via the public .average() function or have xcdat generate spatial weights for them. In both cases – there is only one weight array. The obvious case for providing fractional weights is for regional averages (where a grid cell might get partial weight if it is partly in the averaging domain). We don't currently provide fractional weights. I think there are ways to accomplish this in the existing xcdat package or as a new feature – this seems like a separate discussion or feature request.

As an example, consider a latxlon array with data only available over land. Suppose we want to regrid it. On the original grid the analysis package would compute areas for each grid cell based on the lat and lon coordinate bounds. The user would indicate which grid cells had valid data by, for example, setting the fractional weights to 0 for ocean cells and 1 for the land cells. The regridder could use this information and conservatively map the data to some target grid. The user would expect to have returned (along with the regridded field) the fraction of each target cell with contributions from the source cells (i.e., the fractional weights), and as for the original grid, the full weights could be inferred from the lat and lon bounds of the target grid (and the regridder would presumably have calculated these).

The above "averager" function would for calculate weight_sum_all directly from the array containing full weights and would calculate weight_sum_masked from the product of the full weights array and the fractional weights array.

Maybe in effect that is what is already being done (or something essentially equivalent), in which case this comment can be ignored.

This issue was motivated by dealing with missing values and a particular problem: you have valid data in a small part of your temporal or spatial domain and missing values elsewhere. This can lead to an issue where a very small fraction of data (e.g., one day in a yearly average or one grid cell in a spatial domain) can give you an unrepresentative average. This is currently how the average works (not ideal), though there is a separate issue that has the effect of masking data that has any missing values (also not ideal, but easy to implement). This issue/thread aims to give a little more control by allowing you to specify how much of the data (i.e., the fractional weight) must exist to return a value (otherwise NaN is returned).

I think this scenario is a little different, because the user wants to bring in their own constraints (in this case a separate matrix about land/ocean fraction). This would probably be best handled by the user using xcdat tools (e.g., .get_weights to get the full spatial weight from the coordinate bounds) and deciding how to implement this analysis. If this is a real-world use-case that you'd like help with, I think an xcdat discussion would be a good venue, because it seems like this would require feature requests for the regridder functionality, too (and this thread is mostly about averaging using the existing weight schemes).

@lee1043
Collaborator

lee1043 commented Jun 11, 2024

@pochedls @taylor13 thanks for the discussion and sharing the insights and knowledge. I believe this is important for more precise weighting when NaN values are included.

I had a chance to chat about this briefly with @taylor13 today and am leaving this note to share and to remind myself. Regarding the temporal average bounds, I learned that the code below handles it in CDAT:
https://github.com/CDAT/cdutil/blob/b823b69db46bb76536db7d435e72075fc3975c65/cdutil/times.py#L14

I also acknowledge that, to calculate the spatial average more precisely, the NaN threshold is better defined by the area represented by NaN-containing grid cells than by the number of NaN grid points (extreme case: 10 total data points, 9 near the pole and 1 over the tropics).
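
For example, a count-based threshold and an area-based threshold can disagree (a minimal sketch with made-up numbers):

import numpy as np

# 10 grid cells: 9 small polar cells and 1 large tropical cell (areas are invented)
area = np.array([1.0] * 9 + [50.0])
valid = np.array([True] * 9 + [False])  # only the large tropical cell is NaN

count_frac = valid.mean()                    # 0.9  -> passes an 80% coverage threshold
area_frac = area[valid].sum() / area.sum()   # ~0.15 -> fails the same threshold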

Just for your interest, below is prototype testing code that I wrote for obs4MIPs, which I think takes a similar approach to the one in @pochedls's comment above.

count_NaN_values.pdf (Note: First half for toy dummy data, second half for real-world data.)

My prototype code does not yet have the sophisticated weighting capability that @taylor13 explained. I think I will need to give more thought to that.

@taylor13

Example 6 in https://xcdat.readthedocs.io/en/latest/examples/spatial-average.html indicates that xcdat already gives the user control over providing a time-varying "weights" array along with the data to do spatial averaging. So, most of my input above is already provided for.
Regarding thresholds for including/excluding spatial average values based on weights, you could leave it to the user to do the coding. It's pretty easy, I think, and would ensure that they knew exactly what they were doing. Of course, giving them the option to use whatever algorithm you come up with could be useful to some, as long as it doesn't preclude someone using the averager's output of spatial means and the associated weights to do the filtering after the fact however they wished.
Sorry to have side-tracked the discussion above.

@tomvothecoder tomvothecoder changed the title [Feature]: Set (optional) weight threshold for averaging operations [Enhancement]: Add weight threshold option for averaging operations Jul 30, 2024