Unable to subset grouped dataset with coordinate variables in MMT #127

Open
frankinspace opened this issue Nov 15, 2022 · 11 comments
Labels: bug (Something isn't working)

Comments

@frankinspace
Member

Trying to invoke l2ss-py in UAT with the TEMPO_NO2-PROXY_L2_V01 collection results in an error from Harmony. Looking at the logs reveals:

message: "Calling l2ss-py subset with params {'bbox': array([[-180,  180],
       [ -90,   90]]), 'variables': ['/product/vertical_column_total'], 'lat_var_names': ['/geolocation/latitude'], 'lon_var_names': ['/geolocation/longitude'], 'time_var_names': [], 'output_file': '/home/dockeruser/data/2db0715037f6004561e2a0b608d3cd053f610471abfc165189ac3aea064517a8.nc', 'file_to_subset': '/tmp/tmpskdk42p8/2db0715037f6004561e2a0b608d3cd053f610471abfc165189ac3aea064517a8.nc', 'origin_source': 'https://dwq9my7b5nvyc.cloudfront.net/asdc2-uat-protected/TEMPO/TEMPO_NO2-PROXY_L2_V01/2013.07.31/TEMPO_NO2-PROXY_L2_V01_20130731T232959Z_S015G06.nc'}"

Then a stacktrace follows:

message: "'/geolocation/latitude'",
exc_info: "Traceback (most recent call last):
  File "/home/dockeruser/.local/lib/python3.9/site-packages/xarray/core/dataset.py", line 1317, in _construct_dataarray
    variable = self._variables[name]
KeyError: '/geolocation/latitude'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dockeruser/.local/lib/python3.9/site-packages/harmony/cli.py", line 215, in _invoke
    (out_message, out_catalog) = adapter.invoke()
  File "/home/dockeruser/.local/lib/python3.9/site-packages/harmony/adapter.py", line 116, in invoke
    return (self.message, self._process_catalog_recursive(self.catalog))
  File "/home/dockeruser/.local/lib/python3.9/site-packages/harmony/adapter.py", line 181, in _process_catalog_recursive
    output_item = self.process_item(item.clone(), source)
  File "/worker/podaac/subsetter/subset_harmony.py", line 170, in process_item
    result_bbox = subset.subset(**subset_params)
  File "/worker/podaac/subsetter/subset.py", line 1230, in subset
    lat_var_names, lon_var_names, time_var_names = get_coordinate_variable_names(
  File "/worker/podaac/subsetter/subset.py", line 1114, in get_coordinate_variable_names
    time_var_names = [
  File "/worker/podaac/subsetter/subset.py", line 1116, in <listcomp>
    dataset, dataset[lat_var_name]
  File "/home/dockeruser/.local/lib/python3.9/site-packages/xarray/core/dataset.py", line 1410, in __getitem__
    return self._construct_dataarray(key)
  File "/home/dockeruser/.local/lib/python3.9/site-packages/xarray/core/dataset.py", line 1319, in _construct_dataarray
    _, name, variable = _get_virtual_variable(self._variables, name, self.dims)
  File "/home/dockeruser/.local/lib/python3.9/site-packages/xarray/core/dataset.py", line 175, in _get_virtual_variable
    raise KeyError(key)
KeyError: '/geolocation/latitude'",
application: "/home/dockeruser/.local/bin/l2ss_harmony",
workerTimestamp: "2022-11-15T16:36:45.213858Z",
workerLevel: "ERROR"

It appears that because this is a grouped dataset, it is first getting "flattened" so that all variables are at the top level. However, the latitude and longitude variable names are being passed in from Harmony, and l2ss-py tries to use those names to index into the dataset. The names passed in from Harmony still contain paths with / to signify the group, but because the dataset has already been flattened, there are no groups and the indexing fails.
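
For illustration only, here is a minimal sketch of that name mismatch. The double-underscore delimiter is an assumption about how the flattening step renames group members, not necessarily what l2ss-py actually does:

import numpy as np
import xarray as xr

# After flattening, group members live at the top level under renamed keys;
# '__' here is only a stand-in for whatever delimiter the flattening step uses.
flattened = xr.Dataset({
    '__geolocation__latitude': (('y', 'x'), np.zeros((2, 2))),
    '__geolocation__longitude': (('y', 'x'), np.zeros((2, 2))),
})

# Harmony still passes the original group path, so this lookup fails:
try:
    flattened['/geolocation/latitude']
except KeyError as err:
    print('KeyError:', err)  # the same failure shown in the stacktrace above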

@frankinspace frankinspace added the bug Something isn't working label Nov 15, 2022
@danielfromearth
Collaborator

@frankinspace, do you know which version of the l2ss subsetter was run in this case?

@frankinspace
Member Author

At the time reported, 2.1.1-rc.3 was deployed to UAT.

https://harmony.uat.earthdata.nasa.gov/versions shows which images are currently deployed (drop the .uat for ops)

@frankinspace
Member Author

I've started a branch for this bug (https://github.com/podaac/l2ss-py/tree/issues/127), which at the moment just adds the TEMPO data to the list of test data the unit tests run against. Right now test_subset.test_subset_bbox, test_subset.test_specified_variables, test_subset.test_get_time_variable_name, and test_subset.test_subset_size are all failing for this dataset.

@danielfromearth
Collaborator

danielfromearth commented Nov 18, 2022

@frankinspace, I haven't yet had a chance to look directly at the broader issue of the variable names passed from Harmony. But, regarding the failing test function test_subset.test_subset_bbox() (which will hopefully shed light on that broader issue)...

Consider these lines in test_subset.test_subset_bbox(), where an error is raised during the call to subset.compute_coordinate_variable_names(out_ds):

# Context (from the surrounding test): join is os.path.join, xr is xarray, and
# subset is podaac.subsetter.subset; data_dir, test_file, bbox,
# subset_output_dir, and output_file come from the test setup.
subset.subset(
    file_to_subset=join(data_dir, test_file),
    bbox=bbox,
    output_file=join(subset_output_dir, output_file)
)

out_ds = xr.open_dataset(join(subset_output_dir, output_file),
                         decode_times=False,
                         decode_coords=False,
                         mask_and_scale=False)

lat_var_name, lon_var_name = subset.compute_coordinate_variable_names(out_ds)

It seems to me that although the variables are being flattened within the call to subset.subset(), they are returned to a 'grouped' state by the time the subset.compute_coordinate_variable_names(out_ds) call is made. And since subset.compute_coordinate_variable_names() only looks at the top-level group that is passed to it, it doesn't find latitude and longitude, which are inside the geolocation group.

Some options are:

  • change the compute_coordinate_variable_names() function to search a group hierarchy,
  • flatten the dataset again in this test function before calling compute_coordinate_variable_names(), or
  • explicitly grab the geolocation group and pass only that to compute_coordinate_variable_names() (a sketch of this option follows below).
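
For the third option, a rough sketch of what the test could do instead, assuming the subsetted output file keeps its geolocation group (join, subset, subset_output_dir, and output_file as in the snippet above):

# Open only the geolocation group of the subsetted output; latitude and
# longitude then appear as top-level variables of geo_ds, so the existing
# top-level search in compute_coordinate_variable_names() can find them.
geo_ds = xr.open_dataset(join(subset_output_dir, output_file),
                         group='geolocation',
                         decode_times=False,
                         decode_coords=False,
                         mask_and_scale=False)

lat_var_name, lon_var_name = subset.compute_coordinate_variable_names(geo_ds)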

Thoughts? Does this sound reasonable to you?

@frankinspace
Member Author

@danielfromearth, yes, I agree: the compute_coordinate_variable_names function expects the dataset passed into it to already be 'flattened'.

I'm a bit torn on how to handle this one; the test is only using compute_coordinate_variable_names as a convenience for determining the coordinate variables so it can verify the output of the subset call; it's not the focus of that test.

At the same time, it would be nice if compute_coordinate_variable_names could work on grouped datasets out of the box. But if we were to go that route, I would want to make sure both subset and compute_coordinate_variable_names flatten the dataset in exactly the same way, which I think would require some moderate refactoring to pull the flattening out into its own function/module (I think the h5 and netCDF flattening got separated at some point).
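
(For what it's worth, here is a hypothetical sketch of what such a shared flattening helper could look like. The function name, the '__' delimiter, and the per-group xarray approach are illustrative assumptions, not existing l2ss-py code:)

import xarray as xr

GROUP_DELIM = '__'  # assumption: whatever delimiter the real flattening uses

def flatten_groups(file_path, group_paths):
    """Hypothetical shared helper: merge the variables of the named groups into
    one flat Dataset, prefixing each name with its group path. If subset() and
    compute_coordinate_variable_names() both went through something like this,
    they would rename variables identically."""
    flat = xr.open_dataset(file_path)  # root-level variables and attributes
    for path in group_paths:           # e.g. ['/geolocation', '/product']
        grp = xr.open_dataset(file_path, group=path)
        prefix = path.replace('/', GROUP_DELIM)
        renamed = grp.rename({name: f'{prefix}{GROUP_DELIM}{name}' for name in grp.variables})
        # NOTE: real files may share dimension names across groups; a production
        # version would need to handle dimension conflicts during the merge.
        flat = flat.merge(renamed)
    return flat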

@danielfromearth
Collaborator

danielfromearth commented Dec 6, 2022

@frankinspace,

The more I look at this, the more I think refactoring might not be the way to go. The compute_coordinate_variable_names function is expecting an already-flattened dataset. And if we allow compute_coordinate_variable_names to do the flattening, then we end up in a situation where either:

  1. the overall subset procedure does flattening twice (once at the very beginning of subset and then once again when computing variables) no matter what; or
  2. we have to include a check on whether flattening is necessary inside the compute_coordinate_variable_names function, and then do flattening if necessary.

Either way, if compute_coordinate_variable_names does flattening, this will probably require opening the original file (again) with NetCDF4 in append mode (since that's necessary for the group "walk") as well as using xarray, which is what compute_coordinate_variable_names uses now.
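
For reference, a minimal sketch of that kind of group walk with netCDF4 (opened read-only here just to enumerate groups; the file name is the TEMPO granule from the log above and would need to be present locally):

import netCDF4 as nc

def walk_groups(group, path=''):
    """Recursively yield the path of every group in a netCDF4 Dataset; plain
    xarray.open_dataset can only open one group at a time, which is why the
    extra netCDF4 open is needed for the walk."""
    for name, child in group.groups.items():
        child_path = f'{path}/{name}'
        yield child_path
        yield from walk_groups(child, child_path)

# For the TEMPO granule above this yields '/geolocation', '/product', etc.
with nc.Dataset('TEMPO_NO2-PROXY_L2_V01_20130731T232959Z_S015G06.nc') as ds:
    print(list(walk_groups(ds)))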

A 3rd option would be to avoid ever using compute_coordinate_variable_names outside of the overall subset function, i.e., avoid the subset-then-verify approach used in the test_subset.test_subset_bbox() function.


Thoughts? For instance, do you like the idea of checking for a group structure inside the compute_coordinate_variable_names function?

@frankinspace
Member Author

Yes, perhaps compute_coordinate_variable_names should be an internal-only function. The reason we started using it outside of the module itself is that it keeps Jupyter notebooks and examples consistent: they can use the exact same logic for determining coordinate variables as the subsetter itself.

@danielfromearth
Collaborator

danielfromearth commented Dec 8, 2022

Okay, I've figured out a kind of workaround. Instead of refactoring compute_coordinate_variable_names(), I separated the netCDF opening and group flattening into its own function and call it in the test suites just before compute_coordinate_variable_names() is called.

You'll see that change in PR #130. Thoughts?

@danielfromearth
Collaborator

(please pardon the quick off-topic question: is there a plan to add type hints throughout the code base? Is that something that would be worthwhile for another issue and PR?)

@frankinspace
Member Author

(please pardon the quick off-topic question: is there a plan to add type hints throughout the code base? Is that something that would be worthwhile for another issue and PR?)

Yes, I'm all for adding type hints.

@danielfromearth
Collaborator

Okay, there's now a new issue (#136) for following up on type hints.
