Making coordinate variables searchable #201

Open
charles-turner-1 opened this issue Sep 27, 2024 · 1 comment
charles-turner-1 commented Sep 27, 2024

Is your feature request related to a problem? Please describe.

Currently, the ACCESS-NRI Intake Catalog doesn't allow for searching of coordinate variables: for example, searching for st_edges_ocean will return 0 datasets. This can make searching for coordinate variables difficult, with 2 main pain points:

  1. If the coordinate variable is known to be stored in a netCDF file with specific data variables:
    • The user needs to know that these data & coordinate variables are found in the same files, and then search for the data variable in order to access the coordinate variables.
    • In some instances, the user can directly access the coordinate variables by searching the data variables. In others, they need to perform something like
        VARNAME = ...
        COORD = ...
        fname = cat.search(variable=VARNAME).df.loc[0,'file']
        ds = xr.open_dataset(fname)
        coord = ds[COORD]

Although this doesn't require the user to (semi) manually work out what file to open, it's still messy as it requires passing round file names.

  2. In other instances, coordinate variables are stored completely separately. For example, ocean_grid.nc files contain only coordinate variables, and so cannot be found using the catalogue. Currently, the only way to access these files is to search the catalogue to get a handle on the directory structure, then construct a file path and load it, e.g.:
        import os
        from pathlib import Path
        fname = cat.search(variable=VARNAME).df.loc[0,'file']
        dirname = Path(fname).parent
        # os.listdir returns bare file names, so join each one with its directory
        grid_file, = [dirname / file for file in os.listdir(dirname) if file.endswith('_grid.nc')]
        ds = xr.open_dataset(grid_file)
        coord = ds[COORD]

This requires the user to start poking round in directory structures to try to work out where to load their data - which is the problem intake is trying to solve.
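
For reference, the directory-poking workaround can be wrapped in a small helper. This is only a sketch of the current workaround, not a proposed API: the function name find_grid_file and the file name ocean_month.nc are made up for illustration, and it assumes exactly one `*_grid.nc` file sits alongside the data file. Empty placeholder files stand in for real model output.

```python
import tempfile
from pathlib import Path


def find_grid_file(data_file, suffix="_grid.nc"):
    """Return the single file ending in `suffix` that sits next to `data_file`."""
    dirname = Path(data_file).parent
    # Keep full paths so the result can be opened from any working directory.
    matches = sorted(p for p in dirname.iterdir() if p.name.endswith(suffix))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one '*{suffix}' file in {dirname}, found {len(matches)}"
        )
    return matches[0]


# Demonstration with empty placeholder files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "ocean_month.nc"
    grid_path = Path(tmp) / "ocean_grid.nc"
    data_file.touch()
    grid_path.touch()
    found = find_grid_file(data_file)
    found_name = found.name
```

In real use, the path returned by the helper would be passed to xr.open_dataset. The user still has to know that a grid file exists and how it is named, which is exactly the knowledge the catalogue is meant to make unnecessary.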

This has caused some pain when migrating COSIMA recipes from cosima_cookbook to intake.

I also think this might be the same issue as discussed in #63? @aidanheerdegen - seem to be some concerns about coordinates being listed as variables when they shouldn't be there?

Describe the feature you'd like

Searchable coordinates: in the same way that the catalog currently lets you perform searches over variables, it would be useful to be able to do the same on coordinates:

var = cat.search(variable=VARNAME).to_dask()
coord = cat.search(coord=COORD).to_dask()

Doing this is subject to a few constraints:

  1. The catalog needs to know that coordinates & data variables aren't the same & need to be treated differently - xr.combine_by_coords will fail if passed a coordinate variable.
  2. This requires an upstream patch of intake-esm - see issue 660.
  3. Cannot cause serious performance regressions - see above issue.

Proposed Solution

  1. Update intake-esm with @dougiesquire's proposed solution from issue 660.
  2. Add separate coordinate variable fields to the ACCESS-NRI Intake Catalog, rather than just making the same change as in Intake-ESM (data_vars => variables), as this would then confuse coordinates & variables in the ACCESS-NRI Intake Catalog as well as causing concatenation issues. This is implemented on branch 660-coordinate-variables.
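
To illustrate what keeping coordinates in their own field buys, here is a toy, pure-Python sketch. It is not the intake-esm API and not the implementation on the branch: the Row class, the search function, the field names and the example paths are all invented for illustration, and the variable names other than st_edges_ocean are assumed examples.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """One catalogue entry; field names here are illustrative only."""
    path: str
    variable: list = field(default_factory=list)  # data variables
    coord: list = field(default_factory=list)     # coordinate variables


def search(rows, variable=None, coord=None):
    """Return the rows that match every query that was supplied."""
    hits = []
    for row in rows:
        if variable is not None and variable not in row.variable:
            continue
        if coord is not None and coord not in row.coord:
            continue
        hits.append(row)
    return hits


rows = [
    # A file holding a data variable together with its coordinates.
    Row("output000/ocean/ocean_month.nc",
        variable=["temp"], coord=["st_ocean", "st_edges_ocean"]),
    # A grid file holding coordinates only.
    Row("output000/ocean/ocean_grid.nc", coord=["geolon_t", "geolat_t"]),
]

by_coord = [r.path for r in search(rows, coord="st_edges_ocean")]
grid_only = [r.path for r in search(rows, coord="geolon_t")]
as_variable = search(rows, variable="st_edges_ocean")
```

With separate fields, a coordinate search finds both the mixed file and the coordinate-only grid file, while a variable search for a coordinate name still returns nothing, so coordinates never leak into the list of data variables.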

Additional Info

  • Due to the release cycle of Intake-ESM, this solution will probably require us to maintain a fork - at least for some time.
  • I've performance tested the proposed solution: changes in catalogue build times are small (typically ~5%), catalogue read times are similar (typically 5-10%, sometimes faster), and the datastore.csv.gz files written by builder.save() typically roughly double in size.
@charles-turner-1 charles-turner-1 added the enhancement New feature or request label Sep 27, 2024
@charles-turner-1 charles-turner-1 self-assigned this Sep 27, 2024
@aidanheerdegen
Member

> I also think this might be the same issue as discussed in #63? @aidanheerdegen - seem to be some concerns about coordinates being listed as variables when they shouldn't be there?

That was specifically for a project where we were using the intake catalogue as a source for an "experiment explorer", to expose the variables saved in an experiment in a timeline, to help users understand what variables are available at different times in an experiment. For this purpose we really only wanted diagnostic model variables that have a time-varying component.

> Add separate coordinate variable fields to the ACCESS-NRI Intake Catalog

I'm confused. Does this mean

  1. Have a "coordinate" flag (field)?
  2. Move coordinates into a separate catalogue?

Or neither?

BTW this is a somewhat related issue I think about encoding grid information:

#112
