Need a better way to get dataset keys #67

dougiesquire · 2023-05-16T01:16:25Z

Currently, Intake-ESM dataset keys are constructed using a fragile approach: time stamp information is redacted from filenames to produce a file id, which is then combined with the frequency to uniquely define a dataset. See e.g.

access-nri-intake-catalog/src/access_nri_intake/source/utils.py, line 71 in 581633e:

def redact_time_stamps(string, fill="X"):

I'm not sure how to reliably get dataset keys when the data are so messy. Going forward, better solutions might be to require that those generating model output:

- include a file_id attribute in files that identifies a file as part of a dataset
- at least follow a particular format when they have to include time info in the filename (e.g. YYYY-MM-DD or something we can be confident is a date)
- ...


dougiesquire · 2023-06-18T11:50:14Z

Since #91, the redact_time_stamps function has been replaced by parse_access_filename which matches explicit regex patterns corresponding to known ACCESS output filenames to generate a file id (by redacting time stamps and replacing non-python characters) and extract any time information contained in the filename. This is hopefully a little more robust, but I'm leaving this issue open as things could still be better.
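For anyone landing here later, a minimal sketch of the general idea (match a known filename pattern, pull out the time stamp, redact it, then replace characters that are not valid in a Python identifier). This is not the actual parse_access_filename: the patterns, the function name parse_filename and the example filenames are all hypothetical.

```python
import re

# Hypothetical patterns, not the ones used in access-nri-intake-catalog.
# Each pattern has exactly one capture group around the time stamp.
PATTERNS = [
    r"^.*[_.-](\d{4}[_-]\d{2}[_-]\d{2})$",  # e.g. ocean_daily_1990-01-01
    r"^.*[_.-](\d{4}[_-]\d{2})$",           # e.g. iceh.1990-01
    r"^.*[_.-](\d{4})$",                    # e.g. ocean_scalar_1990
]


def parse_filename(stem, fill="X"):
    """Return (file_id, timestamp) for a filename stem (no extension)."""
    timestamp = None
    for pattern in PATTERNS:
        match = re.match(pattern, stem)
        if match:
            timestamp = match.group(1)
            start, end = match.span(1)
            # Redact every digit of the time stamp, keeping the separators
            redacted = re.sub(r"\d", fill, timestamp)
            stem = stem[:start] + redacted + stem[end:]
            break
    # Replace anything that is not valid in a Python identifier
    file_id = re.sub(r"\W", "_", stem)
    return file_id, timestamp


print(parse_filename("ocean_daily_1990-01-01"))
print(parse_filename("ocean_daily_1991-01-01"))
```

Both calls give the same file id but different time stamps, which is what lets files from the same dataset be grouped under one key. A filename that matches none of the patterns is returned with no time stamp, so anything outside the known patterns still needs handling elsewhere.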

marc-white · 2024-08-01T05:04:07Z

@dougiesquire is there any reason that parse_access_filename and parse_access_ncfile can't become class methods of the BaseBuilder? We could then set them up to suck in builder-specific attributes for the filename patterns, etc.

dougiesquire · 2024-08-01T20:44:25Z

Is there value in having them as methods on the class since they don't mutate the state of the object? parse_access_ncfile could definitely receive some builder-specific info.

marc-white · 2024-08-01T23:40:20Z

> Is there value in having them as methods on the class since they don't mutate the state of the object?

I think so. Having them as @classmethods means that:

- The functions remain available to be called elsewhere as BaseBuilder.parse_access_* as required;
- The standard function inputs can be updated simply by updating a class attribute that lists the patterns for that particular builder (so we've achieved the goal of separating the file patterns for each builder);
- Conceptually, I think it makes sense that the functions for parsing and altering the file names 'belong' to the Builder classes, given it is those classes exclusively (I think?) using them.

I don't know if this is likely in the future, but it also has the added bonus of making life easier if a Builder ever needs a customized parse_access_*, given we can override the inherited parse_access_* (instead of starting a proliferation of special helper functions).
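A rough sketch of what I mean, under some assumptions: the PATTERNS attribute, the OceanBuilder subclass and the regexes are made up for illustration, and the method body is a simplified stand-in rather than the real parse_access_filename.

```python
import re


class BaseBuilder:
    # Hypothetical class attribute: filename patterns for this builder,
    # each with one capture group around the time stamp
    PATTERNS = [r"^.*[_.-](\d{4})$"]

    @classmethod
    def parse_access_filename(cls, stem, fill="X"):
        """Return (file_id, timestamp) using the patterns defined on cls."""
        for pattern in cls.PATTERNS:
            match = re.match(pattern, stem)
            if match:
                start, end = match.span(1)
                # Redact the digits of the time stamp, keeping separators
                redacted = re.sub(r"\d", fill, match.group(1))
                return stem[:start] + redacted + stem[end:], match.group(1)
        return stem, None


class OceanBuilder(BaseBuilder):
    # Only the class attribute changes; the inherited method reads it via cls
    PATTERNS = [r"^.*[_.-](\d{4}-\d{2})$"]


print(BaseBuilder.parse_access_filename("ocean_scalar_1990"))
print(OceanBuilder.parse_access_filename("ocean_month_1990-01"))
```

The difference from a @staticmethod is the cls argument: a static method has no reference to the class it was called on, so it could not pick up the subclass's PATTERNS without having them passed in explicitly.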

dougiesquire · 2024-08-02T01:24:36Z

I'm probably being dense, but I'm not sure how it makes sense for those to be @classmethods (@staticmethods I could understand). Could you open a PR to show what you mean and we could discuss there?

paolap · 2024-08-08T05:13:50Z

Just noticed this. I actually used a similar approach in my fork to add a builder class for data post-processed with our tool mopper:

https://github.com/paolap/access-nri-intake-catalog/blob/aus2200/src/access_nri_intake/source/builders.py

I made the parser method a class method so I could pass, for example, "fpattern". This is something I can then define in

https://github.com/paolap/access-nri-intake-catalog/blob/aus2200/config/access-mopper.yaml

fpattern: "{version}/{frequency}/{variable}/{variable}_{model}_{exp_id}_{frequency}"

This information then gets used to create a regex string.
In this way I can use the same builder for any dataset that comes out of mopper, as mopper uses a similar pattern to define the directory structure and filenames.

The dates in the filenames for me are defined by CMOR itself when it writes the data so I don't need to pass that information as I know the logic already.
Hope this makes sense but it worked for me.
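To make the fpattern-to-regex step concrete, here is a minimal sketch of one way it could be done. It is not the code in the fork linked above: the function name fpattern_to_regex, the example path and the assumption that field values contain no "/" or "_" are all illustrative.

```python
import re


def fpattern_to_regex(fpattern):
    """Turn a '{field}' template into a regex with named groups."""
    seen = set()
    parts = []
    # re.split with a capture group keeps the field names at odd positions
    for i, token in enumerate(re.split(r"\{(\w+)\}", fpattern)):
        if i % 2 == 0:
            parts.append(re.escape(token))  # literal text between fields
        elif token in seen:
            parts.append(f"(?P={token})")  # repeated field: backreference
        else:
            seen.add(token)
            # Assumption: field values contain neither "/" nor "_"
            parts.append(f"(?P<{token}>[^/_]+)")
    return "".join(parts)


fpattern = "{version}/{frequency}/{variable}/{variable}_{model}_{exp_id}_{frequency}"
regex = fpattern_to_regex(fpattern)

# Made-up path; the trailing date range is left unparsed here
path = "v1-0/mon/tas/tas_AUS2200_expA_mon_199001-199012.nc"
match = re.match(regex, path)
print(match.groupdict())
```

Fields that appear twice in the template (variable and frequency here) become backreferences, so a path only matches when both occurrences agree. Values that contain underscores would need a looser character class than the one assumed here.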

dougiesquire added the help wanted Extra attention is needed label Jun 13, 2023

dougiesquire mentioned this issue Jun 16, 2023

Improve how filenames are parsed to generate file id #91

Merged

marc-white self-assigned this Jul 31, 2024

marc-white mentioned this issue Aug 2, 2024

Refactor parse_access_* functions into BaseBuilder class #181

Merged

marc-white linked a pull request Aug 2, 2024 that will close this issue

Refactor parse_access_* functions into BaseBuilder class #181

Merged

marc-white closed this as completed in #181 Aug 19, 2024
