Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need a better way to get dataset keys #67

Closed
dougiesquire opened this issue May 16, 2023 · 6 comments · Fixed by #181
Closed

Need a better way to get dataset keys #67

dougiesquire opened this issue May 16, 2023 · 6 comments · Fixed by #181
Assignees
Labels
help wanted Extra attention is needed

Comments

@dougiesquire
Copy link
Collaborator

dougiesquire commented May 16, 2023

Currently Intake-ESM dataset keys are constructed using a flaky approach of trying to redact time stamp information from filenames to construct a file id that is combined with the frequency to uniquely define a dataset, see e.g.

def redact_time_stamps(string, fill="X"):

I'm not sure how to reliably get dataset keys when the data are so messy. Going forward, better solutions might be to require that those generating model output:

  • include a file_id attribute in files that ids a file as part of a dataset
  • at least follow particular format when they have to include time info in the filename (e.g. YYYY-MM-DD or something we can be confident is a date)
  • ...
@dougiesquire
Copy link
Collaborator Author

Since #91, the redact_time_stamps function has been replaced by parse_access_filename which matches explicit regex patterns corresponding to known ACCESS output filenames to generate a file id (by redacting time stamps and replacing non-python characters) and extract any time information contained in the filename. This is hopefully a little more robust, but I'm leaving this issue open as things could still be better.

@marc-white marc-white self-assigned this Jul 31, 2024
@marc-white
Copy link
Collaborator

marc-white commented Aug 1, 2024

@dougiesquire is there any reason that parse_access_filename and parse_access_ncfile can't become class methods of the BaseBuilder? We could then set them up to suck in builder-specific attributes for the filename patterns, etc.

@dougiesquire
Copy link
Collaborator Author

Is there value in having them as methods on the class since they don't mutate the state of the object? parse_access_ncfile could definitely receive some builder-specific info.

@marc-white
Copy link
Collaborator

Is there value in having them as methods on the class since they don't mutate the state of the object?

I think so. Having them as a @classmethod means that:

  • The functions remain available to be called elsewhere as BaseBuilder.parse_access_* as required;
  • The standard function inputs can be updated simply by updating a class attribute that lists the patterns for that particular builder (so we've achieved the goal of separating the file patterns for each builder);
  • Conceptually, I think it makes sense that the functions for parsing and altering the file names 'belong' to the Builder classes, given it is those classes exclusively (I think?) using them.

I don't know if this is likely in the future, but it also has the added bonus of making life easier if a Builder ever needs a customized parse_access_*, given we can override the inherited parse_access_* (instead of starting a proliferation of special helper functions).

@dougiesquire
Copy link
Collaborator Author

I'm probably being dense, but I'm not sure how it makes sense for those to be @classmethods (@staticmethods I could understand). Could you open a PR so show what you mean and we could discuss there?

@paolap
Copy link

paolap commented Aug 8, 2024

Just noticed this, I actually used a similar approach in my fork to add a builder class for data psot-processed with our tool mopper:

https://github.com/paolap/access-nri-intake-catalog/blob/aus2200/src/access_nri_intake/source/builders.py

I made the parser method a class method so I could pass, for example
"fpattern"
this is something I can then define in

https://github.com/paolap/access-nri-intake-catalog/blob/aus2200/config/access-mopper.yaml

fpattern: "{version}/{frequency}/{variable}/{variable}_{model}_{exp_id}_{frequency}"

This information then gets used to create a regex string.
In this way I can use the same builder for any dataset that comes out of mopper, as mopper uses a similar pattern to define the directory structure and filenames.

The dates in the filenames for me are defined by CMOR itself when it writes the data so I don't need to pass that information as I know the logic already.
Hope this makes sense but it worked for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants