Recipe yml structure #3

Open
BSchilperoort opened this issue Apr 6, 2023 · 5 comments

Comments

@BSchilperoort
Contributor

I have added an example recipe structure to the repository:

download:
  folder: /home/bart/Data/lsmdata/test/
  years: [1980, 2020]
  bbox: [3, 50, 6, 54]

  datasets:
    era5-land:
      frequency: hourly
      variables:
        - air_temperature:  # will map to 2m_temperature...
            height_m: 2  # optional extra argument
        - dewpoint_temperature:
            height_m: 2

convert:
  standard: ALMA
  flavor: PLUMBER2  # More specific than ALMA.
  folder: /home/bart/Data/lsmdata/output/
  frequency: 1H  # outputs at 1 hour frequency. Pandas-like freq-keyword.
  resolution: 0.01  # output resolution in degrees.

Any thoughts, @SarahAlidoost, @geek-yang ?
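
A recipe like the one above could be checked before any download starts. The sketch below is only an illustration: `DownloadConfig` and `parse_download` are hypothetical names, the recipe is assumed to be already parsed into a dictionary, and the `bbox` order is assumed to be [west, south, east, north], which the recipe itself does not state.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadConfig:
    """Validated contents of the recipe's `download` section (hypothetical)."""

    folder: str
    years: tuple[int, int]
    bbox: tuple[float, float, float, float]


def parse_download(section: dict) -> DownloadConfig:
    """Validate the `download` section of an already-parsed recipe."""
    years = tuple(section["years"])
    if len(years) != 2 or years[0] > years[1]:
        raise ValueError(f"years must be [start, end] with start <= end: {years}")

    bbox = tuple(section["bbox"])
    if len(bbox) != 4:
        raise ValueError(f"bbox must have four values: {bbox}")
    # Assumption: bbox is ordered [west, south, east, north].
    west, south, east, north = bbox
    if west >= east or south >= north:
        raise ValueError(f"bbox is empty or inverted: {bbox}")

    return DownloadConfig(folder=section["folder"], years=years, bbox=bbox)


recipe = {
    "download": {
        "folder": "/home/bart/Data/lsmdata/test/",
        "years": [1980, 2020],
        "bbox": [3, 50, 6, 54],
    }
}
config = parse_download(recipe["download"])
```

Failing early on an inverted year range or bounding box is cheaper than discovering it after a long download has started.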

@SarahAlidoost
Member

@BSchilperoort Thanks, it looks good and has the minimum required information. I think the sections can be reorganized, though. For example, I suggest using the same structure as the springtime recipes for the 'datasets' section. Let's also have a separate section for configurations, so that other settings, such as system settings, can be added in the future. I like having a documentation part as well (see the ESMValTool recipes and config-user for more). Here is my suggestion:

configurations:
  run_directory: /home/bart/Data/lsmdata/test
  download: True # /home/bart/Data/lsmdata/test/download_dir will be created

documentation:
  description:
    Example recipe that downloads two variables from era5_land data and converts
    them to ALMA format.

datasets:
  test:
    dataset: era5-land
    frequency: hourly
    years: [1980, 2020]
    area:
      name: test
      bbox: [3, 50, 6, 54]
    variables:
      - air_temperature:  # will map to 2m_temperature...
          height_m: 2  # optional extra argument
      - dewpoint_temperature:
          height_m: 2

converter: # /home/bart/Data/lsmdata/test/processed will be created
  convention: ALMA
  flavor: PLUMBER2  # More specific than ALMA.
  frequency: 1H  # outputs at 1 hour frequency. Pandas-like freq-keyword.
  resolution: 0.01  # output resolution in degrees.
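
The comments in this suggestion imply that the output folders are derived from `run_directory`. A minimal sketch of that derivation, assuming the subfolder names `download_dir` and `processed` taken from those comments; `derive_directories` is a hypothetical helper, not part of any existing package:

```python
from pathlib import PurePosixPath


def derive_directories(run_directory: str) -> dict:
    """Return the folders implied by the recipe comments (nothing is created)."""
    run_dir = PurePosixPath(run_directory)
    return {
        "download": run_dir / "download_dir",
        "processed": run_dir / "processed",
    }


dirs = derive_directories("/home/bart/Data/lsmdata/test")
```

The helper only computes paths. Code that actually writes data would still need to create the folders, for example with `pathlib.Path.mkdir(parents=True, exist_ok=True)`.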

@BSchilperoort
Contributor Author

BSchilperoort commented Apr 6, 2023

Thanks for the ideas. I do like the documentation part.

I find having "datasets" and "dataset" a bit confusing. How about calling the first one a collection?

Additionally, as the goal is to prepare input data for land surface models, the area and years will be the same for most datasets. So by default the area and years should be set on the collection level, with the possibility of deviating from this for a specific dataset. For example:

collections:
  stemmus_scope_NL:
    years: [1980, 2020]
    area: [3, 50, 8, 54]
  
    dataset: era5-land
      frequency: hourly
      variables:
        - air_temperature

    dataset: CAMS
      years: [2004, 2020]  # overrides 'years' from collection level
      variables:
        - co2

    dataset: dummy_data
      years: [1980, 2003] # No data available for these years.
      variables:
        co2:
          unit: ppm
          value: 350
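
The override rule described above can be sketched as a dictionary merge. This is an illustration under assumptions: the recipe is already parsed, the datasets are held in a list under a hypothetical `sources` key (repeated `dataset:` keys are not valid within a single YAML mapping), and `resolve_sources` is an invented name.

```python
def resolve_sources(collection: dict) -> list:
    """Give every source the collection-level years/area unless it sets its own."""
    defaults = {key: collection[key] for key in ("years", "area") if key in collection}
    # Later keys win in a dict merge, so source-level values override the defaults.
    return [{**defaults, **source} for source in collection["sources"]]


collection = {
    "years": [1980, 2020],
    "area": [3, 50, 8, 54],
    "sources": [
        {"dataset": "era5-land", "frequency": "hourly", "variables": ["air_temperature"]},
        {"dataset": "CAMS", "years": [2004, 2020], "variables": ["co2"]},
    ],
}
era5, cams = resolve_sources(collection)
```

Here `era5` inherits both `years` and `area` from the collection, while `cams` keeps its own `years` and inherits only `area`.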

@geek-yang
Member

> I find having "datasets" and "dataset" a bit confusing. How about calling the first one a collection?

Or maybe something more specific: "datasets" and "source"?

@BSchilperoort
Contributor Author

Well, they are also source datasets. As we're making a superset of those, the term "collection" feels most apt to me. Or use only "collections" and "sources", to avoid the word "dataset" altogether.

But it would probably be best to avoid calling the result a new dataset, as the result should not be shared. Redistribution will probably violate some of the license agreements. We should be careful with licenses and attribute the sources properly; see also #4.

@sverhoeven
Member

What about "catalog" as a container of datasets? See https://schema.org/DataCatalog
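
For illustration, a catalog in schema.org terms could look like the JSON-LD built below. `DataCatalog`, `Dataset`, and the `dataset`, `name`, and `license` properties are schema.org terms; the catalog name reuses the collection name from earlier in the thread, and the license values are placeholders, not the actual license of any source.

```python
import json

# Hypothetical catalog entry; only the schema.org vocabulary is real.
catalog = {
    "@context": "https://schema.org",
    "@type": "DataCatalog",
    "name": "stemmus_scope_NL",
    "dataset": [
        {"@type": "Dataset", "name": "era5-land", "license": "<license URL of the source>"},
        {"@type": "Dataset", "name": "CAMS", "license": "<license URL of the source>"},
    ],
}
print(json.dumps(catalog, indent=2))
```

Keeping a `license` entry per source dataset would also give the attribution concern raised above a natural place in the recipe.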

4 participants