Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order of filters in yaml's #409

Open
rtodling opened this issue Aug 14, 2024 · 7 comments
Open

Order of filters in yaml's #409

rtodling opened this issue Aug 14, 2024 · 7 comments
Assignees
Labels
core development design related issues and improvements

Comments

@rtodling
Copy link
Contributor

As most of you know, the order of filters in the observers matters. Certain actions must be done first, others later, etc. This issue requests that we work to make sure the following two things are done:

  1. That the order of the filters as implemented in the swell yamls is indeed the order they should be applied - logically consistent and also consistent w/ what GSI does.

  2. That the machinery that processes the observation yamls and creates the final yaml for the variational analysis, or UFO, of HOF, does not get shuffled or alphabetically organized. Within a given filter, the order of statements is not important, but the sequence of each filter is of fundamental importance and it should not be changed.

@Dooruk Dooruk added the core development design related issues and improvements label Aug 14, 2024
@Dooruk
Copy link
Collaborator

Dooruk commented Aug 14, 2024

This is duplicate of this issue:

#330

Was there an error you received with a UFO YAML filter? As far as I know, order shouldn't matter for YAMLs, indents and the way they are written as list matter.

Follow-up after the meeting today:

The order of UFO filters matter, JEDI/UFO apply some sort of hierarchy (that is controlled by obs pre filters, obs prior filters and obs post filters options) but this doesn't necessarily ensure full consistency. We should keep the order the see if that helps with the issues GEOS-FP group is encountering.

https://jointcenterforsatellitedataassimilation-jedi-docs.readthedocs-hosted.com/en/latest/inside/jedi-components/ufo/qcfilters/introduction.html#order-of-filter-application

@Dooruk
Copy link
Collaborator

Dooruk commented Aug 15, 2024

The solution might be to use OrderedDict which require a few changes for OOPS YAMLs rendering.

There are multiple locations in SWELL where yaml.safe_load is used, however we could change the ones that require the order to be kept, i.e. RunJedi... tasks and see if that works.

For instance, I don't think the order matters for EVA dictionaries, @asewnath do you agree?

@asewnath
Copy link
Contributor

@Dooruk agreed, order doesn't matter for the eva configs

@ashiklom
Copy link
Collaborator

Point 2 here is a duplicate of (or at least, closely related to) #330. That one is effectively an aesthetic/organizational issue (albeit still a legitimate one with UX implications!) with the order that Swell config files are read and written, but doesn't impact the order in which tasks are actually executed.

However, Point 1 is a different and more important issue. That's about actually making sure that specific tasks (in this case, filters) get executed in a specific order. Trivial example: sqrt(mean(x)) does not give the same answer as mean(sqrt(x)) -- the order in which we apply the mean and sqrt filters matters.

OrderedDict helps, but we have to make sure it's applied consistently, which may be especially tricky in nested data structures.

Lists have a meaningful order. So, one way to enforce this in YAML is to make sure we're always using lists rather than dicts; i.e., the following YAML:

x1:
  a: 5
  b: 6
  c: 7
x2:
  - g: 5
  - h: 6
  - i: 7

...produces the following results:

import yaml
with open("example.yaml", "r") as f:
    dat = yaml.safe_load(f)
# {'x1': {'a': 5, 'b': 6, 'c': 7}, 'x2': [{'g': 5}, {'h': 6}, {'i': 7}]}

Note that dat["x1"] is a list (order preserved) and dat["x2"] is a dict (order not 100% guaranteed to be preserved).

Strictly speaking, a better data structure for capturing order-specificity is probably a Directed Acyclic Graph (DAG). graphlib from the Python standard library supports these:

# `a` depends on `b` and `d`, but the order doesn't matter
# `d` depends on `b`
# `b` depends on `c`
x = {
    "a": {"b", "d"},
    "b": {"c"},
    "d": {"b"}
}

# In what order should we do things?
import graphlib
xts = graphlib.TopologicalSorter(x)
tuple(xts.static_order())
# ('c', 'b', 'd', 'a')

@Dooruk
Copy link
Collaborator

Dooruk commented Aug 16, 2024

Ok, sounds like OrderedDict and graphlib are two potential options.

@ashiklom, does the option you provide require dependency keys? Because I don't think the filters we talk about have them. Without that would graphlib be useful?

I've done some further digging to see what JCSDA is doing, came up with two more options.

  • JCSDA/Skylab uses a workflow ecosystem called "EWOK", and they use the YAML class within ruamel. I don't think this in our gmao-swell spack environment (see following search result), but we could request it the next spack release.

An example code block (model formats are just define datetime):

from ruamel.yaml import YAML

.
.
.

def substitute_template_file(template_file_path, ctx):
    def _load_template(template):
        try:
            if Path(template).is_file():
                with open(template) as f:
                    return f.read()
            else:
                return template
        except:
            if isinstance(template, str):
                return template
    yaml = YAML(typ='rt')
    jenv = jinja2.Environment(loader=jinja2.FunctionLoader(_load_template))
    jenv.filters['jedifnformat'] = jedifnformat
    jenv.filters['geosformat'] = geosformat
    jenv.filters['mpasformat'] = mpasformat
    jenv.filters['ufsformat'] = ufsformat
    yamlfile = jenv.get_template(str(template_file_path)).render(**ctx)
    return NamedDict(yaml.load(yamlfile))

rt is roundtrip parsing, and preserves order and they use NamedDict.

@rtodling
Copy link
Contributor Author

I tested what Doruk had in the branch earlier today -and:

Results from keeping the yaml ordering faithful to what is in the source code do change, but only in the tenth decimal place for the norm reduction (after 60+ iterations) ... in any case, I believe it's better for python/swell not to re-order the yaml entries. If nothing else, it is so much easier to map what we put in the source code/swell and what actually gets used - easier to read the final yaml.

@Dooruk
Copy link
Collaborator

Dooruk commented Sep 25, 2024

I understand we might be moving away from YAMLs in the longer term but just an update on this. ruamel is in the JCSDA unified environment (so we could load as an extra module in modules) but being added to gmao-swell environment as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core development design related issues and improvements
Projects
None yet
Development

No branches or pull requests

6 participants