Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Sub-config overhaul #66

Open
gasteigerjo opened this issue Oct 22, 2021 · 8 comments
Open

RFC: Sub-config overhaul #66

gasteigerjo opened this issue Oct 22, 2021 · 8 comments

Comments

@gasteigerjo
Copy link
Collaborator

gasteigerjo commented Oct 22, 2021

Overview

This RFC proposes to change how we handle sub-configs in order to make them more flexible. Instead of treating a sub-config as a sub-experiment that is just run as is, this proposal is about using them as definitions that can then be flexibly used in the remaining config.

Example

# Define sub-configs to use them later
nitrogen:
  ...
cyclobutadiene:
  ...
short_training:
  ...
long_training:
  ...
model1:
  ...
model2:
  ...

# The user has to specify how to use the sub-configs
load:
  - model1  # directly specifying a sub-config means always using it (like `fixed`)
  - choice:
    - nitrogen
    - cyclobutadiene
  - choice:
    - short_training
    - long_training

grid:
  other_attribute:
    type: choice
    options:
      - True
      - False

Considerations

We currently assume that always exactly one sub-config is chosen. This proposal allows users to define themselves how they want to combine sub-configs. Specifying them all in a grid would replicate the behavior we had previously, so this is a natural extension.

A sub-config can itself contain fixed/grid/random blocks and further sub-configs, just like they do now. The outer set of configs is then combined with the inner set of configs (outer product), just like now. I.e. loading a single sub-config can result in multiple configs, if the sub-config contains a grid.

Including separate files

This also naturally extends to loading separate config files into one config. The other config file can be treated exactly the same way as a subconfig. The user would only have to provide a keyword, specifying that this is now an external file. I.e.

# Define sub-configs
nitrogen:
  ...
model2:
  ...

load:
  - model2
  - file: ~/test/general.yaml  # A full config file, with fixed, grid, and random blocks.
  - choice:
    - nitrogen
    - simple_file: ~/test/dataset2.yaml  # A simple config file, which only containes fixed parameter values, similar to what Sacred uses.

Possible variant

While defining sub-configs seems pretty clear (that's exactly how we do it currently), specifying their usage still remains an open question. Instead of creating a new top-level block load, we could also integrate loading sub-configs into the fixed/grid/random blocks by introducing a special load keyword. However, this is much more difficult, considering the restrictions of yaml (every key has to be unique).

fixed:
  model1: load  # Always use everything from one sub-config

grid:
  dataset:  # The name doesn't do anything, but it has to be unique (due to yaml)
    type: load  # Specifies that we want to load sub-configs
    options:
      - nitrogen
      - cyclobutadiene
      - file: ~/test/general.yaml
      - simple_file: ~/test/dataset2.yaml
  training:
    type: load
    options:
      - short_training  # Multiple sub-configs can be combined
      - long_training

I prefer the original suggestion. It also seems to align better with user expectations. With this variant, loading a sub-config in the fixed block could lead to multiple different configs, which might be unexpected.

Open questions

There are 2 open questions:

  • What is the best way of specifying load for fixed/grid/random? How do we best specify importing files in these 3 cases? -> No random; use keywords fixed and choice.
  • How to best handle conflicting parameters? We could (1) always raise an error if there are conflicting parameters in separate sub-configs or the config vs. the sub-config. Or (2) we handle them like we currently do, by treating sub-configs as a hierarchy and using the deepest parameter definition, if there is a purely descending path between them (but show a warning). Otherwise, we raise an error.
@gasteigerjo gasteigerjo changed the title Sub-config overhaul RFC: Sub-config overhaul Oct 22, 2021
@n-gao
Copy link
Collaborator

n-gao commented Oct 23, 2021

I find the inclusion of loading subconfigs in fixed very unintuitive if the subconfig contains a grid or random. For my understanding, everything in fixed should represent one parameter setting and should not include dimensions for the cartesian product. Possible consequences would be to either throw an error if something loaded in fixed contains a grid or to simply not permit sub-configs in fixed.

@gasteigerjo
Copy link
Collaborator Author

gasteigerjo commented Oct 24, 2021

So what do you think about the primary config variant I've proposed? Note that its syntax is different from what we've discussed.

@n-gao
Copy link
Collaborator

n-gao commented Oct 25, 2021

The separation between parameters and subconfigs seems like a cleaner solution.
However I still have some questions:

  1. What kind of keywords should be allowed in load?
  2. Is the cartesian product taken of all configurations in load?
  3. If the cartesian product is taken of all configurations in load what does grid do, should it be choice?
  4. So, in the example above if model2 contains 2 configs, nitrogen and cyclobutadiene 4 each and short_training and long_training 1 and 3, respectively, is the total number of configurations randomly 2*4*4*3 or 2*4*4*1 or does random draw independently for each of the 2*4*4 grid configurations?

@gasteigerjo
Copy link
Collaborator Author

gasteigerjo commented Oct 25, 2021

In general, everytime a sub-config is specified, that means that a cartesian product of all of its inner configs is taking with the outer config. This is always the same behavior. The keywords only specify how the sub-configs themselves are chosen.

load should again contain keywords for specifying the 3 variants fixed, grid, and random. Imho, fixed doesn't really need to be explicitly specified, so you can instead directly specify the sub-config. That's what the first entry does. grid uses every sub-config once. The random block was still missing a samples parameter.

An example would then look like this:

nitrogen:
  ... [4 configs]
cyclobutadiene:
  ... [5 configs]
short_training:
  ... [1 configs]
long_training:
  ... [3 configs]
model1:
  ... [2 configs]

load:
  - model1
  - grid:
    - nitrogen
    - cyclobutadiene
  - random:
    samples: 1
    options:
      - short_training
      - long_training

With this we have 2*(4+5)*1 or 2*(4+5)*3 configs in total (depending on the config chosen by random).

Maybe we don't even want a random block for selecting sub-configs, since this is not really useful here?

The point is that you can now combine multiple grids to get their outer product. E.g.

nitrogen:
  ... [4 configs]
cyclobutadiene:
  ... [5 configs]
short_training:
  ... [1 configs]
long_training:
  ... [3 configs]
model1:
  ... [2 configs]

load:
  - model1
  - grid:
    - nitrogen
    - cyclobutadiene
  - grid:
    - short_training
    - long_training

would yield 2*(4+5)*(1+3) configs.

@n-gao
Copy link
Collaborator

n-gao commented Oct 29, 2021

Thank you for the clarifications. I would vote in favor of disallowing random in load, it seems very unnatural to get a random number of configurations. I would also vote in favor of renaming grid to choice. Imho, grid implies that the grid of the inner configurations is taken, so in your example 2*(4*5)*(1*3) which in itself would be identical to not having a keyword at all. choice also aligns with the choice block in the existing grid.

On another note, I would appreciate a clearer definition of what's allowed there. May imported files contain sub configurations? And how would these be treated?

Despite these discussions, I like this style of configuration!

@gasteigerjo
Copy link
Collaborator Author

gasteigerjo commented Nov 2, 2021

Those are great points! I agree with both. So the new config style would look like this:

nitrogen:
  ... [4 configs]
cyclobutadiene:
  ... [5 configs]
model1:
  ... [2 configs]

load:
  - model1
  - choice:
    - nitrogen
    - cyclobutadiene
  - choice:
    - file: ~/test/short_training.yaml
    - simple_file: ~/test/long_training.yaml

Sub-configs are hierarchical, and can contain further sub-configs. The syntax is just like in the main config, just indented once. In other words: Everything is perfectly hierarchical and self-similar. The same is true for normal imported files, which are treated exactly the same as sub-configs -- except for the special seml and slurm dictionaries. The special seml and slurm dictionaries are ignored in imported files and disallowed in sub-configs.

The simple_file is a different story, of course. It can only contain fixed values.

Imho, this way of specifying imports also seems fine. So the only remaining question would then be conflicting parameters. Thoughts?

@n-gao
Copy link
Collaborator

n-gao commented Nov 3, 2021

Regarding the import of files may one only import a single sub config from another file? Something like this:

...
load:
  - file: ~/test/trainings.yaml:short
...

or do we always include all configurations (the ones one would get for seml <collection> add config.yaml)? One can always achieve the same behavior by splitting the larger config into smaller ones which may either encourage a better organization or many small files (though, I doubt this features is going to get used that extensively).

Regarding conflicting parameters, on the one hand, I could see it potentially becoming unclear which parameter overwrites which if we have nested imports. But on the other hand, I would like to keep the functionality to overwrite global parameters but that seems inconsequential.

@gasteigerjo
Copy link
Collaborator Author

I don't see any issue with optionally importing sub-configs from another file, if you find that useful. It's a simple extension.

load:
  - file: ~/test/models.yaml  # Import full config
  - file: ~/test/trainings.yaml:short  # Import sub-config

I think I'm in favor of option 2 I've described earlier. That would still allow overriding global parameters, and proper warnings should make this transparent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants