Configuration file

Daniel Svensson edited this page Jun 25, 2019 · 13 revisions

A configuration file that defines the pipeline needs to be supplied to doepipeline at runtime. The config file is written in YAML. An example config file is available below.

Required key-value pairs in the config file

The following key-value pairs are required in the config file:

design

Required. Specifies the factors and responses to investigate, as well as what design type to use in the optimization phase. Valid keys in design are:

  • type: Required. Design to use. Currently, central composite faced (CCF) is available.
  • factors: Required. Specify one or more factors.
    • <factor-name>: Keys are names used for the factors, and will be used for substitutions into scripts. Values are specified below.
  • responses: Required. Specify one or more responses.
    • <response-name>: Keys are names used for responses, values are specified below.

<factor-name>

Specification of a factor. Valid keys specifying factors are:

  • type: Optional. Defaults to quantitative if not specified. Valid values are:
    • quantitative: Default. Numeric factor which can take real values.
    • ordinal: Numeric factor constrained to integer values.
    • categorical: Categorical values. Must specify which values the factor can take (see 'values' below).
  • low_init: Required for numeric factors. Low starting value.
  • high_init: Required for numeric factors. High starting value.
  • max: Required for numeric factors. Maximum (global) allowed value.
  • min: Required for numeric factors. Minimum (global) allowed value.
  • values: Required for categorical factors. List of values that the factor is constrained to.
  • screening_levels: Optional for numeric factors. Number of levels investigated during screening phase. Default is 5. Different factors may have different values for screening levels.

The low_init and high_init values are only used when the --skip-screening flag is set in the doepipeline command. They set the initial design space for the optimization design (see figure 1 b in the publication), which must be smaller than the design space used with the generalized subset design (GSD). The min and max values define the global design space, which is spanned by the GSD in the screening phase and within which the designs in the optimization phase must always stay. However, we recommend always running the screening step.
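As a rough illustration of how these four values relate, the following Python sketch checks one numeric factor specification. The helper name check_numeric_factor and the ordering rule it enforces (min <= low_init < high_init <= max) are assumptions made for this example. They are not part of doepipeline, and the tool's own validation may differ.

```python
def check_numeric_factor(name, spec):
    """Return a list of problems found in one numeric factor specification.

    Hypothetical helper for illustration; not part of doepipeline.
    """
    required = ("min", "max", "low_init", "high_init")
    missing = [key for key in required if key not in spec]
    if missing:
        return [f"{name}: missing required keys {missing}"]

    problems = []
    # Assumed rule: the initial design space lies inside the global one.
    if not spec["min"] <= spec["low_init"] < spec["high_init"] <= spec["max"]:
        problems.append(f"{name}: expected min <= low_init < high_init <= max")
    return problems


# Values taken from FactorA in the example configuration file.
factor_a = {"min": 0.0, "max": 40.0, "low_init": 0.0, "high_init": 20.0}
print(check_numeric_factor("FactorA", factor_a))
```

A check like this could be run on the parsed config before starting a long optimization, so that a swapped low_init and high_init is caught early.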

<response-name>

Specification of a response. Valid keys specifying responses are:

  • criterion: Required for all responses. Valid values are:
    • maximize / minimize: Response will be maximized or minimized respectively.
    • target: Reach target value (optimum is neither above nor below value).
  • target: Required when there are multiple responses.
  • low_limit: Required when there are multiple responses, for responses with criterion target or maximize, optional otherwise. Indicates lowest acceptable value.
  • high_limit: Required when there are multiple responses, for responses with criterion target or minimize, optional otherwise. Indicates highest acceptable value.
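The rules for multiple responses are easy to get wrong, so here is a minimal Python sketch that encodes only what is stated in the list above. The function name check_responses and the response names in the usage lines are hypothetical. The sketch works on a plain dictionary, as you would get after parsing the responses section of the config file, and it is not how doepipeline itself validates the file.

```python
def check_responses(responses):
    """Return a list of problems found in a parsed 'responses' mapping.

    Hypothetical helper for illustration; not part of doepipeline.
    """
    problems = []
    multiple = len(responses) > 1
    for name, spec in responses.items():
        criterion = spec.get("criterion")
        if criterion not in ("maximize", "minimize", "target"):
            problems.append(f"{name}: invalid or missing criterion")
            continue
        if not multiple:
            # With a single response, only the criterion is listed as required.
            continue
        if "target" not in spec:
            problems.append(f"{name}: target is required with multiple responses")
        if criterion in ("target", "maximize") and "low_limit" not in spec:
            problems.append(f"{name}: low_limit is required for {criterion}")
        if criterion in ("target", "minimize") and "high_limit" not in spec:
            problems.append(f"{name}: high_limit is required for {criterion}")
    return problems


# Two responses, so target and the relevant limit must be given for each.
responses = {
    "ResponseA": {"criterion": "maximize", "target": 0.95, "low_limit": 0.8},
    "ResponseB": {"criterion": "minimize", "target": 10, "high_limit": 60},
}
print(check_responses(responses))
```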

results_file

Required. Indicates the name of the file containing the results from each pipeline run. A results-file will be produced in the working-directory for each experiment. The results-file must contain the response values in the following format:

RESPONSE_1,VALUE
RESPONSE_2,VALUE
RESPONSE_N,VALUE
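The last pipeline step is responsible for producing this file. As a minimal sketch, assuming the response values have already been computed, a Python script could write them as follows. The function names format_results and write_results and the values shown are made up for this example. The response names must match those given under responses in the design.

```python
def format_results(responses):
    """Format a name-to-value mapping as RESPONSE,VALUE lines."""
    return "".join(f"{name},{value}\n" for name, value in responses.items())


def write_results(path, responses):
    """Write the formatted responses to the results file at 'path'."""
    with open(path, "w") as handle:
        handle.write(format_results(responses))


# Example values only; a real pipeline would compute these from its output.
print(format_results({"ResponseA": 0.93, "ResponseB": 12}), end="")
```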

working_directory

Required. Root directory which will contain the results from all iterations and experiments.

pipeline

Required. Ordered list specifying the order of pipeline steps/jobs. Values are <job-name>.

<job-name>

Specification of a pipeline step/job that will be run in the pipeline. Optionally, factors will be substituted. For each step specified in pipeline there must be a <job-name> specification.

  • script: Required. String containing the command that will be executed in this step. May use substitution (see below) to parameterize commands. The contents of script are formatted (factors substituted) and written to a bash file that is executed by doepipeline.
  • factors: Optional. Contains a mapping of the factors that should be used in the current experiment. Valid keys are:
    • <factor-name>: One of the factors specified in design. Each factor must carry the following key-value pair to indicate that the value of the factor should be substituted into the script.
      • substitute: true: Factor will be substituted using templating.
  • slurm: Optional. When specified, doepipeline will submit the script to the SLURM workload management system. Contains the SLURM-specific parameters to be used in the submission of the script. The bash script will contain the specified SLURM parameters and be submitted with sbatch. See the specification for 'MySecondJob' in the example configuration file below.

Specifying factors within pipeline steps

doepipeline uses a simple templating system for substituting factors and other values into scripts. Factors that should be substituted are wrapped in {% ... %}.

Factors are substituted using their names specified in the design. For example, if the value of FactorA should be passed as an argument to my_script.sh, the script specified in the pipeline step should be written as my_script.sh {% FactorA %}; doepipeline then replaces the template with the value of the factor according to the experimental design.
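To make the substitution concrete, the following Python sketch mimics the behaviour described above with a regular expression. It is not doepipeline's actual implementation: the render function, the pattern and the error handling are assumptions made for this example.

```python
import re

# Matches {% Name %}, with optional whitespace around the name.
TEMPLATE = re.compile(r"\{%\s*(\w+)\s*%\}")


def render(script, values):
    """Replace every {% Name %} in 'script' with the matching entry of 'values'.

    Hypothetical helper for illustration; not part of doepipeline.
    """
    def replace(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for template variable {name!r}")
        return str(values[name])

    return TEMPLATE.sub(replace, script)


# One experiment of the design sets FactorA to 12.5.
print(render("my_script.sh {% FactorA %}", {"FactorA": 12.5}))
```

In this sketch, a template with a missing closing percent sign, such as {% FactorA}, does not match the pattern and is left in the script unchanged, so it is worth checking that both delimiters are complete.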

Example configuration file

design:
    # The type of design to use.
    type: ccf

    factors:
        # Specify each factor that you wish to use in the optimization.
        # Change the names to your liking.
        FactorA:
            min: 0.0                # Global minimum value.
            max: 40.0               # Global maximum value.
            low_init: 0.0           # Initial low value.
            high_init: 20.0         # Initial high value.
            type: quantitative      # The type of factor.

        FactorB:
            min: 0
            max: 10
            low_init: 3
            high_init: 7
            type: ordinal

        FactorC:
            # For a categorical (qualitative) factor, specify the different categories like this:
            values:
              - 'stringent_filter'
              - 'lenient_filter'
              - 'no_filter'
            type: categorical

    responses:
        # Specify each response that you wish to use in the optimization.
        # Change the names to your liking.
        ResponseA:
            criterion: maximize     # Maximize/minimize


# Name of the file that each experiment writes its response values to.
results_file: my_results.txt


# The working directory where all run files will be written.
working_directory: ~/my_work_directory


pipeline:
    # Specifies order of pipeline steps/jobs. These names must match the job-names below.
    - MyFirstJob
    - MySecondJob
    - MyThirdJob


MyFirstJob:
    # The script that will be executed is specified here.
    script: >
        bash my_first_script.sh --parameter {% FactorA %}

    factors:
        # The factors must match the factors in the design section above.
        FactorA:
            substitute: true


MySecondJob:
    # The script can be multi-line. First activate a conda environment, then execute my_program
    script: >
        activate my_conda_env && \
        my_program \
            --parameter1 {% FactorB %} \
            --parameter2 {% FactorC %}

    factors:
        FactorB:
            substitute: true
        FactorC:
            substitute: true

    # This pipeline step should be run through SLURM.
    # Set the slurm parameters as you would otherwise do.
    slurm:
        A: <PROJECT>
        c: 2
        t: 00:40:00
        o: second_job_slurm.out
        e: second_job_slurm.error


MyThirdJob:
    # Make sure to save the response to the results_file.
    script: >
        python make_output.py -o {% results_file %}