Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[proposal] Add local module named configuration template #4422

Closed
pditommaso opened this issue Oct 17, 2023 · 9 comments
Closed

[proposal] Add local module named configuration template #4422

pditommaso opened this issue Oct 17, 2023 · 9 comments

Comments

@pditommaso
Copy link
Member

pditommaso commented Oct 17, 2023

Context

The use of DSL2 introduces the use of module components that allow the inclusion and re-use a process more than one time with different names.

A common requirement in this scenario is that a process configuration needs to be customised, for example, to provide a specific publishDir path or command arguments.

Currently, this need is addressed by creating a monolithic configuration file in which the process configuration is provided by using the target process name. For example here

https://github.com/nf-core/rnaseq/blob/master/conf/modules.config#L174-L192

In some case, there's a conditional logic that needs to be duplicated in the workflow script. For example here:

https://github.com/nf-core/rnaseq/blob/master/conf/modules.config#L209-L224

This approach has several disadvantages:

  • Monolithic configuration file
  • The process configuration is out of the context from the workflow code
  • Very fragile; it's enough a typo in the config file to break the process configuration
  • The use of monolithic breaks the workflow modularization
  • Very hard to maintain.

Solution

A possible solution could be to introduce a module-level configuration file that allows the definition of named configuration template that can applied to one or more processes in the same module.

For example the following configuration file could be added in a workflow module:

config.FOO.ext.args = "--x one --y two" 
config.FOO.publishDir = [path: "/some/path"]

config.BAR.ext.args = "--x three --y four" 
config.BAR.publishDir = [path: "/other/path/${params.suffix}"]

Note: in the example above, FOO and BAR are arbitrary identifiers the user chooses. They have no relationship with the process name where they are meant to be used.

In the workflow module, they could be used using the following syntax:

include { PROC1 } from './module/path1'
include { PROC2 } from './module/path2'

workflow NAME {
  PROC1(config.FOO)
  PROC2(config.BAR) 
}

Note: explorative syntax, it may change

In the above example, the process PROC1 is executed having the config setting defined by the template with name FOO is applied. Same with with process PROC2 with config BAR.

Conditional configuration could be defined using usual if and expression statement e.g.

workflow NAME {
  def cfg = params.something ? config.FOO : config.BAR
  PROC1(cfg)
}

This approach has several advantages:

  • keep the config settings in the nextflow config file, aside the module where it's expected to be used
  • promote a declarative approach for configuration files
  • decouple the configuration from the target process name and therefore, increase reusability
  • Avoid the verbosity of withNameselector mechanism

Caveats

  • Named config templates should only accessible in the module where they are defined
  • A conventional config file should used e.g. module.config or main.config
  • The withName selector should be able to override settings defined by a config template (to not break existing config e.g. groundswell)
@pditommaso pditommaso changed the title Add module local named configuration template [proposal] Add module local named configuration template Oct 17, 2023
@pditommaso
Copy link
Member Author

See also #4322

@pditommaso
Copy link
Member Author

Tagging @bentsherman @drpatelh

@pditommaso pditommaso changed the title [proposal] Add module local named configuration template [proposal] Add local module named configuration template Oct 17, 2023
@pditommaso
Copy link
Member Author

Possible implementation:

  1. Load the module.config when parsing the module including
  2. The module config should the parsed with ConfigParser, adding the resolved params in the evaluation context
  3. Add the module config object in the ScriptMeta
  4. Apply the config if the setting are no otherwise specified here

@bentsherman
Copy link
Member

See also: #4205

Additional notes from hackathon:

  • need to evaluate process selectors vs new proposed syntax for module config
  • nf-core modules will only come with a test.config to be loaded by the test harness like a normal config, not used in real pipelines
  • pipeline developers will install a module and add a nextflow.config / module.config to each module with pipeline-specific config
  • process modules will not have module config, only subworkflows and above, because they are invoking the processes
  • module config can be overwritten by outer subworkflows, then pipeline-level config, then user config, then CLI options

I don't like that the proposed syntax (which I'll call the "programmatic config") requires you to create a new name and reference it in the pipeline script, but I see why you did it, so that the config can be set programmatically in the pipeline code.

The question is whether this programmatic config is needed. All of the if statements in the modules.config are used only to replicate the workflow logic to target a specific process, but they are eliminated if the config is defined in the module scope, even with conventional process selectors.

My impression from talking to developers at the hackathon is that they like process selectors but they just need a way to use them in the context of a module, to ensure that the selector won't be applied to completely different processes which they didn't intend.

Both process selectors and the programmatic config should work equally well, but the former is more familiar while the latter imposes some extra steps with dubious value.

For example, suppose workflow A calls subworkflow B which calls process C, and A wants to apply some config to C (taking precedence over config from B). By nf-core convention, each component will reside in its own module. The programmatic config would look like this:

// A/module.config
CONFIG_C.ext.args = '--foo'

// A/main.nf
workflow A {
  B(CONFIG_C)
}

// B/module.config
CONFIG_C.ext.args = '--bar'
CONFIG_C.publishDir = params.outdir

// B/main.nf
workflow B {
  take:
  config_c_extra
  main:
  C.config = CONFIG_C + config_c_extra
  C()
}

// C/main.nf
process C {
}

This strikes me as more complicated than it needs to be. The user is basically re-implementing the logic of process selectors in their pipeline code. While this programmability enables things that are not possible with (declarative) process selectors, I don't think uses are asking for those things.

As a baseline, I will first implement the module config with process selectors. If it enables the config to be fully declarative then that should be enough. Perhaps then we can also try the programmatic config and see what users prefer.

@bentsherman
Copy link
Member

Rewriting the above example with process selectors:

// A/module.config
withName:'A:B:C' {
  ext.args = '--foo'
}

// A/main.nf
workflow A {
  B()
}

// B/module.config
withName:'B:C' {
  ext.args = '--bar'
  publishDir = params.outdir
}

// B/main.nf
workflow B {
  C()
}

// C/main.nf
process C {
}

Much simpler!

@pditommaso
Copy link
Member Author

Ok, we talked. Let's go ahead for the first iteration using standard process selector syntax applied to process *included* in the local module.

@bentsherman
Copy link
Member

Okay, the PR and rnaseq POC are working quite well. I was able to remove all of the if statements without triggering any config selector warnings, even for processes that aren't always used. This is because the module config is not applied to the process until the process is actually invoked, so there will never be any hanging configs. On the other hand, if you misspell a process selector then it should still warn you.

One minor detail is that the module config file name must always be the same, currently it is module.config. This is because Nextflow loads the module config by convention, the module script does not reference the config file directly.

As an alternative, we could do something like includeConfig for module scripts, which would allow the module script to explicitly import a config file by name:

// main.nf
includeConfig 'module.config'

process FOO {
  // ...
}

This way, users could name the config file whatever they want. On the other hand, it wouldn't make sense to support both approaches, so if we went this way then users would just have to specify the same include line of code many times. So I think I would rather not do this.

@ewels
Copy link
Member

ewels commented Jan 16, 2024

Related: made an issue about reducing config duplication of publishDir: #4661

@bentsherman
Copy link
Member

Closing, see #4510 (comment)

@bentsherman bentsherman closed this as not planned Won't fix, can't repro, duplicate, stale Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants