Language improvements: Processes #3714

bentsherman · 2023-03-03T19:03:44Z

bentsherman
Mar 3, 2023
Maintainer

Last year, I created a discussion (#3107) to collect all of the language-related issues with Nextflow into one place and develop some solutions. I received a ton of excellent feedback, and while that discussion is not an official roadmap, it did help us sort through everything and figure out which pieces to prioritize. So thanks again to everyone who participated, and of course anyone is still welcome to add suggestions.

I'd like to use this discussion to show off some improvements related to Nextflow processes. We have several PRs under review, and we are still discussing them internally, but I would also like to collect my thoughts here in order to solidify my own understanding and to receive feedback from the community. So far the discussion has been fragmented across many different issues, so I think we need to have a single place to make sure that everything makes sense and fits together.

I will link the relevant issues and PRs for each feature. Each PR has at least one end-to-end pipeline script, so please check out those examples if you're curious!

Default values for inputs

PR: #3687
Issues: linked in PR

This feature adds a defaultValue option for process inputs. If you invoke a process with fewer arguments than it declares, Nextflow will use the defaultValue specified for each remaining input (or fail if a remaining input has no default). This way you can have some extra "optional" inputs that will use a sensible default if you don't need them in your workflow. The main caveat is that inputs with default values should be declared after inputs with no default; otherwise the default can never take effect in a positional call, because positional arguments are matched to inputs in declaration order.

This should be useful for processes that have many potentially "optional" inputs, like this multiqc module. This process has many path inputs, but they could be declared with defaultValue: [] so that if you only need a few of those paths in your workflow, you don't have to pass the empty defaults yourself.
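To make the declaration-order caveat concrete, here is a minimal sketch in plain Groovy. It is not the code from the PR: the `bindPositional` helper and the map-based model of an input (a `name` plus an optional `defaultValue` key) are assumptions made purely for illustration.

```groovy
// Hypothetical model: each declared input is a Map with a 'name' key and an
// optional 'defaultValue' key. This is NOT Nextflow's internal representation.
def bindPositional(List<Map> inputs, List args) {
    if (args.size() > inputs.size()) {
        throw new IllegalArgumentException('Too many arguments')
    }
    def bound = [:]
    inputs.eachWithIndex { input, i ->
        if (i < args.size()) {
            // an argument was supplied for this position
            bound[input.name] = args[i]
        } else if (input.containsKey('defaultValue')) {
            // no argument supplied: fall back to the declared default
            bound[input.name] = input.defaultValue
        } else {
            throw new IllegalArgumentException("Missing value for input: ${input.name}")
        }
    }
    return bound
}

// Defaults declared last: the trailing input can be omitted.
def defaultsLast = [[name: 'reads'], [name: 'config', defaultValue: []]]
assert bindPositional(defaultsLast, ['r.fq']) == [reads: 'r.fq', config: []]

// Default declared first: the single argument is bound to 'config', so
// 'reads' is left without a value and the default is never used.
def defaultsFirst = [[name: 'config', defaultValue: []], [name: 'reads']]
def failed = false
try {
    bindPositional(defaultsFirst, ['r.fq'])
} catch (IllegalArgumentException e) {
    failed = true
}
assert failed
```

In the second case the default on `config` is dead weight, which is why defaults only help when they trail the required inputs.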

Optional inputs

PR: none
Issues: #1694, #3507

An "optional" process input is essentially an input with a default value of "null", and it would provide a similar benefit of not having to specify every single process input. However, I'm not sure that we need an explicit optional option for inputs like we have for outputs. I think the use cases for optional inputs can be covered by default values (see above) and nullable paths (see below). But I'm open to other arguments for it.

Named inputs

PR: #3712
Issues: Linked in PR

This feature adds a take option for process inputs, analogous to emit for outputs. I'm open to other names, but I'm just following the take/emit used by workflows. I don't think we can use name because of how the code works under the hood.

Named inputs allow you to pass inputs by name when calling a process, so you don't have to worry about the order in which they were declared. Combine it with default values and you'll really be cooking with gas.

As a bonus, if your input has a simple name like val foo, you can automatically refer to it as my_process(foo: ...) without specifying take explicitly. As a bonus bonus, this feature works for workflows too, so any channels declared in the take: section can be passed by name when you call a workflow: my_workflow(foo: ...).

This feature uses an existing Groovy syntax where any named arguments in a function call are collected into a map and passed as the first argument. Here's a simple example:

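// Named arguments are collected into a Map and passed as the first parameter
def printOptions(Map opts = [:], String name = 'User 1') {
  println "Hello, ${name}, here are your options:"
  opts.each { key, value ->
    println "  ${key}: ${value}"
  }
}

// Equivalent to printOptions([foo: 'foo', bar: 'bar'], 'Ben')
printOptions('Ben', foo: 'foo', bar: 'bar')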

There is one caveat with the current approach, however. When a process or workflow is invoked, and the first argument is a map, we have to assume the first argument is that map of named options. But what if the user actually wanted to pass a map to the first process input, like a value channel? Nextflow will assume it is the named arguments map, and it will probably fail. Users can probably get around this ambiguity by designing their process inputs just right, but I have an idea for how to remove the ambiguity altogether as well. See my PR for details.
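The ambiguity can be reproduced in plain Groovy. The sketch below is not Nextflow's actual dispatch code; `splitArgs` is a hypothetical stand-in that applies the same rule described above, namely "if the first argument is a map, treat it as the named arguments".

```groovy
// Hypothetical stand-in for the call-site logic: split the raw arguments into
// named and positional parts using the "leading Map means named args" rule.
def splitArgs(Object... args) {
    if (args.length > 0 && args[0] instanceof Map) {
        return [named: args[0], positional: args.toList().drop(1)]
    }
    return [named: [:], positional: args.toList()]
}

// Named arguments: Groovy collects them into a Map and moves it to the front.
def byName = splitArgs('sample.fq', threads: 4)
assert byName == [named: [threads: 4], positional: ['sample.fq']]

// A Map passed as ordinary data for the first input looks exactly the same,
// so it is mistaken for the named arguments.
def asData = splitArgs([id: 'sample1'])
assert asData == [named: [id: 'sample1'], positional: []]
```

By the time the callee sees the arguments, `splitArgs(id: 'sample1')` and `splitArgs([id: 'sample1'])` are indistinguishable, so the rule cannot tell intent apart from the arguments alone.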

Arity for path inputs and outputs

PR: #3706
Issues: Linked in PR

This feature adds an arity option for path inputs and outputs. The "arity" (aka "cardinality") of a path is the number of files the path is expected to have. It can be a single number like 1 or 2, or a range like 0..1, 1..2, or 1..*.

It also solves a rather annoying quirk in Nextflow that people have been complaining about for years: a path input/output with a glob pattern (e.g. *.foo) will return a single item or a list depending on whether there is only one file. It's very annoying. But now if you define the arity, the path will return a single item only if the arity is "single", meaning it's expected to have at most one file.
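As a rough illustration of that rule, here is a plain Groovy sketch. It is not the implementation in the PR: `parseArity` and `applyArity` are hypothetical helpers, and returning null for an empty 0..1 match is an assumption made for this example.

```groovy
// Hypothetical helper: turn an arity spec ('1', '0..1', '1..*') into bounds.
def parseArity(String spec) {
    def parts = spec.tokenize('.')          // '1..*' -> ['1', '*'], '2' -> ['2']
    int min = parts.first().toInteger()
    def upper = parts.last()
    int max = upper == '*' ? Integer.MAX_VALUE : upper.toInteger()
    return [min: min, max: max]
}

// Hypothetical helper: validate the matched files against the arity, then
// decide between "single item" and "list" from the arity, not the file count.
def applyArity(List files, String spec) {
    def arity = parseArity(spec)
    if (files.size() < arity.min || files.size() > arity.max) {
        throw new IllegalArgumentException(
            "Expected ${spec} file(s) but found ${files.size()}")
    }
    if (arity.max == 1) {
        // "single" arity: at most one file, so return the item itself
        // (assumption: null when nothing matched)
        return files ? files[0] : null
    }
    return files
}

// One matching file: still a list when the arity allows many...
assert applyArity(['a.foo'], '1..*') == ['a.foo']
// ...and a single item only when the arity is "single".
assert applyArity(['a.foo'], '1') == 'a.foo'
```

The point of the sketch is that the return shape depends only on the declared arity, so downstream code no longer has to handle both cases.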

But also, you probably won't even need to specify the arity yourself, because Nextflow can infer a sensible default. If a path is a glob pattern, it will have an arity of 1..*, otherwise it's a single file so the arity will be 1. Similarly for optional outputs, the default arity will be 0..* or 0..1 respectively.

However, because this default arity fixes the single-item-or-list quirk, it is also a breaking change. You will have to remove whatever glue logic you added to handle the quirk, once you upgrade to a Nextflow version with this feature.

UPDATE: The default arity has been removed, so there is no breaking change. If the arity is not specified, the existing behavior is used.

Nullable path inputs and outputs

PR: #2893
Issues: Linked in PR

This feature adds a nullable option for path inputs and outputs. If a path output is declared with nullable: true and its file is not produced by the task, it will emit a "null" path instead of failing. Similarly, a path input can receive such "null" paths if it is also declared nullable.

You might be wondering why we didn't call this feature optional, or how it works with optional outputs. The PR thread shows the back-and-forth we've had on these questions, but there is an important difference between the two. To illustrate, consider two path outputs, one marked optional and one marked nullable. What will they do if both of their files are absent? The nullable path will emit a "null" path; the optional path will emit nothing. See the difference? The nullable path can still trigger downstream computations, but the optional path cannot.

But the main purpose of this feature is to allow nested paths in a tuple to be "optional", because currently they can't be made optional on their own -- the entire tuple must be optional or not.
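The optional/nullable distinction described above can be modeled in a few lines of plain Groovy. This is only a sketch: `emitOutput` is a hypothetical helper, and a channel is approximated here by a plain list of emitted values.

```groovy
// Hypothetical model of a single path output. Returns the list of values the
// output would emit for one task (a List stands in for a channel here).
def emitOutput(String file, Map opts = [:]) {
    if (file != null) {
        return [file]      // file was produced: emit it
    }
    if (opts.nullable) {
        return [null]      // nullable: emit a "null" path
    }
    if (opts.optional) {
        return []          // optional: emit nothing at all
    }
    throw new IllegalStateException('Missing output file')
}

// Both files are absent, but only the nullable output emits a value...
def nullableOut = emitOutput(null, [nullable: true])
def optionalOut = emitOutput(null, [optional: true])
assert nullableOut == [null]
assert optionalOut == []

// ...so only the nullable output can trigger a downstream task.
assert nullableOut.size() == 1
assert optionalOut.size() == 0
```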

Additional thoughts

At this point, I'm quite happy with these PRs as they currently stand. I remain unconvinced that we need an optional: true for inputs, and I would like to resolve the ambiguity with named inputs. Having written everything out, I think all of these features should work together seamlessly. I think only the default values and named inputs PRs are touching the same bits of code, so we'll see what that merge conflict looks like.

I'd like to hear what other users think about these proposed features. Please check out the docs and pipeline example in each PR and reply to this discussion with any feedback!

Addendum: Named inputs with default values

I combined these branches and tested just to make sure they work, and they do. So the combined args parsing logic is as follows:

1. bind the named arguments to their corresponding inputs
2. bind the positional arguments to the remaining inputs in the order in which they were declared
   a. if there are more positional arguments than remaining inputs, throw an error
3. for any excess remaining inputs, use the default value
   a. if any of the excess inputs don't have a default value, throw an error
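The steps above can be sketched in plain Groovy as follows. This is an illustration, not the code from the combined branches: `bindArgs` and the map-based model of an input are hypothetical, and rejecting unknown names is an extra assumption that is not part of the steps listed above.

```groovy
// Hypothetical model: each declared input is a Map with a 'name' key and an
// optional 'defaultValue' key. This is NOT Nextflow's internal representation.
def bindArgs(List<Map> inputs, Map named, List positional) {
    def bound = [:]

    // Bind the named arguments to their corresponding inputs.
    named.each { name, value ->
        // (assumption: a name that matches no declared input is an error)
        if (!inputs.any { it.name == name }) {
            throw new IllegalArgumentException("Unknown input: ${name}")
        }
        bound[name] = value
    }

    // Bind the positional arguments to the remaining inputs, in the order
    // in which they were declared.
    def remaining = inputs.findAll { !bound.containsKey(it.name) }
    if (positional.size() > remaining.size()) {
        throw new IllegalArgumentException('Too many positional arguments')
    }
    positional.eachWithIndex { value, i -> bound[remaining[i].name] = value }

    // Use the default value for any inputs that are still unbound.
    remaining.drop(positional.size()).each { input ->
        if (!input.containsKey('defaultValue')) {
            throw new IllegalArgumentException("Missing value for input: ${input.name}")
        }
        bound[input.name] = input.defaultValue
    }
    return bound
}

def inputs = [
    [name: 'metadata'],
    [name: 'star', defaultValue: []],
    [name: 'hisat2', defaultValue: []],
    [name: 'salmon', defaultValue: []]
]

// Pass hisat2 by name and metadata by position; star and salmon use defaults.
def result = bindArgs(inputs, [hisat2: ['hisat2.log']], ['foo'])
assert result == [metadata: 'foo', star: [], hisat2: ['hisat2.log'], salmon: []]
```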

Here is an example script with the combined functionality. Notice that because every input can be passed by name, I can give a value for hisat2 without having to give a value for star.

process foo {
  input:
    val metadata
    path ('star/*'), take: 'star', defaultValue: []
    path ('hisat2/*'), take: 'hisat2', defaultValue: []
    path ('salmon/*'), take: 'salmon', defaultValue: []
  output:
    stdout
  script:
    """
    echo 'metadata: ${metadata}'
    [[ -d star ]] && ls star || echo 'skipping star directory'
    [[ -d hisat2 ]] && ls hisat2 || echo 'skipping hisat2 directory'
    [[ -d salmon ]] && ls salmon || echo 'skipping salmon directory'
    """
}


workflow {
  foo(metadata: 'foo', hisat2: []) | view
}

vruano · 2023-03-17T02:12:38Z

vruano
Mar 17, 2023

Thanks, this looks great. This is going to spare many of us the headache of maintaining several signatures for the same underlying tool. It should also allow automatic generation of Nextflow "wrappers" for tools with parseable or introspectable argument specs.
