Language improvements: Processes #3714
bentsherman
started this conversation in
Show and tell
Replies: 1 comment
Thanks, this looks great. That is going to spare many a bit of a headache maintaining several signatures for the same underlying tool. Also it should allow automatic generation of nextflow "wrappers" for tools with parseable or introspectable argument specs.
Last year, I created a discussion (#3107) to collect all of the language-related issues with Nextflow into one place and develop some solutions. I received a ton of excellent feedback, and while that discussion is not an official roadmap, it did help us sort through everything and figure out which pieces to prioritize. So thanks again to everyone who participated, and of course anyone is still welcome to add suggestions.
I'd like to use this discussion to show off some improvements related to Nextflow processes. We have several PRs under review, and we are still discussing them internally, but I would also like to collect my thoughts here in order to solidify my own understanding and to receive feedback from the community. So far the discussion has been fragmented across many different issues, so I think we need to have a single place to make sure that everything makes sense and fits together.
I will link the relevant issues and PRs for each feature. Each PR has at least one end-to-end pipeline script, so please check out those examples if you're curious!
Default values for inputs
PR: #3687
Issues: linked in PR
This feature adds a `defaultValue` option for process inputs. If you invoke a process with fewer arguments than are declared, Nextflow will use the `defaultValue` specified for the remaining inputs (or fail if any of them has no default). This way you can have some extra "optional" inputs that will use a sensible default if you don't need them in your workflow. The main caveat is that inputs with default values should be declared after inputs with no default, otherwise the default is useless.
This should be useful for processes that have many potentially "optional" inputs, like this multiqc module. This process has many path inputs, but they could be declared with `defaultValue: []` so that if you only need a few of those paths in your workflow, you don't have to specify those defaults yourself.
Optional inputs
PR: none
Issues: #1694, #3507
An "optional" process input is essentially an input with a default value of "null", and it would provide a similar benefit of not having to specify every single process input. However, I'm not sure that we need an explicit `optional` option for inputs like we have for outputs. I think the use cases for optional inputs can be covered by default values (see above) and nullable paths (see below). But I'm open to other arguments for it.
Named inputs
PR: #3712
Issues: Linked in PR
This feature adds a `take` option for process inputs, analogous to `emit` for outputs. I'm open to other names, but I'm just following the `take`/`emit` used by workflows. I don't think we can use `name` because of how the code works under the hood.
Named inputs allow you to pass inputs by name when calling a process, so you don't have to worry about the order in which they were declared. Combine it with default values and you'll really be cooking with gas.
As a bonus, if your input has a simple name like `val foo`, you can automatically refer to it as `my_process(foo: ...)` without specifying `take` explicitly. As a bonus bonus, this feature works for workflows too, so any channels declared in the `take:` section can be passed by name when you call a workflow: `my_workflow(foo: ...)`.
This feature uses an existing Groovy syntax where any named arguments in a function call are collected into a map and passed as the first argument. Here's a simple example:
There is one caveat with the current approach, however. When a process or workflow is invoked, and the first argument is a map, we have to assume the first argument is that map of named options. But what if the user actually wanted to pass a map to the first process input, like a value channel? Nextflow will assume it is the named arguments map, and it will probably fail. Users can probably get around this ambiguity by designing their process inputs just right, but I have an idea for how to remove the ambiguity altogether as well. See my PR for details.
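To make the ambiguity concrete, here is a minimal Python sketch of the rule described above. It is a model of the behavior, not Nextflow's implementation: the function name `split_call_args` and the sample values are hypothetical, and the only rule it encodes is "a leading map is always treated as the named arguments".

```python
def split_call_args(args):
    """Split raw call arguments into (named, positional).

    Assumption: a leading dict is always read as the named-arguments map,
    which mirrors the caveat described in the text.
    """
    if args and isinstance(args[0], dict):
        return dict(args[0]), list(args[1:])
    return {}, list(args)


# Intended use: the leading map carries inputs passed by name.
named, positional = split_call_args([{"foo": "reads.fq"}, "genome.fa"])
print(named, positional)  # {'foo': 'reads.fq'} ['genome.fa']

# Ambiguous use: the caller meant this map to be the value of the first input,
# but it is consumed as named arguments instead.
meta = {"id": "sample1"}
named, positional = split_call_args([meta, "reads.fq"])
print(named, positional)  # {'id': 'sample1'} ['reads.fq']
```

In the second call the map never reaches the first input, so the process would likely fail with an unknown input name or a missing argument.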
Arity for path inputs and outputs
PR: #3706
Issues: Linked in PR
This feature adds an `arity` option for path inputs and outputs. The "arity" (aka "cardinality") of a path is the number of files the path is expected to have. It can be a single number like `1` or `2`, or a range like `0..1`, `1..2`, or `1..*`.
It also solves a rather annoying quirk in Nextflow that people have been complaining about for years: a path input/output with a glob pattern (e.g. `*.foo`) will return a single item or a list depending on whether there is only one file. It's very annoying. But now if you define the `arity`, the path will return a single item only if the `arity` is "single", meaning it's expected to have at most one file.
But also, you probably won't even need to specify the arity yourself, because Nextflow can infer a sensible default. If a path is a glob pattern, it will have an arity of `1..*`, otherwise it's a single file so the arity will be `1`. Similarly for optional outputs, the default arity will be `0..*` or `0..1` respectively.
However, because this default arity fixes the single-item-or-list quirk, it is also a breaking change. You will have to remove whatever glue logic you added to handle the quirk, once you upgrade to a Nextflow version with this feature.
UPDATE: The default arity has been removed, so there is no breaking change. If the arity is not specified, the existing behavior is used.
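The arity notation and the "single item or list" rule can be modeled in a few lines of Python. This is only a sketch of the semantics described above, not code from the PR: `parse_arity` and `resolve_path` are hypothetical names, and returning `None` for an absent file under `0..1` is my assumption.

```python
def parse_arity(spec):
    """Parse '1', '0..1', or '1..*' into (min, max); max is None for '*'."""
    spec = str(spec).strip()
    if ".." in spec:
        low, high = spec.split("..", 1)
    else:
        low = high = spec
    minimum = int(low)
    maximum = None if high == "*" else int(high)
    if minimum < 0 or (maximum is not None and maximum < minimum):
        raise ValueError(f"invalid arity: {spec!r}")
    return minimum, maximum


def resolve_path(files, arity):
    """Check the file count against the arity, then pick the return shape."""
    minimum, maximum = parse_arity(arity)
    count = len(files)
    if count < minimum or (maximum is not None and count > maximum):
        raise ValueError(f"expected {arity} file(s), found {count}")
    if maximum == 1:
        # "single" arity: one item, or None when the file is absent (assumption)
        return files[0] if files else None
    # any other arity: always a list, even when exactly one file matched
    return list(files)


print(resolve_path(["a.foo"], "1"))     # a.foo
print(resolve_path(["a.foo"], "1..*"))  # ['a.foo']
```

With an explicit arity, the shape of the result depends only on the declaration and no longer on how many files happened to match the glob.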
Nullable path inputs and outputs
PR: #2893
Issues: Linked in PR
This feature adds a `nullable` option for path inputs and outputs. If a path output is declared with `nullable: true` and its file is not produced by the task, it will emit a "null" path instead of failing. Similarly, a path input can receive such "null" paths if it is also declared nullable.
You might be wondering why we didn't call this feature `optional`, or how it works with optional outputs. The PR thread shows the back-and-forth we've had on these questions, but there is an important difference between the two. To illustrate, consider two path outputs, one marked optional and one marked nullable. What will they do if both of their files are absent? The nullable path will emit a "null" path; the optional path will emit nothing. See the difference? The nullable path can still trigger downstream computations, but the optional path cannot.
But the main purpose of this feature is to allow nested paths in a tuple to be "optional", because currently they can't be made optional on their own -- the entire tuple must be optional or not.
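The difference between the two options is easy to see in a toy model. The Python sketch below is an illustration under my own assumptions, not Nextflow code: `emit_output`, `downstream_runs`, and the `mode` parameter are hypothetical, and a channel is modeled as a plain list of emitted values.

```python
def emit_output(file_exists, path, mode="required"):
    """Return the values a path output would emit for one task.

    mode is one of 'required', 'optional', 'nullable' (hypothetical names).
    """
    if file_exists:
        return [path]
    if mode == "nullable":
        return [None]  # a "null" path is still an emitted value
    if mode == "optional":
        return []      # nothing is emitted at all
    raise FileNotFoundError(f"missing output file: {path}")


def downstream_runs(emitted):
    """Each emitted value, null or not, triggers one downstream task."""
    return len(emitted)


print(downstream_runs(emit_output(False, "out.txt", mode="nullable")))  # 1
print(downstream_runs(emit_output(False, "out.txt", mode="optional")))  # 0
```

The same reasoning suggests why nullable helps with tuples: a tuple with one null element is still a complete value that can be emitted, whereas an optional output can only drop the whole tuple.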
Additional thoughts
At this point, I'm quite happy with these PRs as they currently stand. I remain unconvinced that we need an `optional: true` for inputs, and I would like to resolve the ambiguity with named inputs. Having written everything out, I think all of these features should work together seamlessly. I think only the default values and named inputs PRs are touching the same bits of code, so we'll see what that merge conflict looks like.
I'd like to hear what other users think about these proposed features. Please check out the docs and pipeline example in each PR and reply to this discussion with any feedback!
Addendum: Named inputs with default values
I combined these branches and tested just to make sure they work, and they do. So the combined args parsing logic is as follows:
a. if there are more positional arguments than remaining inputs, throw an error
a. if any of the excess inputs don't have a default value, throw an error
Here is an example script with the combined functionality. Notice that because every input can be passed by name, I can give a value for `hisat2` without having to give a value for `star`.