-
Notifications
You must be signed in to change notification settings - Fork 735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Best practices for multiple process inputs #4311
Comments
There was a loooooooong discussion about this a looooong time ago (one of the first major discussions about the early nf-core guidelines I think). I think a few people will jump on this later with their arguments about this (e.g. @nvnieuwk and @emiller88 for example). However there were two camps, @maxulysse leading the 'best practise' the one you mention, vs the 'separated' one (I think that was led by @drpatelh? ). If I remember correctly (but I don't guarantee this though), the problem that the 'separated input' channel camp 'won' on was that depending on the module, it could lead to an very long/large number of (possibly optional!) tuple of index files that are reference specific having to be included in that one channel. I think people didn't like this as while more 'intuitive' (this files belong to this reference), it because very long and unweildly to read and use. Furthermore when you have optional files, you end up having to a lot of channel manipulation anyway to insert the empty lists. So having it split over multiple lines (i.e. channels) was preferred, as we thought at the time, this could be still dealt with with But I hope someone can correct me if I'm wrong, my memory isn't great atm. |
I really like the flexibility of everything in the same input channel. We just have to figure out a way to keep the input readable for those modules that have a lot of inputs (looking at gatk4/haplotypecaller for example). Something like this might work? I think it could be a great topic for the upcoming maintainers meeting though :)
|
I'm also a fan of having everything in one tuple! |
@bentsherman Love your idea, and that was my take on how we should have tackle this. But we're a community, and we collectively decided to follow @drpatelh's idea on how to deal with it. That being said, if we have some more changes in Nextflow, like named input (modules/workflows level), and a nicer way to deal with non existing input (other than passing over empty values), then it would be another deal, and reworking the way we deal with multiple inputs in the modules will be a hill I'm willing to die for. |
I am a reformed user of crazy-long channel tuples (though that was DSL1), and I would not want to encourage my past self to return to that! All kinds of mess happens. Plus I've not had to use the dark arts of multiMap very often. That said, if joins/combines only happened when the modules were called (so we wouldn't change workflows much, just the module inputs), then I'd be cool with that... |
Hey! It looks prettier, but I am relatively terrified of the amount of combine statements we would need to when calling each module and how error prone that might be |
Thank you all for chiming in. I figured there was some history here, and I understand why you settled on the current approach. While the multiMap solution depends on the fact that Nextflow operators are single-threaded, this implementation detail is very unlikely to change IMO. Also, the nf-core CLI makes it easy to patch modules locally, if e.g. a user wants to take a different approach to process inputs. That being said, maybe we can find some syntax improvements that would satisfy both camps. Long tuples are a minor aesthetic issue (by the way @nvnieuwk 's suggestion works if you add parens to each element), but optional elements are a larger concern, as they are simply not supported yet. What I think will sort this all out is better support for record types as we are discussing in nextflow-io/nextflow#2085. Then you wouldn't need to enumerate the tuple elements in the process, just the type. For example: @ValueObject
class MyRecord {
Map meta
Path sample
Path index1
Optional<Path> index2
Optional<Path> index3
}
process TEST {
input:
MyRecord in
"""
echo ${in.meta} ${in.sample} ${in.index1} ${in.index3} ${in.index3}
"""
}
workflow {
sample = new MySample(
[id: 'me'],
file('sample.fastq'),
file('index1.idx'),
Optional.empty(),
Optional.of(file('index3.idx')),
)
TEST( sample )
} Alternatively if we figure out nullable paths (nextflow-io/nextflow#4293) then that would at least solve the problem of optional elements. I can't offer a simpler alternative to the combination logic because combining is exactly what you are trying to do 😄 Value channels are a shortcut only if you have one index, and Would love to hear what you guys think about the above ideas as it pertains to resolving this debate over multiple inputs. Both of them are already on our roadmap, so they will happen eventually |
I wish it did a magical join behind the scenes: process PROC {
input:
tuple val(meta), path(reads), path(index) And process PROC {
input:
tuple val(meta), path(reads)
tuple val(meta), path(index) Were equivalent. But I'm not sure whether that might give unintended effects, just like it does right now. |
OOOH I like the record type solution. If we could package these with the modules we can cast the right files to it without having the bloat of one massive input tuple while still having all the benefits. Will it be possible to access these in the workflow scripts if it's defined in the module script? |
Wanted to circle back on this thread as I think I've found a comprehensive solution to this multiple inputs problem, which goes beyond record types and into DSL3 territory... I have a POC for fetchngs here: nf-core/fetchngs#309 It is way broader than just this thread, but wanted to share it here for any interested folks. The tl;dr, I think we can avoid the issue of multiple input channels (and long record/tuple inputs) by never calling processes directly with channels... |
Is your feature request related to a problem? Please describe
Howdy folks
I'd like to discuss a common convention I see with nf-core modules, where a process has two separate inputs for e.g. a sample and an index. Here are a few examples I found:
So you have two inputs:
input: tuple val(meta), path(reads) tuple val(meta2), path(index)
This convention works fine as long as you have a single index, in which case you can provide the index as a value channel and it will be "broadcast" to every sample, basically an implicit cross product.
But what if you have multiple indices? The process inputs are not really set up to handle this, so you have to do a bit of hacking:
But now you're wondering if multiMap preserves the order of its inputs, and that question leads down a deep rabbit hole. I have now led multiple people through that rabbit hole, and every time it leads me back to the original problem of multiple inputs. It's the reason why I added this note to the docs.
It's not always nf-core modules that are the cause, just "someone else's process that I'm trying to re-use". In any case, I'm hoping that I can broach the subject and spread this best practice to the community. Are people aware of this issue? Have you debated over this convention in the past? If so I would prefer to build on whatever previous discussions were had.
By the way, here's how I think you SHOULD do it:
Easy! It works in all cases (one-to-one, many-to-one, many-to-many), and it doesn't require you play fast and loose with your dataflow
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: