Flag files for downstream reporting via publishDir
#4042
Comments
The file filtering is a nice touch.
Hmm, I keep saying directive but I guess that these are actually new options for the existing |
Brilliant. So a complete example could be:
    publishDir = [
        [
            path: { "${params.outdir}/${params.trimmer}/fastqc" },
            mode: params.publish_dir_mode,
            report: { filename -> filename ==~ '.*\\.pdf' ? 10 : null },
            report_title: { "FASTQC Report" }
        ],
        [
            path: { "${params.outdir}/${params.trimmer}" },
            mode: params.publish_dir_mode,
            report: { filename -> filename ==~ '.*\\.tsv' ? 5 : null },
            report_title: { "Trimmer Gene Counts" }
        ],
    ] |
In the current implementation we also support defining the
And maybe in the future we'll have more things... I can imagine things like choosing the icon to show, selecting a specific viewer for that file, or passing config parameters to some viewers... We can keep adding
Something like:
    publishDir = [
        [
            path: { "${params.outdir}/${params.trimmer}/fastqc" },
            mode: params.publish_dir_mode,
            report: { path -> path ==~ '.*\\.pdf' ? [weight: 10, title: "FASTQC Report"] : null }
        ],
        [
            path: { "${params.outdir}/${params.trimmer}" },
            mode: params.publish_dir_mode,
            report: { filename -> filename ==~ '.*\\.tsv' ? [weight: 5, title: "Trimmer Gene Counts", mimeType: "text/plain"] : null }
        ],
    ] |
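The two return styles proposed so far (a bare integer weight, or a map carrying the weight plus a title and other metadata) could be handled by a single normalisation step. The following is only a minimal sketch in plain Groovy: `normaliseReport` is a hypothetical helper, not part of Nextflow, and it assumes that `null`, zero, or negative weights all mean "do not report". The patterns are written as slashy strings so that the backslash needs no extra escaping.

```groovy
// Hypothetical helper, not a Nextflow API: turn the value returned by a
// `report` closure into a metadata map, or null when the file is not reported.
Map normaliseReport(Object result) {
    if (result == null) return null
    if (result instanceof Number) {
        int weight = (result as Number).intValue()
        // Assumption: only weights greater than zero mark a file for reporting.
        return weight > 0 ? [weight: weight] : null
    }
    if (result instanceof Map) return new LinkedHashMap(result as Map)
    throw new IllegalArgumentException("Unsupported report value: ${result}")
}

// Integer style, as in the first proposal.
def intStyle = { filename -> filename ==~ /.*\.pdf/ ? 10 : null }
// Map style, as in the extended proposal.
def mapStyle = { filename ->
    filename ==~ /.*\.tsv/ ? [weight: 5, title: 'Trimmer Gene Counts', mimeType: 'text/plain'] : null
}

assert normaliseReport(intStyle('summary.pdf')) == [weight: 10]
assert normaliseReport(mapStyle('counts.tsv')).title == 'Trimmer Gene Counts'
assert normaliseReport(intStyle('versions.yml')) == null
```

Treating the integer as shorthand for `[weight: n]` would let existing integer-style rules keep working if the map form were adopted later.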
Yup, like the idea of a map - much more extensible and clearly associated 👍🏻 |
Could one add file from collectFile to this |
I believe we reached the limit of the If you look at the config of the nf-core/rnaseq pipeline, there are more than 1k lines of code, mostly to configure publishDir! This should be redesigned from scratch in order to get rid of all the configuration boilerplate and, even more, make it possible to define a formal output definition (i.e. schema) at both the process and the workflow level. I think the key points should be:
|
Still looking at the rnaseq config, most of the code defines the sub-directory where each process output should be written. I think this could be dramatically simplified by reversing the problem. Instead of specifying, process by process, where the output should be written, I'd like to define an output (directory) tree, listing the processes that contribute to each path e.g.
Though, it can still be too verbose. Some kind of semantic annotation should probably be introduced that would allow tagging all processes that need to contribute to a specific path e.g. Thoughts? |
It seems there are two ways to think about process output "data types":
For example, a process output that emits a list of files for each task will have an in-memory type of
I like the idea of separating these concepts, and defining the output directory structure in terms of the process outputs. I did a similar thing with the annotation API to enable custom types for process inputs:
    // process inputs
    take 'sample', type: Sample
    // file staging
    path { sample.files }
And a symmetric approach to enable custom types for process outputs:
    // file unstaging
    path '$file1', '*.fastq'
    // process outputs
    emit { new Sample(id, path('$file1')) }, name: 'samples'
Don't worry so much about the syntax, it's just to illustrate how the staging/unstaging of files to/from the task environment is separated from the process inputs/outputs definition in order to enable custom types.
Now, the "publishing" of process outputs to the output directory of a workflow run is basically the same thing at a higher level. What I am imagining is the ability to specify the entire output directory structure of a workflow:
    [
        'fastqc': [
            FASTQC.out.html
        ],
        'genome': [
            GFFREAD.out, GTF2BED.out, GTF_FILTER.out, // ...
        ],
        'genome/index': [
            SALMON_INDEX.out, KALLISTO_INDEX.out, // ...
        ],
        'multiqc': [
            MULTIQC.out
        ]
    ]
Again, just an illustrative syntax. It would probably need to be extended to support metadata and maybe file types. Maybe use a builder syntax instead of a map. Although it would be verbose for a large pipeline, it would be much simpler than the current
The main question I have is where to put it. It probably needs to be configurable separately from the pipeline code, which suggests it should be in the config file. But it also seems to be tied to workflow definitions, and ideally the pipeline output schema would be a composition of the subworkflow schemas. This makes me think we should take a similar approach to the module config effort:
I think this is the right direction, but will need to develop a prototype and iterate on it to find a syntax that is intuitive and meets all of our needs. If we can come up with a comprehensive syntax that can handle the complexity of nf-core/rnaseq (plus the extra metadata requested in this issue), it should be easier from there to build some shorthands for simpler use cases. |
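To make the "output tree" idea a little more concrete, here is a rough sketch in plain Groovy of how such a map could be resolved into destination paths. Everything in it is hypothetical: `resolveTargets`, the string labels standing in for process outputs, and the sample file names are invented for illustration and do not correspond to any Nextflow API.

```groovy
// Hypothetical output tree: destination sub-directory -> labels of the
// process outputs that contribute to it (strings stand in for channels).
def outputTree = [
    'fastqc'      : ['FASTQC.html'],
    'genome'      : ['GFFREAD', 'GTF2BED'],
    'genome/index': ['SALMON_INDEX'],
    'multiqc'     : ['MULTIQC']
]

// Invented stand-in for the files each process output emitted during a run.
def emittedFiles = [
    'FASTQC.html' : ['sample1_fastqc.html'],
    'GFFREAD'     : ['genome.fasta'],
    'GTF2BED'     : ['genes.bed'],
    'SALMON_INDEX': ['salmon'],
    'MULTIQC'     : ['multiqc_report.html']
]

// Map every emitted file name to its location under the output directory.
Map<String, String> resolveTargets(Map<String, List<String>> tree,
                                   Map<String, List<String>> filesByOutput,
                                   String outdir) {
    Map<String, String> targets = [:]
    tree.each { subdir, outputs ->
        outputs.each { output ->
            // Outputs that emitted nothing simply contribute no files.
            (filesByOutput[output] ?: []).each { file ->
                targets[file] = "${outdir}/${subdir}/${file}".toString()
            }
        }
    }
    return targets
}

def targets = resolveTargets(outputTree, emittedFiles, 'results')
assert targets['genes.bed'] == 'results/genome/genes.bed'
assert targets['salmon'] == 'results/genome/index/salmon'
```

Because the tree is plain data, a pipeline-level tree could in principle be composed by merging subworkflow-level trees, which is one reason a map (or a builder producing one) may be attractive here.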
Good point. Though I'd argue the first is related to internal intra-task "communication", while the latter is related to the external workflow output, which should be the focus of the publishDir replacement. The first could likely be generalised to also capture the workflow output, but I fear it could become too complex |
I agree I'd rather not try to tackle both at once. Maybe we can design the workflow output schema in a way that doesn't require new functionality in the process output definition. If we only consider output files, then it should be straightforward. But if we also want to include metadata in the output schema (i.e. |
it turns out, nf-core people have already done most of the job! 😆 https://github.com/nf-core/modules/blob/master/modules/nf-core/parabricks/fq2bam/meta.yml#L48C8-L77 I think we should build on this, add the missing metadata and "formalise" it as a core spec |
hah, yes we have the It has a few issues as it stands:
But having it or something like it as part of a solution could be good 👍🏻 |
yeah. that's good! the workflow schema will be managed separately
But it can be extended adding also the report files, the tags that should be applied to the files, etc
That's good as well! |
Let's move the discussion of output schema to #4670 |
New feature
An emerging standard for Nextflow pipelines is a root tower.yml file, used for providing reports to Tower.
A potential alternative is to instead define this metadata as part of publishDir, within the Nextflow config. This has a few advantages:
In this location, Nextflow will know about the report status of files during the publish step and could potentially match patterns against actual files created, allowing some kind of metadata with precise file paths + report status to be generated in memory / in some kind of report.
Suggest implementation
My suggestion is to add a new directive: report (int). Non-zero values (or >0) could indicate that files should be shown within downstream reporting functionality. The integer value itself could then be used as a weighting factor when sorting that list.
The directive should be paired with the ability to filter the published files for a given process based on filename / a closure.
Usage scenario
Based on the publishDir config for a process in the nf-core/rnaseq pipeline, syntax / usage could potentially look something like this:
  withName: '.*:BAM_RSEQC:RSEQC_READDISTRIBUTION' {
      publishDir = [
          path: { "${params.outdir}/${params.aligner}/rseqc/read_distribution" },
          mode: params.publish_dir_mode,
          saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+         report: { filename -> filename ==~ '.*\\.pdf' ? 10 : null }
      ]
  }
Here, any PDF files published by this process would be given a report priority of 10. The integer > 0 indicates that they should be shown in a report interface, and the value 10 gives a weighting score for sorting the list of files there.
The results of this directive then need to be handled somehow. I expect this to be the most contentious part of this suggestion! My suggestion would be a new optional output file, similar to reports and trace files. This could potentially tie into future efforts for provenance tracking of published files.
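As a rough illustration of how the results might be handled, the sketch below applies a `report` closure to a list of published file names, keeps the files that receive a weight greater than zero, and sorts them by descending weight. It is plain Groovy under stated assumptions: `buildReportManifest` and the sample file names are hypothetical, and the tie-break on file name is an arbitrary choice, not something proposed in this issue.

```groovy
// Hypothetical sketch, not a Nextflow API: collect the reportable files for
// one process and order them by weight (highest first).
List<Map> buildReportManifest(List<String> publishedFiles, Closure reportRule) {
    List<Map> entries = publishedFiles.collect { name ->
        [file: name, weight: reportRule(name)]
    }
    // Drop files the rule did not select (null or non-positive weight).
    List<Map> reported = entries.findAll { entry ->
        entry.weight != null && entry.weight > 0
    }
    // Highest weight first; ties are broken alphabetically (an assumption).
    return reported.sort { a, b ->
        (b.weight <=> a.weight) ?: (a.file <=> b.file)
    }
}

// Same shape as the closure in the usage scenario, extended with a TSV rule.
def rule = { filename ->
    if (filename ==~ /.*\.pdf/) return 10
    if (filename ==~ /.*\.tsv/) return 5
    return null
}

def manifest = buildReportManifest(
    ['counts.tsv', 'versions.yml', 'read_distribution.pdf'], rule)
assert manifest*.file == ['read_distribution.pdf', 'counts.tsv']
```

A list like `manifest` could then be serialised into the optional output file suggested above, alongside the existing report and trace files.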