Flag files for downstream reporting via `publishDir` #4042

ewels · 2023-06-20T23:01:32Z

New feature

An emerging standard for Nextflow pipelines is a root tower.yml file, used for providing reports to Tower.

A potential alternative is to instead define this metadata as part of publishDir, within the Nextflow config. This has a few advantages:

Removes the need for yet-another-config-file in the repository root
Keeps configuration of published files in a single location, not spread across multiple files
Less Tower-specific, more community friendly

In this location, Nextflow will know about the report status of files during the publish step and could potentially match patterns against actual files created, allowing some kind of metadata with precise file paths + report status to be generated in memory / in some kind of report.

Suggest implementation

My suggestion is to add a new directive: report (int). Non-zero values (or values >0) could indicate that files should be shown within downstream reporting functionality. The integer value itself could then be used as a weighting factor when sorting that list.

The directive should be paired together with the ability to filter the published files for a given process based on filename / a closure.

Usage scenario

Based on the publishDir config for a process in the nf-core/rnaseq pipeline, syntax / usage could potentially look something like this:

  withName: '.*:BAM_RSEQC:RSEQC_READDISTRIBUTION' {
      publishDir = [
          path: { "${params.outdir}/${params.aligner}/rseqc/read_distribution" },
          mode: params.publish_dir_mode,
          saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+         report: { filename -> filename ==~ /.*\.pdf/ ? 10 : null }
      ]
  }

Here, any PDF files published by this process would be given a report priority of 10. An integer greater than 0 indicates that the files should be shown in a report interface; the value 10 is the weighting score used to sort the list of files there.
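
As a rough illustration of the semantics described above (not how Nextflow would necessarily implement it), the following plain-Groovy sketch applies a report closure to a list of published filenames, keeps the files with a weight greater than zero, and sorts them by weight. The `published` list, the extra `.html` rule and the `flagged` variable are made up for the example. The regex is written as a slashy string so that `\.` matches a literal dot.

```groovy
// Hypothetical sketch in plain Groovy; none of these names are Nextflow API.

// The proposed option: return a weight > 0 to flag a file, or null to skip it.
def report = { String filename ->
    if (filename ==~ /.*\.pdf/) return 10
    if (filename ==~ /.*\.html/) return 5   // extra rule, added only to show sorting
    return null
}

// Filenames a process might publish (made up for this example).
def published = ['versions.yml', 'summary.html', 'read_distribution.pdf']

// Evaluate the closure per file, drop unflagged files, sort by descending weight.
def flagged = published
    .collect { name -> [file: name, weight: report(name)] }
    .findAll { it.weight != null && it.weight > 0 }
    .sort { a, b -> b.weight <=> a.weight }

flagged.each { println "${it.weight}\t${it.file}" }
```

A list like `flagged` is the kind of in-memory metadata that could then be written to the optional output file discussed in this proposal.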

The results of this directive then need to be handled somehow. I expect this to be the most contentious part of this suggestion! My suggestion would be a new optional output file, similar to reports and trace files. This could potentially tie into future efforts for provenance tracking of published files.


evanfloden · 2023-06-21T09:41:05Z

The file filtering is a nice touch.

Do you envision either the need or the ability to have multiple report: lines within the directive? For example, if a user wanted to have two different reports with different weightings.
Would it be possible to add meta data such as the display name as we do in the tower.yml?

ewels · 2023-06-21T10:12:49Z

Yes, as publishDir can take a list - see here for an example.
Yes, good point - I guess we need a new directive for that. report_title?

ewels · 2023-06-21T10:13:51Z

Hmm, I keep saying directive but I guess that these are actually new options for the existing publishDir directive, not new directives. Apologies, but hopefully you understand what I'm saying anyway 🙄

evanfloden · 2023-06-22T06:25:42Z

Brilliant. So a complete example could be:

publishDir = [
    [
        path: { "${params.outdir}/${params.trimmer}/fastqc" },
        mode: params.publish_dir_mode,
        report: { filename -> filename ==~ /.*\.pdf/ ? 10 : null },
        report_title: { "FASTQC Report" }
    ],
    [
        path: { "${params.outdir}/${params.trimmer}" },
        mode: params.publish_dir_mode,
        report: { filename -> filename ==~ /.*\.tsv/ ? 5 : null },
        report_title: { "Trimmer Gene Counts" }
    ],
]

jordeu · 2023-06-22T07:05:30Z

In the current implementation we also support defining the mimeType in the YAML configuration. In general we don't use it because it is correctly deduced from the file extension.

And maybe in the future we'll have more things... I can imagine things like choosing the icon to show, selecting a specific viewer for that file, or passing config parameters to some viewers...

We can keep adding reportTitle, reportMimeType... to the publishDir or we can make report expect a map instead of an int.

Something like:

publishDir = [
      [
          path: { "${params.outdir}/${params.trimmer}/fastqc" },
          mode: params.publish_dir_mode,
          report: { path -> path ==~ /.*\.pdf/ ? [weight: 10, title: "FASTQC Report"] : null }
      ],
      [
          path: { "${params.outdir}/${params.trimmer}" },
          mode: params.publish_dir_mode,
          report: { filename -> filename ==~ /.*\.tsv/ ? [weight: 5, title: "Trimmer Gene Counts", mimeType: "text/plain"] : null }
      ],
]
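
One way to read the suggestion above is that a reporting step could accept both the int form and the map form and normalise them to a single shape. The sketch below is plain Groovy with a hypothetical `normalise` helper. It is only an assumption about how the two forms might coexist, not an existing Nextflow feature.

```groovy
// Hypothetical sketch in plain Groovy; `normalise` is not part of Nextflow.

// Accept either the int form or the map form and return a uniform map.
// Returns null when the file should not be reported.
def normalise = { value ->
    if (value == null) return null
    def entry = (value instanceof Map) ? new LinkedHashMap(value) : [weight: value as int]
    return (entry.weight ?: 0) > 0 ? entry : null
}

// Map form, as in the example above.
def reportPdf = { String filename ->
    filename ==~ /.*\.pdf/ ? [weight: 10, title: 'FASTQC Report'] : null
}
// Int form, as in the original proposal.
def reportTsv = { String filename -> filename ==~ /.*\.tsv/ ? 5 : null }

def a = normalise(reportPdf('sample_fastqc.pdf'))  // map passes through unchanged
def b = normalise(reportTsv('counts.tsv'))         // int is wrapped as [weight: 5]
def c = normalise(reportTsv('versions.yml'))       // not matched, so null
```

Extra keys such as `mimeType` would pass through untouched, which is what makes the map form easier to extend than a family of separate options.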

ewels · 2023-06-22T07:32:07Z

Yup, like the idea of a map - much more extensible and clearly associated 👍🏻

maxulysse · 2023-12-19T13:38:14Z

Could one add files from collectFile to this report too?

pditommaso · 2024-01-03T20:22:41Z

I believe we have reached the limit of the publishDir model, above all because it was designed for the DSL1 syntax and never worked properly in the DSL2 world.

If you look at the config of the nf-core/rnaseq pipeline, there are more than 1k lines of code, mostly to configure publishDir!

This should be redesigned from scratch in order to get rid of all the configuration boilerplate and, even more, make it possible to define a formal output definition (i.e. schema) both at process and workflow level.

I think the key points should be:

allow the definition of the data type of each process output
decouple the output type definition from the process definition, likely using a module level schema definition
include other metadata in this schema definition, such as: description, file extensions to be captured, report file, etc
allow composing of processes output schema into a top-level workflow output schema

pditommaso · 2024-01-03T20:43:17Z

Looking again at the rnaseq config, most of the code defines the sub-directory where each process's output should be written.

I think this could be dramatically simplified by reversing the problem. Instead of specifying process by process where the output should be written, I'd like to define an output (directory) tree, listing the processes that contribute to each path, e.g.

'genome': { GFFREAD, GTF2BED, GTF_FILTER, .. }
'genome/index': { SALMON_INDEX, KALLISTO_INDEX, .. }

Though it can still be too verbose. Likely some kind of semantic annotation should be introduced that would allow tagging all processes that need to contribute to a specific path, e.g. genome_files, genome_index, etc. This annotation could then be used to (re)map to the target storage path.
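
The tagging idea can be sketched in plain Groovy as two small maps: one from process to tag, one from tag to target path. This is only an illustration of the data flow. The variable names are hypothetical, and how tags would actually be declared on a process is left open.

```groovy
// Hypothetical sketch in plain Groovy; the maps below are illustrative only.

// Each process declares a semantic tag instead of a publish path.
def processTags = [
    GFFREAD       : 'genome_files',
    GTF2BED       : 'genome_files',
    SALMON_INDEX  : 'genome_index',
    KALLISTO_INDEX: 'genome_index',
]

// A single mapping from tag to target directory.
def tagPaths = [
    genome_files: 'genome',
    genome_index: 'genome/index',
]

// Invert into the output tree: directory -> processes that contribute to it.
def outputTree = processTags
    .groupBy { process, tag -> tagPaths[tag] }
    .collectEntries { dir, procs -> [(dir): procs.keySet() as List] }

outputTree.each { dir, procs -> println "${dir}: ${procs.join(', ')}" }
```

Changing a target directory then means editing one entry in `tagPaths` rather than the publish settings of every process.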

Thoughts?

bentsherman · 2024-01-04T00:02:11Z

It seems there are two ways to think about process output "data types":

the in-memory data type
the output directory structure

For example, a process output that emits a list of files for each task will have an in-memory type of List<Path>, but in the output directory it might just be a subdirectory or glob pattern. You could also think about the file type (i.e. mime type) of these files.

I like the idea of separating these concepts, and defining the output directory structure in terms of the process outputs. I did a similar thing with the annotation API to enable custom types for process inputs:

// process inputs
take 'sample', type: Sample
// file staging
path { sample.files }

And a symmetric approach to enable custom types for process outputs:

// file unstaging
path '$file1', '*.fastq'
// process outputs
emit { new Sample(id, path('$file1')) }, name: 'samples'

Don't worry so much about the syntax, it's just to illustrate how the staging/unstaging of files to/from the task environment is separated from the process inputs/outputs definition in order to enable custom types. Now, the "publishing" of process outputs to the output directory of a workflow run is basically the same thing at a higher level.

What I am imagining is the ability to specify the entire output directory structure of a workflow:

[
  'fastqc': [
    FASTQC.out.html
  ],
  'genome': [
    GFFREAD.out, GTF2BED.out, GTF_FILTER.out, // ...
  ],
  'genome/index': [
    SALMON_INDEX.out, KALLISTO_INDEX.out, // ...
  ],
  'multiqc': [
    MULTIQC.out
  ]
]

Again, just an illustrative syntax. Probably would need to be extended to support metadata and maybe file types. Maybe use a builder syntax instead of a map. Although it would be verbose for a large pipeline, it would be much simpler than the current modules.config approach as seen in nf-core/rnaseq, because there is much less duplicate/boilerplate code.

The main question I have is where to put it. It probably needs to be configurable separate from the pipeline code, which suggests it should be in the config file. But also it seems to be tied to workflow definitions, and ideally the pipeline output schema would be a composition of the subworkflow schemas.

This makes me think we should take a similar approach to the module config effort:

the output schema for a process or workflow is defined in a module config file alongside the module script
a process output schema isn't useful by itself but can be reused in workflow output schemas
a workflow output schema can reference the outputs of processes that it calls just like in the workflow emit: section

I think this is the right direction, but will need to develop a prototype and iterate on it to find a syntax that is intuitive and meets all of our needs. If we can come up with a comprehensive syntax that can handle the complexity of nf-core/rnaseq (plus the extra metadata requested in this issue), it should be easier from there to build some shorthands for simpler use cases.

pditommaso · 2024-01-05T21:56:52Z

> It seems there are two ways to think about process output "data types":
>
> the in-memory data type
>
> the output directory structure

Good point. Though I'd argue the first relates to internal, intra-task "communication", while the latter relates to the external workflow output, which should be the focus of the replacement for publishDir.

Likely the first could be generalised to also capture the workflow output, but I fear it could become too complex.

bentsherman · 2024-01-05T23:45:27Z

I agree I'd rather not try to tackle both at once. Maybe we can design the workflow output schema in a way that doesn't require new functionality in the process output definition.

If we only consider output files, then it should be straightforward. But if we also want to include metadata in the output schema (i.e. val process outputs), I'm not yet sure how to do that. Static metadata like descriptions should be easy. But it sounds like people will want to include things like the meta map in this output schema so that it can be queried by downstream workflows. Since people usually encode metadata in the output file names, maybe we could start with that. I will have to think on it further.

pditommaso · 2024-01-10T02:29:37Z

it turns out, nf-core people have already done most of the job! 😆

https://github.com/nf-core/modules/blob/master/modules/nf-core/parabricks/fq2bam/meta.yml#L48C8-L77

I think we should build on this, add the missing metadata and "formalise" it as a core spec

ewels · 2024-01-10T08:25:04Z

hah, yes we have the meta.yml file. We currently mostly auto-generate this file by parsing the Nextflow code for the process. Then the developer adds the descriptions. The original idea when I made it was that at some point in the future (when we have time™️ ) it could be used to create some kind of visual workflow builder. I figured it might be useful for something either way and didn't want to retrofit it for 1000s of modules, so we put the file in place from the beginning. However, it's used for very little at the moment. Possibly just the website docs I think.

It has a few issues as it stands:

At module level, not pipeline
It's specifying output channels, not which files to publish
It's a separate file - not part of the current pipeline or config files

But having it or something like it as part of a solution could be good 👍🏻

pditommaso · 2024-01-12T15:25:32Z

> At module level, not pipeline

Yeah, that's good! The workflow schema will be managed separately.

> It's specifying output channels, not which files to publish

But it can be extended by also adding the report files, the tags that should be applied to the files, etc.

> It's a separate file - not part of the current pipeline or config files

That's good as well!

bentsherman · 2024-01-17T19:45:24Z

Let's move the discussion of output schema to #4670

stale bot added the stale label Dec 15, 2023

nextflow-io deleted a comment from stale bot Dec 19, 2023

ewels removed the stale label Dec 19, 2023

bentsherman mentioned this issue Jan 17, 2024

Workflow output definitions and schema #4670

Closed

bentsherman mentioned this issue Feb 29, 2024

Workflow output definition #4784

Merged

bentsherman mentioned this issue Oct 25, 2024

Generate output schema from output definition #5213

Open
