
Flag files for downstream reporting via publishDir #4042

Open

ewels opened this issue Jun 20, 2023 · 16 comments

@ewels
Member

ewels commented Jun 20, 2023

New feature

An emerging standard for Nextflow pipelines is a root tower.yml file, used for providing reports to Tower.
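For context, a tower.yml reports block maps file path patterns to display metadata, roughly along these lines (an illustrative sketch only; the patterns and display strings here are made up):

reports:
  "**/multiqc_report.html":
    display: "MultiQC HTML report"
  "**/*.read_distribution.pdf":
    display: "RSeQC read distribution plot"
    mimeType: "application/pdf"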

A potential alternative is to instead define this metadata as part of publishDir, within the Nextflow config. This has a few advantages:

  • Removes the need for yet-another-config-file in the repository root
  • Keeps configuration of published files in a single location, not spread across multiple files
  • Less Tower-specific, more community friendly

Defined in this location, Nextflow would know about the report status of files during the publish step and could match patterns against the actual files created, allowing metadata with precise file paths and report status to be generated in memory or written out as some kind of report.

Suggested implementation

My suggestion is to add a new directive: report (int). Non-zero (or > 0) values would indicate that files should be shown within downstream reporting functionality. The integer value itself could then be used as a weighting factor when sorting that list.

The directive should be paired with the ability to filter the published files for a given process based on filename / a closure.

Usage scenario

Based on the publishDir config for a process in the nf-core/rnaseq pipeline, syntax / usage could potentially look something like this:

  withName: '.*:BAM_RSEQC:RSEQC_READDISTRIBUTION' {
      publishDir = [
          path: { "${params.outdir}/${params.aligner}/rseqc/read_distribution" },
          mode: params.publish_dir_mode,
          saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+         report: { filename -> filename ==~ /.*\.pdf/ ? 10 : null }
      ]
  }

Here, any PDF files published by this process would be given a report priority of 10. The integer being > 0 indicates that they should be shown in a report interface, and the value 10 gives a weighting score for sorting the list of files there.

The results of this directive then need to be handled somehow. I expect this to be the most contentious part of this suggestion! My suggestion would be a new optional output file, similar to the execution report and trace files. This could potentially tie into future efforts for provenance tracking of published files.
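To make that last point concrete, such an output file could be a simple manifest written at the end of the run, perhaps something like this (an entirely hypothetical format and file name, echoing the rnaseq paths above):

published:
  - path: results/star_salmon/rseqc/read_distribution/SAMPLE1.read_distribution.pdf
    process: BAM_RSEQC:RSEQC_READDISTRIBUTION
    report: 10
  - path: results/multiqc/multiqc_report.html
    process: MULTIQC
    report: 100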

@evanfloden
Member

The file filtering is a nice touch.

  1. Do you envision either the need or the ability to have multiple report: lines within the directive? For example, if a user wanted to have two different reports with different weightings.

  2. Would it be possible to add metadata such as the display name, as we do in tower.yml?

@ewels
Member Author

ewels commented Jun 21, 2023

  1. Yes, as publishDir can take a list - see here for an example.
  2. Yes, good point - I guess we need a new directive for that. report_title?

@ewels
Member Author

ewels commented Jun 21, 2023

Hmm, I keep saying directive but I guess that these are actually new options for the existing publishDir directive, not new directives. Apologies, but hopefully you understand what I'm saying anyway 🙄

@evanfloden
Member

evanfloden commented Jun 22, 2023

Brilliant. So a complete example could be:

publishDir = [
    [
        path: { "${params.outdir}/${params.trimmer}/fastqc" },
        mode: params.publish_dir_mode,
        report: { filename -> filename ==~ /.*\.pdf/ ? 10 : null },
        report_title: { "FASTQC Report" }
    ],
    [
        path: { "${params.outdir}/${params.trimmer}" },
        mode: params.publish_dir_mode,
        report: { filename -> filename ==~ /.*\.tsv/ ? 5 : null },
        report_title: { "Trimmer Gene Counts" }
    ],
]

@jordeu
Collaborator

jordeu commented Jun 22, 2023

In the current implementation we also support defining the mimeType in the YAML configuration. In general we don't use it because it is correctly deduced from the file extension.

And maybe in the future we'll have more things... I can imagine options like choosing the icon to show, selecting a specific viewer for that file, or passing config parameters to some viewers...

We can keep adding reportTitle, reportMimeType... to publishDir, or we can make report expect a map instead of an int.

Something like:

publishDir = [
      [
          path: { "${params.outdir}/${params.trimmer}/fastqc" },
          mode: params.publish_dir_mode,
          report: { path -> path ==~ /.*\.pdf/ ? [weight: 10, title: "FASTQC Report"] : null }
      ],
      [
          path: { "${params.outdir}/${params.trimmer}" },
          mode: params.publish_dir_mode,
          report: { filename -> filename ==~ /.*\.tsv/ ? [weight: 5, title: "Trimmer Gene Counts", mimeType: "text/plain"] : null }
      ],
]

@ewels
Member Author

ewels commented Jun 22, 2023

Yup, like the idea of a map - much more extensible and clearly associated 👍🏻

@stale stale bot added the stale label Dec 15, 2023
@nextflow-io nextflow-io deleted a comment from stale bot Dec 19, 2023
@ewels ewels removed the stale label Dec 19, 2023
@maxulysse
Contributor

Could one add files from collectFile to this report too?

@pditommaso
Member

I believe we have reached the limit of the publishDir model, above all because it was designed for the DSL1 syntax and never worked properly in the DSL2 world.

If you look at the config of the nf-core/rnaseq pipeline, there are more than 1k lines of code, mostly to configure publishDir!

This should be redesigned from scratch in order to get rid of all the configuration boilerplate and, even more, make it possible to define a formal output definition (i.e. schema) both at process and workflow level.

I think the key points should be:

  • allow the definition of the data type of each process output
  • decouple the output type definition from the process definition, likely using a module level schema definition
  • include other metadata in this schema definition, such as: description, file extensions to be captured, report file, etc.
  • allow composing of process output schemas into a top-level workflow output schema

@pditommaso
Member

pditommaso commented Jan 3, 2024

Looking again at the rnaseq config, most of the code is there to define the sub-directory where the process output should be written.

I think it could be dramatically simplified by reversing the problem. Instead of specifying, process by process, where the output should be written, I'd like to define an output (directory) tree, listing the processes that contribute to each path, e.g.

'genome': { GFFREAD, GTF2BED, GTF_FILTER, .. }
'genome/index': { SALMON_INDEX, KALLISTO_INDEX, .. }

Though it can still be too verbose. Likely some kind of semantic annotation should be introduced to tag all processes that need to contribute to a specific path, e.g. genome_files, genome_index, etc. Then this annotation could be used to (re)map outputs to the target storage path.
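As a rough sketch of that idea (every name here is hypothetical), the tag-to-path remapping could then live in a single block, instead of repeating the path for every process:

# processes are tagged with e.g. 'genome_files' or 'genome_index',
# and the tags are remapped to target storage paths in one place
outputPaths:
  genome_files: "genome"
  genome_index: "genome/index"
  alignments: "${params.aligner}"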

Thoughts?

@bentsherman
Member

It seems there are two ways to think about process output "data types":

  1. the in-memory data type
  2. the output directory structure

For example, a process output that emits a list of files for each task will have an in-memory type of List<Path>, but in the output directory it might just be a subdirectory or glob pattern. You could also think about the file type (i.e. mime type) of these files.

I like the idea of separating these concepts, and defining the output directory structure in terms of the process outputs. I did a similar thing with the annotation API to enable custom types for process inputs:

// process inputs
take 'sample', type: Sample
// file staging
path { sample.files }

And a symmetric approach to enable custom types for process outputs:

// file unstaging
path '$file1', '*.fastq'
// process outputs
emit { new Sample(id, path('$file1')) }, name: 'samples'

Don't worry so much about the syntax, it's just to illustrate how the staging/unstaging of files to/from the task environment is separated from the process inputs/outputs definition in order to enable custom types. Now, the "publishing" of process outputs to the output directory of a workflow run is basically the same thing at a higher level.

What I am imagining is the ability to specify the entire output directory structure of a workflow:

[
  'fastqc': [
    FASTQC.out.html
  ],
  'genome': [
    GFFREAD.out, GTF2BED.out, GTF_FILTER.out, // ...
  ],
  'genome/index': [
    SALMON_INDEX.out, KALLISTO_INDEX.out, // ...
  ],
  'multiqc': [
    MULTIQC.out
  ]
]

Again, just an illustrative syntax. Probably would need to be extended to support metadata and maybe file types. Maybe use a builder syntax instead of a map. Although it would be verbose for a large pipeline, it would be much simpler than the current modules.config approach as seen in nf-core/rnaseq, because there is much less duplicate/boilerplate code.

The main question I have is where to put it. It probably needs to be configurable separate from the pipeline code, which suggests it should be in the config file. But also it seems to be tied to workflow definitions, and ideally the pipeline output schema would be a composition of the subworkflow schemas.

This makes me think we should take a similar approach to the module config effort:

  • the output schema for a process or workflow is defined in a module config file alongside the module script
  • a process output schema isn't useful by itself but can be reused in workflow output schemas
  • a workflow output schema can reference the outputs of processes that it calls just like in the workflow emit: section

I think this is the right direction, but will need to develop a prototype and iterate on it to find a syntax that is intuitive and meets all of our needs. If we can come up with a comprehensive syntax that can handle the complexity of nf-core/rnaseq (plus the extra metadata requested in this issue), it should be easier from there to build some shorthands for simpler use cases.

@pditommaso
Member

It seems there are two ways to think about process output "data types":

  1. the in-memory data type
  2. the output directory structure

Good point. Though I'd argue the first is related to internal, intra-task "communication", while the latter is related to the external workflow output, which should be the focus of the replacement for publishDir.

Likely the first could be generalised to also capture the workflow output, but I fear it could become too complex.

@bentsherman
Member

I agree I'd rather not try to tackle both at once. Maybe we can design the workflow output schema in a way that doesn't require new functionality in the process output definition.

If we only consider output files, then it should be straightforward. But if we also want to include metadata in the output schema (i.e. val process outputs), I'm not yet sure how to do that. Static metadata like descriptions should be easy. But it sounds like people will want to include things like the meta map in this output schema so that it can be queried by downstream workflows. Since people usually encode metadata in the output file names, maybe we could start with that. I will have to think on it further.

@pditommaso
Member

it turns out, nf-core people have already done most of the job! 😆

https://github.com/nf-core/modules/blob/master/modules/nf-core/parabricks/fq2bam/meta.yml#L48C8-L77

I think we should build on this, add the missing metadata and "formalise" it as a core spec
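For reference, the output section of an nf-core module meta.yml lists the named output channels with a type, description and file pattern; the report metadata discussed earlier in this issue could be layered on top, roughly like this (the report block is a hypothetical addition, not part of the current nf-core spec, and the descriptions are illustrative):

output:
  - bam:
      type: file
      description: BAM file produced by fq2bam
      pattern: "*.bam"
  - qc_metrics:
      type: file
      description: Per-sample QC metrics
      pattern: "qc_metrics"
      report:
        weight: 10
        title: "Parabricks QC metrics"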

@ewels
Member Author

ewels commented Jan 10, 2024

hah, yes we have the meta.yml file. We currently mostly auto-generate this file by parsing the Nextflow code for the process. Then the developer adds the descriptions. The original idea when I made it was that at some point in the future (when we have time™️) it could be used to create some kind of visual workflow builder. I figured it might be useful for something either way and didn't want to retrofit it for 1000s of modules, so we put the file in place from the beginning. However, it's used for very little at the moment. Possibly just the website docs, I think.

It has a few issues as it stands:

  • At module level, not pipeline
  • It's specifying output channels, not which files to publish
  • It's a separate file - not part of the current pipeline or config files

But having it or something like it as part of a solution could be good 👍🏻

@pditommaso
Member

At module level, not pipeline

yeah. that's good! the workflow schema will be managed separately

It's specifying output channels, not which files to publish

But it can be extended to also add the report files, the tags that should be applied to the files, etc.

It's a separate file - not part of the current pipeline or config files

That's good as well!

@bentsherman
Member

Let's move the discussion of output schema to #4670
