Nextflow output tags don't propagate to subdirectories on S3 #5619

Open
willbradshaw opened this issue Dec 17, 2024 · 1 comment

Comments

@willbradshaw

Bug report

Summary

When Nextflow publishes output files to S3 using the output block, tags specified in that block do not propagate to files inside subdirectories created by processes. As a result, some published files do not have the specified tags, but instead inherit the tags of the files they were copied from in the working directory. This causes problems for tag-based applications like S3 lifecycle rules.

Expected behavior and actual behavior

Under the new workflow-level output feature, files can be published using the publish section of a workflow combined with an output block configuring the output directory. This includes the option to add tags to the published files, which can then be used in downstream applications on AWS.

For various reasons (see footnote 1), some Nextflow processes output directories containing files, rather than flat files. In this case, when the output of that process is published, I would expect the specified tags from the output block to be applied to the files in those process output directories. Instead, these descendant files inherit their tags from the files in the working directory they are copied from. Among other things, this means that tag-based downstream applications will treat these files as though they were temporary files in a working directory; for example, S3 lifecycle rules intended to delete temporary files after a certain time period will also delete these files.
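To illustrate the failure mode, here is a minimal Python sketch of how a tag-filtered lifecycle rule would select objects. It is a simplified model and not the S3 implementation: the function name `matches_rule` and the rule `expire_temporary` are hypothetical, and the keys are written relative to `<base_dir>`. The tags are the ones I observed on the two published files described in the reproduction steps.

```python
# Simplified model of a tag-filtered lifecycle rule: a rule selects an object
# when every tag in the rule's filter is present on the object with the same
# value. Illustration only; this is not how S3 is implemented internally.

def matches_rule(object_tags: dict[str, str], rule_tags: dict[str, str]) -> bool:
    """Return True if the object carries every tag required by the rule filter."""
    return all(object_tags.get(key) == value for key, value in rule_tags.items())


# Hypothetical rule intended to expire temporary working-directory files.
expire_temporary = {"nextflow.io/temporary": "true"}

# Tags as observed on the published files (keys relative to <base_dir>).
published = {
    "output/results/flat_file.txt": {
        "nextflow_file_class": "publish",
        "nextflow.io/temporary": "false",
    },
    "output/results/file_dir/dir_file.txt": {
        # Inherited from the working-directory copy instead of the output block.
        "nextflow.io/temporary": "true",
    },
}

# Published files that the rule would (wrongly) treat as temporary.
swept = [key for key, tags in published.items() if matches_rule(tags, expire_temporary)]
print(swept)
```

Under this model only `dir_file.txt` is selected by the rule, which matches the behavior I see: the file in the subdirectory is expired along with the real temporary files.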

To my knowledge (which is certainly not exhaustive), there is currently no way to correctly tag these files in subdirectories using Nextflow.

Steps to reproduce the problem

A minimal working example of the problem is available here. To run it, first edit params.base_dir in nextflow.config to point to an S3 bucket you have write access to.

The workflow produces two output files, <base_dir>/output/results/flat_file.txt and <base_dir>/output/results/file_dir/dir_file.txt:

  • In my hands, when inspecting the output files on S3, the workflow tags flat_file.txt as expected, with nextflow_file_class = publish and nextflow.io/temporary = false.
  • Conversely, dir_file.txt is not tagged as expected, and just inherits nextflow.io/temporary = true from the working-directory file it was copied from. This causes it to be swept up by lifecycle rules that use nextflow.io/temporary as a marker of temporary status.
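The check above can also be scripted. The following is a rough sketch, assuming an `s3_client` object that behaves like `boto3.client("s3")` (only its `get_object_tagging` call is used); the function name `find_mistagged` and the bucket name in the usage comment are hypothetical.

```python
# Sketch: compare the tags on published S3 objects against the tags expected
# from the output block. `s3_client` is assumed to behave like
# boto3.client("s3"); it is passed in so the function has no hard dependency.

def find_mistagged(s3_client, bucket, keys, expected_tags):
    """Return {key: actual_tags} for each key missing any expected tag value."""
    mistagged = {}
    for key in keys:
        response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
        # S3 returns tags as a list of {"Key": ..., "Value": ...} entries.
        actual = {tag["Key"]: tag["Value"] for tag in response["TagSet"]}
        if any(actual.get(k) != v for k, v in expected_tags.items()):
            mistagged[key] = actual
    return mistagged


# Example usage (requires boto3 and AWS credentials; bucket name is made up):
#
#   import boto3
#   find_mistagged(
#       boto3.client("s3"),
#       "my-bucket",
#       ["output/results/flat_file.txt", "output/results/file_dir/dir_file.txt"],
#       {"nextflow_file_class": "publish", "nextflow.io/temporary": "false"},
#   )
```

With the tags described in the two bullets, I would expect this to report only `dir_file.txt`.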

Program output

N/A, execution does not fail -- the workflow completes but does not tag all output files correctly.

Environment

  • Nextflow version: version 24.10.3 build 5933
  • Java version: Temurin-17.0.10+7 (build 17.0.10+7)
  • Operating system: Linux (Amazon Linux 2023)
  • Bash version: GNU bash, version 5.2.15(1)-release (x86_64-amazon-linux-gnu)

Additional context

N/A

Footnotes

  1. In our case, mostly involving creating indexes for bioinformatic programs (e.g. Bowtie2, Kraken, BBMap).

@bentsherman
Member

I think this can be solved by #3933
