Automatically delete files marked as temp as soon as not needed anymore #452
I'm adding this for reference. I agree that intermediate file handling needs to be improved, but it will require some internal refactoring. Need to investigate. cc @joshua-d-campbell |
I've brainstormed a bit more about this issue, and it should actually be possible to remove intermediate output files without compromising the resume feature. First problem, the runtime-generated DAG: although the execution graph is only generated at runtime, it is generally fully resolved immediately after the workflow execution starts. Therefore it would be enough to defer the output deletion until after the full resolution of the execution DAG. That's just after the run invocation and before the … Second problem: how to identify tasks eligible for output removal. This could be done by intercepting a task's (successful) completion event. Infer the upstream tasks in the DAG (easy), and if ALL dependent tasks have completed successfully, then clean up the task work directory (note that each task can have more than one downstream task). Finally, the task whose outputs have been removed must be marked with a special flag, e.g. … Third, the resume process needs to be re-implemented to take this logic into consideration. Currently, when the … Using the new approach this is no longer possible because the files are deleted, therefore the execution has to be skipped up to the first successfully executed task for which the (above) … This may require introducing a new … |
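To make the second point concrete, here is a minimal Groovy sketch of the reference-counting idea described above. It is illustrative only, not Nextflow's internal API; the class and method names are invented.

```groovy
// Illustrative only -- not Nextflow internals. A task's work directory
// becomes deletable once every downstream task that consumes its outputs
// has completed successfully.
class CleanupTracker {

    // taskId -> number of downstream consumers that still need its outputs
    private final Map<String, Integer> pending = [:]
    private final Map<String, File> workDirs = [:]
    // tasks whose outputs were removed, to be flagged for the resume logic
    private final Set<String> cleaned = [] as Set

    void register(String taskId, File workDir, int downstreamCount) {
        workDirs[taskId] = workDir
        pending[taskId] = downstreamCount
    }

    // Invoked on each (successful) task completion event, with the IDs of
    // the tasks immediately upstream of the one that just finished
    void onTaskComplete(List<String> upstreamIds) {
        for( String up : upstreamIds ) {
            int left = pending[up] - 1
            pending[up] = left
            if( left == 0 ) {
                workDirs[up].deleteDir()   // clean up the task work directory
                cleaned << up              // mark with the special flag
            }
        }
    }
}
```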
My two cents: if you can use a flag for indexing the processes (e.g. the sample name), you can define a terminal process that, once all processes for that ID have completed, triggers deletion of the folders connected to that ID: remove the folders of PROCESS 1/2/3/4 for that [sampleID]. In case you need to resume the pipeline, these samples will be re-run if the data are still in the input folder. |
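A rough DSL2 sketch of that suggestion; the process names, the join wiring, and the assumption that each upstream process emits (sampleId, workDir) tuples are placeholders, not working pipeline code.

```nextflow
// Sketch only: PROCESS_1..PROCESS_4 are placeholders, each assumed to emit
// a tuple of (sampleId, its own work directory). Once all four outputs for
// a given sample ID have arrived, the terminal process removes them.
process CLEANUP_SAMPLE {
    tag "$sampleId"

    input:
    tuple val(sampleId), val(workDirs)

    script:
    """
    for d in ${workDirs.join(' ')}; do
        rm -rf "\$d"
    done
    """
}

workflow {
    // pair up the per-sample outputs of all four processes by sample ID
    done_ch = PROCESS_1.out
        .join(PROCESS_2.out)
        .join(PROCESS_3.out)
        .join(PROCESS_4.out)
        .map { id, d1, d2, d3, d4 -> tuple(id, [d1, d2, d3, d4]) }

    CLEANUP_SAMPLE(done_ch)
}
```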
Quick comment: with this feature it will be possible to keep an instance of Nextflow running (with watchPath) without running into storage problems. |
So at a high level, I think I'm missing something. If the state data remains in files, the removal of old items is a good thing to do, but will this increase filesystem IO contention and locking as we increase the scale of analysis? |
Since each Nextflow task has its own work directory, and those directories would be deleted when the data is not needed (read: accessed) any more, I don't see why there should be any IO contention on those files. Am I missing something? |
I was thinking that maybe a directive that allows the removal of input files when a process has finished would reduce the amount of space needed by a workflow. Of course this will not work if these files are needed by other processes. Maybe with the new DSL2, where you have to define the graph explicitly, this can be achieved. If the cleaning conflicts with a workflow/process, an error can be triggered. |
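To make the suggested directive concrete, here is purely hypothetical syntax; no such directive exists in Nextflow, and the `cleanupInputs` name is invented for illustration:

```nextflow
process align {
    // hypothetical directive: tell the runtime this process's inputs may be
    // deleted once this task (and every other consumer) has finished with them
    cleanupInputs true

    input:
    path reads

    output:
    path 'aligned.bam'

    script:
    """
    run_aligner ${reads} > aligned.bam   # run_aligner is a placeholder tool
    """
}
```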
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I like the ideas in this thread. Automatic removal of "intermediate process files" would be great. |
This feature would be a game changer. As an example, one of our pipelines processes ~10 GB of data but produces ~100 GB of temporary data, so the bottleneck is not CPU or memory but disk space. This severely limits the throughput of our lab and results in poor utilization of processing power. |
I'm running into this with a pipeline that has similar characteristics to what @olavurmortensen is describing: the temporary files produced by one tool are very large, so while this workflow's output is maybe a couple hundred gigs, it will need something like 7,000+ GB of disk space during execution. That said, is there any reason that temporary file cleanup isn't the purview of the cleanup config option? |
https://www.nextflow.io/docs/latest/config.html?highlight=cleanup#miscellaneous |
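For anyone landing here from search, that link refers to the cleanup config option. Note that it only runs after the workflow completes successfully, so it does not reduce peak disk usage during the run, and it makes the completed run non-resumable:

```groovy
// nextflow.config
// Delete all files in the work directory on successful completion of the run.
cleanup = true
```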
thanks a lot. |
I usually put … |
@bentsherman's solution seems very attractive (clean_work_files.sh). Is there any possibility of including it in upcoming versions? (Or how can I tweak Sarek to include it?) |
@jgarces02 for now you have to wire the dependencies yourself; see the GEMmaker pipeline script for an example. I'm currently trying to automate this behavior by specifying e.g. … |
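Until that automation lands, here is a rough sketch of the manual pattern; the names and channel wiring are illustrative, not GEMmaker's actual code. The trick is to feed each large file into a cleanup process only after every consumer has emitted a completion signal, and to replace it with an empty stub of the same name rather than deleting it outright, so a file still exists where the resume cache expects one. Depending on your cache mode you may also need to preserve the original timestamp.

```nextflow
// Illustrative sketch of the "clean work files" pattern. The input channel
// pairs each large file with a signal that every downstream consumer of
// that file has completed.
process CLEAN_WORK_FILES {
    input:
    tuple path(big_file), val(done_signal)

    script:
    """
    # Inputs are staged as symlinks, so resolve to the real file in the
    # producer's work directory, then replace it with an empty stub so the
    # cache still finds a file of that name on -resume.
    real=\$(readlink -f ${big_file})
    rm -f "\$real"
    touch "\$real"
    """
}
```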
Wonderful @bentsherman ! |
any exciting news about this feature request? :) |
This feature has been on the backburner this year due to other pressing efforts, but we're finally beginning to make some headway. I'm currently working on a PR (#3463) that will allow Nextflow to track the full task graph, which will comprise the "first half" of this feature (but it's also useful for other things like provenance). The second half will be to use the task graph to figure out when an output file can be deleted, something like: …
Still kinda fuzzy about point (3), but I think there are a number of possible ways to do it. |
I am currently working on a blog post that will hopefully be published either later this week or early next week to go over examples of implementing GEMmaker's clean_work_files.sh approach. We've implemented this in our rather large neoantigen workflow (LENS) and it appears it will save us tons of storage. |
@spvensko that's great, I agree it would be good to have a general example for people to reference in the meantime. I'd like to have such an example for the Nextflow patterns website (or wherever that content ends up in the website revamp), but I never got around to writing it myself. Looking forward to your blog post. |
Please share it when it's done, @spvensko 😄 |
Blog post is available now: https://pirl.unc.edu/blog/tricking-nextflows-caching-system-to-drastically-reduce-storage-usage I'm going to be on PTO for the rest of the year, so hopefully there aren't any major issues with it. 😅 |
Folks, it's happening: #3818. It's basically a minimal implementation of GEMmaker's "clean work files" approach directly in Nextflow. There are several caveats and limitations to consider, but even this piece should be enough to make production pipelines much more storage efficient. Testing and feedback are appreciated! Feel free to message me on Slack if you don't want to clog up this issue. |
@bentsherman just wanted to follow up, is this feature 100% complete? Was not sure since this issue is still marked as open. Thanks. |
The automatic cleanup works but the resumability still has some issues. I had to focus on other things for a while but I have picked up this effort again, hope to finish the resumability in the next few months. See #3849 for updates. If there are lots of people who don't care about the resumability piece, I could push to have the basic cleanup merged ASAP and complete the resumability in a separate effort. That would mean that for now, if you enable automatic cleanup and e.g. your pipeline fails half-way through due to some bug, you might not be able to resume because some task outputs will have been deleted. cc @pditommaso @marcodelapierre for their thoughts |
I would be in favour of getting the automatic cleanup feature in ASAP, with or without resumability 👍🏻 For quite a few people this can make the difference between being able to run a pipeline at all or not, at which point being able to resume it is purely a nicety. We should definitely aim to have the full cake, but getting in the basic cleanup quickly would be very nice. |
I agree with @ewels |
I also agree. Resuming a pipeline is important, but in some contexts you cannot even run the pipeline for lack of space. |
Agreed. Getting the footprint down to the point of feasibility, where it's currently lacking, is a worthwhile goal, even if it means sacrificing resumability in the short term. |
Disagree. The resumability of pipelines is not a feature that can be compromised. |
It's worth noting that Stephen Ficklin's (@spficklin) solution allows intermediate file deletion (either in line with the workflow or at the end, depending on how it's coded) and resumability. There are limitations and it's relatively tedious to implement manually, but it's completely possible for us to have our cake and enjoy a slice or two. |
Folks, since the automatic cleanup is not going to make it into core Nextflow until the resumability is implemented, I found a way to provide the basic cleanup functionality in a plugin: https://github.com/bentsherman/nf-boost. The README has everything you need to use it. I will also publish it to the plugins index soon. Please feel free to use it, just keep in mind that resume isn't supported yet and the cleanup itself is experimental. I haven't tested it on very large pipelines; I believe it is robust, but it might still need some performance tuning. I would love to get some testing feedback from anyone who is interested. If you run into any problems, you can submit an issue on the nf-boost repo and I'll work with you to resolve it. Any fixes / improvements we make over there will make it into the final implementation here. I'll keep working to get resume to work correctly, but since I don't know how long it will be until it's merged into Nextflow, I wanted to give you guys a stopgap solution based on what I have so far. Happy cleanup! 🧹 |
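As recalled from the nf-boost README at the time of writing (check the repo for the authoritative syntax), enabling the experimental cleanup looks roughly like this:

```groovy
// nextflow.config -- syntax as recalled from the nf-boost README; see
// https://github.com/bentsherman/nf-boost for the authoritative version.
plugins {
    id 'nf-boost'
}

boost {
    cleanup = true   // experimental: resume is not supported yet
}
```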
To reduce the footprint of larger workflows, it would be very useful if temporary files (marked as such) could be automatically deleted once they are not used anymore. Yes, this breaks reruns, but for easily recomputed files or very large ones, this makes sense. Using scratch (see #230) is not always possible/wanted (e.g. for very large files and small scratch space). It's also not always possible to delete those files as a user (except at the very end of the workflow), because multiple downstream processes running at different times might require them. This feature is, for example, implemented in Snakemake, but maybe it's easier there because the DAG is computed in advance?
Note: this is different from issue #165, where the goal was to remove non-declared files. That issue nevertheless contains a useful discussion of the topic.
Andreas