Automatically delete files marked as temp as soon as not needed anymore #452
I'm adding this for reference. I agree that intermediate file handling needs to be improved, but it will require some internal refactoring. Need to investigate. cc @joshua-d-campbell |
I've brainstormed a bit more about this issue, and it should actually be possible to remove intermediate output files without compromising the resume feature. First problem, the runtime-generated DAG: although the execution graph is only generated at runtime, it is generally fully resolved immediately after the workflow execution starts. Therefore it would be enough to defer the output deletion until after the full resolution of the execution DAG. That's just after the run invocation and before the … Second problem: how to identify tasks eligible for output removal. This could be done by intercepting a task's (successful) completion event. Infer the upstream tasks in the DAG (easy), and if ALL dependent tasks have completed successfully, then clean up the task work directory (note that each task can have more than one downstream task). Finally, the task whose outputs have been removed must be marked with a special flag, e.g. … Third, the resume process needs to be re-implemented to take this logic into consideration. Currently, when the … Using the new approach this is no longer possible because the files are deleted, therefore the execution has to be skipped up to the first successfully executed task for which the (above) … This may require introducing a new … |
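To make the second point concrete, here is a minimal Groovy sketch of the reference-counting idea described above. It is illustrative only, not Nextflow's internal API; the class and method names are invented.

```groovy
// Illustrative only -- not Nextflow internals. A task's work directory
// becomes deletable once every downstream task that consumes its outputs
// has completed successfully.
class CleanupTracker {

    // taskId -> number of downstream consumers that still need its outputs
    private final Map<String, Integer> pending = [:]
    private final Map<String, File> workDirs = [:]
    // tasks whose outputs were removed, to be flagged for the resume logic
    private final Set<String> cleaned = [] as Set

    void register(String taskId, File workDir, int downstreamCount) {
        workDirs[taskId] = workDir
        pending[taskId] = downstreamCount
    }

    // Invoked on each (successful) task completion event, with the IDs of
    // the tasks immediately upstream of the one that just finished
    void onTaskComplete(List<String> upstreamIds) {
        for( String up : upstreamIds ) {
            int left = pending[up] - 1
            pending[up] = left
            if( left == 0 ) {
                workDirs[up].deleteDir()   // clean up the task work directory
                cleaned << up              // mark with the special flag
            }
        }
    }
}
```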
My two cents: if you can use a flag for indexing the processes (e.g. the sample name), you can define a terminal process that, once all processes for that ID have completed, triggers deletion of the folders connected to that ID: remove the folders of PROCESS 1/2/3/4 for that [sampleID]. In case you need to resume the pipeline, these samples will be re-run if the data are still in the input folder. |
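A rough DSL2 sketch of that suggestion; the process names, the join wiring, and the assumption that each upstream process emits (sampleId, workDir) tuples are placeholders, not working pipeline code.

```nextflow
// Sketch only: PROCESS_1..PROCESS_4 are placeholders, each assumed to emit
// a tuple of (sampleId, its own work directory). Once all four outputs for
// a given sample ID have arrived, the terminal process removes them.
process CLEANUP_SAMPLE {
    tag "$sampleId"

    input:
    tuple val(sampleId), val(workDirs)

    script:
    """
    for d in ${workDirs.join(' ')}; do
        rm -rf "\$d"
    done
    """
}

workflow {
    // pair up the per-sample outputs of all four processes by sample ID
    done_ch = PROCESS_1.out
        .join(PROCESS_2.out)
        .join(PROCESS_3.out)
        .join(PROCESS_4.out)
        .map { id, d1, d2, d3, d4 -> tuple(id, [d1, d2, d3, d4]) }

    CLEANUP_SAMPLE(done_ch)
}
```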
Quick comment: with this feature it will be possible to keep an instance of Nextflow running (with watchPath) without running into storage problems. |
So at a high level, I think I'm missing something. If the state data remains in files, the removal of old items is a good thing to do, but will this increase filesystem IO contention and locking as we increase the scale of analysis? |
Since each Nextflow task has its own work directory, and those directories would be deleted when the data is not needed (read: accessed) any more, I don't see why there should be any IO contention on those files. Am I missing something? |
I was thinking that maybe a directive that allows the removal of input files when a process has finished would reduce the amount of space needed by a workflow. Of course this will not work if these files are needed by other processes. Maybe with the new DSL2, where you have to define the graph explicitly, this can be achieved. If the cleaning conflicts with a workflow/process, an error can be triggered. |
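To make the suggested directive concrete, here is purely hypothetical syntax; no such directive exists in Nextflow, and the `cleanupInputs` name is invented for illustration:

```nextflow
process align {
    // hypothetical directive: tell the runtime this process's inputs may be
    // deleted once this task (and every other consumer) has finished with them
    cleanupInputs true

    input:
    path reads

    output:
    path 'aligned.bam'

    script:
    """
    run_aligner ${reads} > aligned.bam   # run_aligner is a placeholder tool
    """
}
```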
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I like the ideas in this thread. Automatic removal of "intermediate process files" would be great. |
This feature would be a game changer. As an example, one of our pipelines processes ~10 GB of data but produces ~100 GB of temporary data, so the bottleneck is not CPU or memory but disk space. This severely limits the throughput of our lab and results in poor utilization of processing power. |
I'm running into this with a pipeline that has similar characteristics to what @olavurmortensen is describing: the temporary files produced by one tool are very large, so while this workflow's output is maybe a couple hundred gigs, it will need something like 7,000+ GB of disk space during execution. That said, is there any reason that temporary file cleanup isn't the purview of the cleanup config option? |
https://www.nextflow.io/docs/latest/config.html?highlight=cleanup#miscellaneous |
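For anyone landing here from search, that link refers to the cleanup config option. Note that it only runs after the workflow completes successfully, so it does not reduce peak disk usage during the run, and it makes the completed run non-resumable:

```groovy
// nextflow.config
// Delete all files in the work directory on successful completion of the run.
cleanup = true
```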
thanks a lot. |
I usually put … |
@bentsherman's solution seems very attractive (clean_work_files.sh). Is there any possibility of including it in upcoming versions? (Or how can I tweak Sarek to include it?) |
@jgarces02 for now you have to wire the dependencies yourself; see the GEMmaker pipeline script for an example. I'm currently trying to automate this behavior by specifying e.g. … |
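Until that automation lands, here is a rough sketch of the manual pattern; the names and channel wiring are illustrative, not GEMmaker's actual code. The trick is to feed each large file into a cleanup process only after every consumer has emitted a completion signal, and to replace it with an empty stub of the same name rather than deleting it outright, so a file still exists where the resume cache expects one. Depending on your cache mode you may also need to preserve the original timestamp.

```nextflow
// Illustrative sketch of the "clean work files" pattern. The input channel
// pairs each large file with a signal that every downstream consumer of
// that file has completed.
process CLEAN_WORK_FILES {
    input:
    tuple path(big_file), val(done_signal)

    script:
    """
    # Inputs are staged as symlinks, so resolve to the real file in the
    # producer's work directory, then replace it with an empty stub so the
    # cache still finds a file of that name on -resume.
    real=\$(readlink -f ${big_file})
    rm -f "\$real"
    touch "\$real"
    """
}
```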
Wonderful @bentsherman ! |
any exciting news about this feature request? :) |
This feature has been on the backburner this year due to other pressing efforts, but we're finally beginning to make some headway. I'm currently working on a PR (#3463) that will allow Nextflow to track the full task graph, which will comprise the "first half" of this feature (but it's also useful for other things like provenance). The second half will be to use the task graph to figure out when an output file can be deleted, something like: …
Still kinda fuzzy about point (3), but I think there are a number of possible ways to do it. |
I am currently working on a blog post that will hopefully be published either later this week or early next week to go over examples of implementing GEMmaker's clean_work_files.sh approach. We've implemented this in our rather large neoantigen workflow (LENS) and it appears it will save us tons of storage. |
@spvensko that's great, I agree it would be good to have a general example for people to reference in the meantime. I'd like to have such an example for the Nextflow patterns website (or wherever that content ends up in the website revamp), but I never got around to writing it myself. Looking forward to your blog post. |
Please share it when it's done, @spvensko 😄 |
Blog post is available now: https://pirl.unc.edu/blog/tricking-nextflows-caching-system-to-drastically-reduce-storage-usage I'm going to be on PTO for the rest of the year, so hopefully there aren't any major issues with it. 😅 |
Folks, it's happening: #3818. It's basically a minimal implementation of GEMmaker's "clean work files" approach directly in Nextflow. There are several caveats and limitations to consider, but even this piece should be enough to make production pipelines much more storage efficient. Testing and feedback are appreciated! Feel free to message me on Slack if you don't want to clog up this issue. |
@bentsherman just wanted to follow up, is this feature 100% complete? Was not sure since this issue is still marked as open. Thanks. |
The automatic cleanup works but the resumability still has some issues. I had to focus on other things for a while but I have picked up this effort again, hope to finish the resumability in the next few months. See #3849 for updates. If there are lots of people who don't care about the resumability piece, I could push to have the basic cleanup merged ASAP and complete the resumability in a separate effort. That would mean that for now, if you enable automatic cleanup and e.g. your pipeline fails half-way through due to some bug, you might not be able to resume because some task outputs will have been deleted. cc @pditommaso @marcodelapierre for their thoughts |
I would be in favour of getting the automatic cleanup feature in ASAP, with or without resumability 👍🏻 For quite a few people this can make the difference between being able to run a pipeline at all or not, at which point being able to resume it is purely a nicety. We should definitely aim to have the full cake, but getting in the basic cleanup quickly would be very nice. |
I agree with @ewels |
I also agree. Resuming a pipeline is important, but in some contexts you cannot even run the pipeline for lack of space. |
Agreed. Getting the footprint down to the point of feasibility, where it's currently lacking, is a worthwhile goal, even if it means sacrificing resumability in the short term. |
Disagree. The resumability of pipelines is not a feature that can be compromised. |
It's worth noting that Stephen Ficklin's (@spficklin) solution allows intermediate file deletion (either in line with the workflow or at the end, depending on how it's coded) and resumability. There are limitations and it's relatively tedious to implement manually, but it's completely possible for us to have our cake and enjoy a slice or two. |
Folks, since the automatic cleanup is not going to make it into core Nextflow until the resumability is implemented, I found a way to provide the basic cleanup functionality in a plugin: https://github.com/bentsherman/nf-boost. The README has everything you need to use it. I will also publish it to the plugins index soon. Please feel free to use it, just keep in mind that resume isn't supported yet and the cleanup itself is experimental. I haven't tested it on very large pipelines; I believe it is robust, but it might still need some performance tuning. I would love to get some testing feedback from anyone who is interested. If you run into any problems, you can submit an issue on the nf-boost repo and I'll work with you to resolve it. Any fixes / improvements we make over there will make it into the final implementation here. I'll keep working to get resume to work correctly, but since I don't know how long it will be until it's merged into Nextflow, I wanted to give you guys a stopgap solution based on what I have so far. Happy cleanup! 🧹 |
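As recalled from the nf-boost README at the time of writing (check the repo for the authoritative syntax), enabling the experimental cleanup looks roughly like this:

```groovy
// nextflow.config -- syntax as recalled from the nf-boost README; see
// https://github.com/bentsherman/nf-boost for the authoritative version.
plugins {
    id 'nf-boost'
}

boost {
    cleanup = true   // experimental: resume is not supported yet
}
```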
To reduce the footprint of larger workflows, it would be very useful if temporary files (marked as such) could be automatically deleted once they are not used anymore. Yes, this breaks reruns, but for easily recomputed files or very large ones, this makes sense. Using scratch (see #230) is not always possible/wanted (e.g. for very large files and small scratch space). It's also not always possible to delete those files as a user (except at the very end of the workflow), because multiple downstream processes running at different times might require them. This feature is, for example, implemented in Snakemake, but maybe it's easier there because the DAG is computed in advance?
Note: this is different from issue #165, where the goal was to remove non-declared files. That issue nevertheless contains a useful discussion of the topic.
Andreas