Add support for temporary output paths #3818

Closed
wants to merge 17 commits into from

bentsherman
Member

@bentsherman bentsherman commented Mar 31, 2023

Closes #452

Adds a temporary option to path outputs. See the docs and e2e test for details.

Notes:

  • (Original note, since superseded:) It is currently a coarse-grained approach. A temp file is deleted when all consuming processes of the file's originating process are finished. As a result, temp files may not be deleted as soon as possible. I will investigate more fine-grained approaches, like tracking consumers at the level of channels or tasks, but they will be more complex. Update: Temp file lifetimes are now determined by downstream tasks.
  • There is no additional validation on resumed runs. I empty the file contents and preserve the metadata, so temp files can be cached, but I do not try to verify that all downstream tasks are also cached. Again, I will investigate it, but IMO it is primarily valuable during development and not so much in production.
  • (Original note, since superseded:) Directories and remote paths (e.g. S3) aren't supported yet. I'm working on it! Update: All paths are supported now.
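
The "empty the contents, preserve the metadata" trick from the second note can be sketched roughly as follows. This is a Python illustration, not the PR's actual code. The function name `hollow_out` is hypothetical, and the sketch assumes a local POSIX-like filesystem and a cache check that looks only at a file's path, size, and last-modified time.

```python
import os

def hollow_out(path):
    """Discard a file's data while keeping its apparent size and timestamps.

    Assumption: the cache only compares path, size, and mtime, so a
    zero-filled file of the same length still looks unchanged to it.
    """
    st = os.stat(path)
    os.truncate(path, 0)           # drop the existing data
    os.truncate(path, st.st_size)  # restore the apparent size (zero-filled; sparse on most filesystems)
    # Put back the original access and modification times.
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
```

Restoring the size is the step that has no obvious equivalent on object storage, which is one reason simply deleting the file is easier to support there.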

Here is how you can test this feature using the e2e test:

$ rm -rf work ; ./launch.sh tests/temporary-outputs.nf 
N E X T F L O W  ~  version 23.03.0-edge
Launching `tests/temporary-outputs.nf` [grave_miescher] DSL2 - revision: de1ab365d0
executor >  local (9)
[ac/b7c06a] process > foo (1) [100%] 3 of 3 ✔
[4a/3bcb01] process > bar (3) [100%] 3 of 3 ✔
[77/fbd3d2] process > baz (3) [100%] 3 of 3 ✔

$ for file in `find work -type f -not -type l -name '*.txt'` ; do echo $file ; cat $file ; done
work/d6/14e2824f038602ebad8af4deb1fc16/c.txt
foo was here
bar was here
baz was here
work/d0/7441dc2e8fde0d279b48bd144f44f0/a.txt
work/77/fbd3d28c610f04701ed9a036b408c4/c.txt
foo was here
bar was here
baz was here
work/4a/3bcb018caccd9c2ca49ef67d362722/b.txt
work/3a/c0a004a645c35c7ed4baedc54e8470/b.txt
work/ac/b7c06ab493f58ef1556cc8081ecb26/a.txt
work/94/f31f20d98868f9d056dc1bd356be9a/c.txt
foo was here
bar was here
baz was here
work/65/b602b5f2bdfef95a61a163d33aa924/a.txt
work/22/aad7d4b922acadc15598484f0ffa63/b.txt

$ ./launch.sh tests/temporary-outputs.nf -resume
N E X T F L O W  ~  version 23.03.0-edge
Launching `tests/temporary-outputs.nf` [cranky_boltzmann] DSL2 - revision: de1ab365d0
[ac/b7c06a] process > foo (1) [100%] 3 of 3, cached: 3 ✔
[4a/3bcb01] process > bar (2) [100%] 3 of 3, cached: 3 ✔
[77/fbd3d2] process > baz (3) [100%] 3 of 3, cached: 3 ✔

After the first run, we inspect the work directory and find that all of the output files that were declared with temporary: true in the pipeline are now empty. On a resumed run, everything is cached. In this case, a run can be safely resumed as long as all of the baz tasks can be cached. If you modify the baz process or delete/modify any of the c.txt files, resuming the run will produce incorrect output.

@bentsherman
Member Author

Some additional thoughts:

  • I considered improving the cleaner by tracking the channel consumers rather than process consumers, but now I'm not sure that would be safe.

    Consider a process that declares two output channels A and B, A is temporary, and they capture the same files. B's files will be deleted even though it isn't declared temporary -- that is a caveat unto itself. If I track consumers at the channel level rather than process level, I only have to wait for A's consumers to finish. But really I need to wait for B's consumers too, because they're using the same files.

  • While directories aren't supported, you can probably make it work by emitting the actual list of files in the directory in your pipeline script. I should be able to make it work directly by walking the directory.

  • Remote paths aren't supported ATM because I haven't figured out how to "empty" a file through the Path API. I can delete the contents and reset the modified timestamp, but I haven't found a way to reset the file size. If someone figures out how to do it, please let me know!

    In the meantime, you can probably make it work with remote paths by using Fusion 😄
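
The two-channel caveat in the first bullet suggests keying the bookkeeping on the file rather than on the channel. Here is a minimal Python sketch of that idea, assuming consumers can be identified by name. The class `FileConsumers` and its methods are hypothetical and are not part of Nextflow.

```python
from collections import defaultdict

class FileConsumers:
    """Track pending consumers per file, regardless of which channel delivered it."""

    def __init__(self):
        self.pending = defaultdict(set)  # file path -> consumers that have not finished
        self.temporary = set()           # files captured by at least one temporary output

    def emit(self, files, consumers, temporary=False):
        """Register a channel's files together with the consumers of that channel."""
        for f in files:
            self.pending[f].update(consumers)
            if temporary:
                self.temporary.add(f)

    def finish(self, consumer):
        """Mark a consumer as finished and return the files that are now safe to delete."""
        deletable = []
        for f, waiting in self.pending.items():
            if consumer in waiting:
                waiting.discard(consumer)
                if not waiting and f in self.temporary:
                    deletable.append(f)
        return deletable
```

With channel A (temporary, consumed by `bar`) and channel B (not temporary, consumed by `baz`) both carrying `x.txt`, finishing `bar` frees nothing and only finishing `baz` releases the file. The file is still deleted even though B was not declared temporary, so the sketch reproduces that caveat rather than removing it.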

@ewels
Member

ewels commented Apr 1, 2023

This is awesome! 👏🏻 😎

Is there any need to mark outputs as temporary? If we're using publishDir, is it not true that all work dir files are temporary? I had envisaged this working a bit like cleanup = true, but just cleaning as you go along. Doing it without this statement would be preferable, as then people could opt in to it without needing to make any pipeline edits. Reading the upstream issues, I guess that this option is modelled on Snakemake - but Snakemake doesn't have work directories and publish directories, so it needs this to know which files can be discarded. I don't think that we do.

Following the cleanup = true analogy - how realistic is it to try to keep resume functionality working? For example, if Process A files are deleted, then Process B is edited and the pipeline is rerun, won't it fail in weird and unhandled ways? [edit: Just saw that you mentioned this in the docs, so yes] I just wonder if this complicates the matter a lot and we could get away with just completely removing the work directory with less hassle 😬

@mbosio85

mbosio85 commented Apr 3, 2023

@bentsherman I have a couple of questions. How would this act on files that are generated by a process but are not declared under output:? Are they affected by this PR, or do they stay in the workdir?

I understand that all those intermediate files which are not used downstream can be safely removed upon the process completion, without affecting the resume functionality.

@bentsherman
Member Author

@ewels That's a fair point, it would be nice to simply say cleanup = true and let Nextflow figure out which files are temporary. Indeed, as long as outputs aren't published via symlink, everything in the work directory can be deleted.

Hmm, it seems that while the task outputs are published before the "on task complete" / "on process terminate" events are sent (which the temp file cleaner uses to trigger cleanup), publishing is asynchronous, so we would also need an event for when publishing is complete for a given task / process.

In any case, I think I will get rid of the "empty file" trick and just delete the file, much easier to support directories and object storage that way. I don't think it's strictly needed for resumability. But I would like to know how important resumability is to the community. Paolo says he really wants it, but I suspect that in production, where the automatic cleanup is most useful, being able to resume is not as important because you aren't fixing bugs, etc.

If the task cache has all the necessary information, and if we can distinguish between a task that was deleted vs a task that was modified, then the resume should be able to skip tasks that were deleted (but not otherwise modified) as long as downstream tasks are also cached. But I think we could go ahead and ship a basic automatic cleanup feature, then try to add resumability in a separate PR.

@ewels
Member

ewels commented Apr 3, 2023

Yup, agree on all points 👍🏻

@bentsherman
Member Author

@mbosio85 If an output file isn't captured by an output: declaration, it can safely be deleted when that task completes. This happens by default when using process.scratch=true, because the file is never copied back to the shared work directory. If someone isn't using scratch, they can explicitly delete the file in the process script to free up space in their work directory.

@bentsherman
Member Author

I figured out how to track the downstream tasks of a temporary output -- when a process "closes" (i.e. all tasks have been created), we can inspect which tasks use a temporary file and be certain that we found all of them. So now each temp file can be deleted much sooner.
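
A rough Python sketch of this bookkeeping, under the assumption that the consuming processes of each temp file are known up front from the workflow graph. The class `TempFileTracker` and its event-handler names are hypothetical and do not mirror Nextflow's internal API.

```python
class TempFileTracker:
    """Delete a temp file once every consuming process has closed
    and every task that reads the file has completed."""

    def __init__(self):
        self.tasks = {}           # temp file -> ids of created tasks still using it
        self.open_processes = {}  # temp file -> consuming processes that may still create tasks

    def add_file(self, path, consuming_processes):
        self.tasks[path] = set()
        self.open_processes[path] = set(consuming_processes)

    def task_created(self, task_id, inputs):
        for path in inputs:
            if path in self.tasks:
                self.tasks[path].add(task_id)

    def process_closed(self, process):
        # All tasks of this process now exist, so its consumer list is complete.
        for procs in self.open_processes.values():
            procs.discard(process)
        return self._sweep()

    def task_completed(self, task_id):
        for ids in self.tasks.values():
            ids.discard(task_id)
        return self._sweep()

    def _sweep(self):
        """Return (and forget) files with no open processes and no pending tasks."""
        ready = [p for p in self.tasks
                 if not self.open_processes[p] and not self.tasks[p]]
        for p in ready:
            del self.tasks[p]
            del self.open_processes[p]
        return ready
```

Both handlers run the same check because a task can finish before its process closes. In that case the file has to wait for the close event, since another task that reads it might still be created.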

On top of that, we can save the list of downstream tasks for each task and use it during the resume. To do this, we have to compute separately the task "inputs" hash and task "outputs" hash. The inputs/script/config of a task must be cached no matter what, but if any outputs are missing, we can traverse the list of downstream tasks and see if they are cached. We can traverse the entire task dependency graph, as long as we end up at leaf nodes that are cached.
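
The resume rule described here can be sketched as a recursive check over the saved task graph. This is a Python illustration under stated assumptions: the dictionary layout and the field names `inputs_cached`, `outputs_present`, and `downstream` are hypothetical and are not the actual cache schema.

```python
def can_skip(name, tasks):
    """Decide whether a task can be treated as cached on resume.

    tasks maps a task name to a dict with:
      inputs_cached   - the inputs/script/config hash matches the cache
      outputs_present - the task's output files still exist
      downstream      - names of the tasks that consume its outputs
    """
    task = tasks[name]
    if not task["inputs_cached"]:
        return False  # the task itself changed, so it must run again
    if task["outputs_present"]:
        return True   # ordinary cache hit
    # Outputs were deleted: skipping is only safe if every consumer can be
    # skipped too, which eventually requires cached leaf tasks with outputs.
    downstream = task["downstream"]
    return bool(downstream) and all(can_skip(d, tasks) for d in downstream)

tasks = {
    "foo": {"inputs_cached": True, "outputs_present": False, "downstream": ["bar"]},
    "bar": {"inputs_cached": True, "outputs_present": False, "downstream": ["baz"]},
    "baz": {"inputs_cached": True, "outputs_present": True, "downstream": []},
}
```

With this graph, `can_skip("foo", tasks)` is true because the chain ends at a cached `baz` whose outputs still exist. Setting `inputs_cached` to false for `baz`, as if that process had been edited, makes the check fail for `foo` and `bar` as well.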

Basically, we want the .nextflow cache to store all the key information encoded in the work directory, so that you could delete entire task directories and still recover both those tasks and their downstream tasks from the cache.

These ideas should all apply to the global cleanup option as well, including resumability. But with the global cleanup we also have to wait for files to be published. So I'm going to explore the resumability in this PR for now, and then I will translate it to the "eager" cleanup PR. The end goal is to have the global cleanup with resumability, then I think we'll be good to go.

@bentsherman bentsherman changed the base branch from ben-task-graph to master April 28, 2023 19:16
@bentsherman bentsherman force-pushed the 452-temporary-outputs branch from cf5f8df to 9760f4f Compare April 28, 2023 19:26
@bentsherman bentsherman force-pushed the 452-temporary-outputs branch from 9760f4f to 9637e34 Compare April 28, 2023 20:15
@fgualdr

fgualdr commented May 10, 2023

This feature will be awesome!
We are dealing with this issue in production, where space fills up with intermediate files.
When will it be released ... we are all thrilled ... and desperate... :-)

@bentsherman
Member Author

The "eager" cleanup PR now has the same capabilities as this one. In particular, it can eagerly delete individual output files in addition to task directories. This piece was important because output files can often be deleted sooner than task directories. I was going to implement resumability here first and then port it to the other PR, but now we can cut out the middle man 😄

On to resumability...

Closing in favor of #3849.

Development

Successfully merging this pull request may close these issues.

Automatically delete files marked as temp as soon as not needed anymore
4 participants