As a matter of fact I have thought about this, across multiple different efforts:
These PRs have been lying dormant only because I've been focused on the language server, and I released nf-boost as a stopgap in the meantime. I reached basically the same conclusions as you. If we save the hash of each output file, it should be possible to delete intermediate files, or not re-publish a file, or not re-fetch a previous output, etc., as long as the intermediate hashes allow us to recover the downstream outputs. This has all the same challenges you described. And I do suspect we'll end up with some hybrid solution that is simple and slow by default but becomes much faster if you use Fusion. Stay tuned 😉
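A minimal sketch of that idea, purely as an illustration (this is not Nextflow's actual cache implementation, and the helper names are hypothetical): record a content hash for each output file when a task completes, then validate downstream cache entries against those hashes instead of re-running, or even keeping, the intermediate files.

```groovy
// Sketch only -- not existing Nextflow internals. Record a content hash per
// output file, so a later run can check whether downstream tasks are still
// valid without re-executing (or retaining) the producer.
import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

String hashFile(Path path) {
    def md = MessageDigest.getInstance('SHA-256')
    Files.newInputStream(path).withCloseable { ins ->
        byte[] buf = new byte[1 << 16]
        int n = ins.read(buf)
        while (n > 0) {
            md.update(buf, 0, n)
            n = ins.read(buf)
        }
    }
    return md.digest().encodeHex().toString()
}

// If every recorded hash still matches the file currently on disk/in the bucket,
// the downstream task's inputs are unchanged and its cached result can be reused.
boolean downstreamStillValid(Map<Path, String> recordedHashes) {
    recordedHashes.every { path, oldHash -> hashFile(path) == oldHash }
}
```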
New feature
I've been thinking about caching and optimization. One interesting challenge is reusing cached results even when earlier steps in the pipeline must re-run.
For pipelines that run repeatedly (e.g. in CI), this could yield enormous savings. Many changes modify the logic of a process while leaving its outputs unchanged, and in complex pipelines it is not necessarily the case that changing an early process invalidates all later processes. I'm not suggesting nextflow magically guess whether to re-run a process, but rather that it rely on deterministic properties like file hashes.
I played around with nextflow's `deep` caching feature to achieve this, and it is possible. But I believe it is quite suboptimal in a cloud architecture. I'm going to make some assumptions here, forgive me if they are incorrect. Consider this pipeline:
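Something like the following minimal two-process sketch, where `proc_b` consumes `proc_a`'s output (the commands are placeholders; only the dependency structure matters here):

```groovy
// Hypothetical two-process pipeline: proc_b consumes proc_a's output.
process proc_a {
    output:
    path 'a.txt'

    script:
    """
    echo 'made by proc_a' > a.txt
    """
}

process proc_b {
    input:
    path a

    output:
    path 'b.txt'

    script:
    """
    cat $a > b.txt
    """
}

workflow {
    proc_b(proc_a())
}
```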
Suppose nextflow is running with a cloud bucket-dir and cloud cache:

- `proc_a` runs remotely, and stages out its outputs
- `proc_b` runs remotely, stages in the prior process' outputs, does work, stages out its own outputs

If I change `proc_a` and re-run, the only way to know whether the `proc_b` inputs have changed is by actually reading and hashing the `proc_a` outputs:

- `cache: true` isn't a great fit here, because the timestamps on `proc_a` outputs will surely change
- `cache: lenient` isn't a great fit here, because `proc_a` may have outputted files with the same size but different contents. This feels supremely dangerous.
- `cache: deep` isn't a great fit, because now nextflow needs to pull the `proc_a` output (which could be very large!) to see if it has changed. Imagine pulling a 10gb file cross-region just to check if the cache is valid!

How can you get deep caching semantics without actually pulling the file and hashing it?
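For reference, a rough sketch of how these strategies are selected per process in `nextflow.config` (the selector name refers to the hypothetical process above):

```groovy
// nextflow.config -- choosing a cache strategy per process
process {
    withName: 'proc_a' {
        // cache = true        // default: full path + size + last-modified timestamp
        // cache = 'lenient'   // path + size only (ignores timestamps)
        cache = 'deep'         // hash file contents -- the expensive option discussed above
    }
}
```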
I'm curious if you've given thought to this problem.
While leveraging cloud-specific properties like the etag is not cloud-agnostic, it's semi-ubiquitous and I think it strikes a good balance between ubiquity, utility, and cost. One could imagine deep caching using the etag when it's available and falling back to a proper file hash when it isn't. Or that could be a configurable option. Or an entirely new caching strategy (hopefully not!).
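To make the idea concrete, here's a rough sketch of what an etag-based check could look like on S3. This is an assumption about how such a strategy might work, not an existing Nextflow option, and the helper is hypothetical; the point is that a stored etag can be compared against a HEAD request, so the object never has to be downloaded:

```groovy
// Hypothetical sketch: validate a cached output against its recorded S3 ETag
// using a HEAD request, instead of downloading and hashing the object.
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.HeadObjectRequest

boolean etagUnchanged(S3Client s3, String bucket, String key, String recordedEtag) {
    def head = s3.headObject(HeadObjectRequest.builder()
                                              .bucket(bucket)
                                              .key(key)
                                              .build())
    // HEAD returns only metadata -- no 10gb cross-region download just to decide
    // whether the cache entry is still valid. Note that ETags are only comparable
    // to previously recorded ETags (for multipart uploads they are not an MD5 of
    // the contents), so a proper file hash remains the fallback.
    return head.eTag() == recordedEtag
}
```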
Anyway, just wanted to float the idea.