As a matter of fact I have thought about this, across multiple different efforts:
These PRs have been lying dormant only because I've been focused on the language server, and I released nf-boost as a stopgap in the meantime. I reached basically the same conclusions as you. If we save the hash of each output file, it should be possible to delete intermediate files, or not re-publish a file, or not re-fetch a previous output, etc., as long as the intermediate hashes allow us to recover the downstream outputs. This has all the same challenges you described. And I do suspect we'll end up with some hybrid solution that is simple and slow by default but becomes much faster if you use Fusion. Stay tuned 😉
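A minimal sketch of that idea, purely as an illustration (this is not Nextflow's actual cache implementation, and the helper names are hypothetical): record a content hash for each output file when a task completes, then validate downstream cache entries against those hashes instead of re-running, or even keeping, the intermediate files.

```groovy
// Sketch only -- not existing Nextflow internals. Record a content hash per
// output file, so a later run can check whether downstream tasks are still
// valid without re-executing (or retaining) the producer.
import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

String hashFile(Path path) {
    def md = MessageDigest.getInstance('SHA-256')
    Files.newInputStream(path).withCloseable { ins ->
        byte[] buf = new byte[1 << 16]
        int n = ins.read(buf)
        while (n > 0) {
            md.update(buf, 0, n)
            n = ins.read(buf)
        }
    }
    return md.digest().encodeHex().toString()
}

// If every recorded hash still matches the file currently on disk/in the bucket,
// the downstream task's inputs are unchanged and its cached result can be reused.
boolean downstreamStillValid(Map<Path, String> recordedHashes) {
    recordedHashes.every { path, oldHash -> hashFile(path) == oldHash }
}
```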
New feature
I've been thinking about caching and optimization. One interesting challenge is reusing cached results even when earlier steps in the pipeline must re-run.
For pipelines that run repeatedly (e.g. in CI), this could yield enormous savings. Many changes modify the logic of a process while leaving its outputs unchanged, and in complex pipelines it is not necessarily the case that changing an early process invalidates all later processes. I'm not suggesting nextflow magically guess whether to re-run a process, but rather that it rely on deterministic properties like file hashes.
I played around with nextflow's `deep` caching feature to achieve this, and it is possible. But I believe it is quite suboptimal in a cloud architecture. I'm going to make some assumptions here, forgive me if they are incorrect. Consider this pipeline:
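Something like the following minimal two-process sketch, where `proc_b` consumes `proc_a`'s output (the commands are placeholders; only the dependency structure matters here):

```groovy
// Hypothetical two-process pipeline: proc_b consumes proc_a's output.
process proc_a {
    output:
    path 'a.txt'

    script:
    """
    echo 'made by proc_a' > a.txt
    """
}

process proc_b {
    input:
    path a

    output:
    path 'b.txt'

    script:
    """
    cat $a > b.txt
    """
}

workflow {
    proc_b(proc_a())
}
```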
Suppose nextflow is running with a cloud bucket-dir and cloud cache:

- `proc_a` runs remotely, and stages out its outputs
- `proc_b` runs remotely, stages in the prior process' outputs, does work, stages out its own outputs

If I change `proc_a` and re-run, the only way to know whether the `proc_b` inputs have changed is by actually reading and hashing the `proc_a` outputs:

- `cache: true` isn't a great fit here, because the timestamps on `proc_a` outputs will surely change
- `cache: lenient` isn't a great fit here, because `proc_a` may have outputted files with the same size but different contents. This feels supremely dangerous.
- `cache: deep` isn't a great fit, because now nextflow needs to pull the `proc_a` output (which could be very large!) to see if it has changed. Imagine pulling a 10gb file cross-region just to check if the cache is valid!

How can you get deep caching semantics without actually pulling the file and hashing it?
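For reference, a rough sketch of how these strategies are selected per process in `nextflow.config` (the selector name refers to the hypothetical process above):

```groovy
// nextflow.config -- choosing a cache strategy per process
process {
    withName: 'proc_a' {
        // cache = true        // default: full path + size + last-modified timestamp
        // cache = 'lenient'   // path + size only (ignores timestamps)
        cache = 'deep'         // hash file contents -- the expensive option discussed above
    }
}
```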
I'm curious if you've given thought to this problem.
While leveraging cloud-specific properties like the etag is not cloud-agnostic, it's semi-ubiquitous and I think it strikes a good balance between ubiquity, utility, and cost. One could imagine deep caching using the etag when it's available and falling back to a proper file hash when it isn't. Or that could be a configurable option. Or an entirely new caching strategy (hopefully not!).
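To make the idea concrete, here's a rough sketch of what an etag-based check could look like on S3. This is an assumption about how such a strategy might work, not an existing Nextflow option, and the helper is hypothetical; the point is that a stored etag can be compared against a HEAD request, so the object never has to be downloaded:

```groovy
// Hypothetical sketch: validate a cached output against its recorded S3 ETag
// using a HEAD request, instead of downloading and hashing the object.
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.HeadObjectRequest

boolean etagUnchanged(S3Client s3, String bucket, String key, String recordedEtag) {
    def head = s3.headObject(HeadObjectRequest.builder()
                                              .bucket(bucket)
                                              .key(key)
                                              .build())
    // HEAD returns only metadata -- no 10gb cross-region download just to decide
    // whether the cache entry is still valid. Note that ETags are only comparable
    // to previously recorded ETags (for multipart uploads they are not an MD5 of
    // the contents), so a proper file hash remains the fallback.
    return head.eTag() == recordedEtag
}
```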
Anyway, just wanted to float the idea.