publishDir checksum method #2844
Comments
Maybe allowing both sha256sum and md5sum should be something to consider.
Could be a flag 👍🏻
Could tie this into a plug-in that outputs a JSON of all publishDir files.
Would be nice for testing!
My understanding is you want an option to create a checksum file for each published file. For what it's worth, S3 / Google Storage / Azure Blob all automatically compute the MD5 hash of each object when it is uploaded and save it to the object metadata. S3 provides it via the "ETag"; I can't recall how the other two provide it.
Correct 👍🏻 Could also be a single file with all md5s for all files. Or a JSON file together with other provenance info as suggested above. Not super fussed where the info is, as long as it's calculated and saved. Is the blob storage checksum calculated pre- or post-upload? For my paranoid brain it really needs to be computed locally, next to where the file was generated, to be fully useful. Also, lots of people don't use those platforms. My personal use case for this back in the day was HPC, where we had a custom script to do this post-run (in some cases, but it would have been nice for all).
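For illustration, the "single file with all md5s" idea could look roughly like the post-run script below. This is only a sketch, not Nextflow functionality: the function names `build_manifest` and `write_manifest`, the JSON layout, and the `algorithm` parameter are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(publish_dir: Path, algorithm: str = "md5") -> dict:
    """Map each file path (relative to publish_dir) to its hex digest."""
    files = {}
    for path in sorted(p for p in publish_dir.rglob("*") if p.is_file()):
        hasher = hashlib.new(algorithm)
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large outputs are not loaded into memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
        files[path.relative_to(publish_dir).as_posix()] = hasher.hexdigest()
    # Hypothetical layout: record the algorithm alongside the digests.
    return {"algorithm": algorithm, "files": files}


def write_manifest(publish_dir: Path, target: Path, algorithm: str = "md5") -> None:
    """Compute the manifest locally and save it as JSON at target."""
    manifest = build_manifest(publish_dir, algorithm)
    target.write_text(json.dumps(manifest, indent=2) + "\n")
```

Writing `target` outside `publish_dir` keeps the manifest from being hashed on a later run. Passing `algorithm="sha256"` covers the sha256sum case raised earlier in the thread.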
In that case, it should be covered by #3802 or #3849, whichever one gets merged first. Both of them depend on saving task input/output metadata to the cache, including the MD5 hash. I think the cloud providers always compute the checksum server-side, and you can also provide a client-side checksum with the upload and they will verify it for you. Not sure if Nextflow does this currently, but I will make sure it does as part of these upcoming features around task provenance.
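As a rough illustration of the client-side checksum mentioned above: for S3, the value is sent in the Content-MD5 header, which carries the base64 encoding of the raw MD5 digest rather than the hex string. The helper below (`content_md5` is a made-up name) only computes that value locally. It does not perform an upload and says nothing about what Nextflow currently sends.

```python
import base64
import hashlib
from pathlib import Path


def content_md5(path: Path) -> str:
    """Return the base64-encoded MD5 digest of a file (Content-MD5 encoding)."""
    hasher = hashlib.md5()
    with path.open("rb") as handle:
        # Stream the file in 1 MiB chunks to keep memory use flat.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    # The header uses the 16 raw digest bytes, base64-encoded, not hexdigest().
    return base64.b64encode(hasher.digest()).decode("ascii")
```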
Sounds great! Happy for this issue to be closed in that case, unless it's helpful to keep open as a reminder.
New feature
Hi
I'd like to propose a new option for the publishDir directive.
It would be useful to have a checksum option which generates a checksum for each file that publishDir writes out. This could be a boolean, or perhaps a closure to further filter which files get checksummed.
Usage scenario
Checksums are useful when data needs to be verified after transfer or archiving. Having this built in would streamline the process.
I'm aware this could easily be done with a module, but it seems like it would fit here too.
Suggested implementation
Add an option to publishDir to generate file.txt.md5 for file.txt.
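Until something like this exists in Nextflow, the requested behaviour can be approximated after a run with a small script. The sketch below is not the proposed implementation: `file_digest`, `write_sidecars`, and the `keep` predicate (standing in for the closure filter) are hypothetical names, and the `algorithm` parameter stands in for the md5/sha256 flag discussed in the comments.

```python
import hashlib
from pathlib import Path
from typing import Callable, List, Optional


def file_digest(path: Path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def write_sidecars(publish_dir: Path, algorithm: str = "md5",
                   keep: Optional[Callable[[Path], bool]] = None) -> List[Path]:
    """Write <name>.<algorithm> next to each selected file under publish_dir."""
    written = []
    # sorted() snapshots the listing, so new sidecars are not picked up mid-loop.
    for path in sorted(p for p in publish_dir.rglob("*") if p.is_file()):
        if path.suffix == f".{algorithm}":
            continue  # skip sidecars left over from an earlier run
        if keep is not None and not keep(path):
            continue  # the predicate plays the role of the proposed closure
        sidecar = path.with_name(f"{path.name}.{algorithm}")
        # "<digest>  <filename>" is the layout md5sum/sha256sum -c can read.
        sidecar.write_text(f"{file_digest(path, algorithm)}  {path.name}\n")
        written.append(sidecar)
    return written
```

For example, `write_sidecars(Path("results"), keep=lambda p: p.suffix == ".txt")` would produce `file.txt.md5` next to `file.txt` and leave other files alone, where `results` is a placeholder for the publish directory.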