publishDir checksum method #2844

Open
matthdsm opened this issue May 3, 2022 · 12 comments

@matthdsm
Contributor

matthdsm commented May 3, 2022

New feature

Hi

I'd like to propose a new directive for publishDir.
It would be useful to have a checksum directive which generates a checksum for the files it writes out. This can be a boolean or perhaps a closure to further filter which files get checksummed.

Usage scenario

Checksums are useful in a scenario where data needs to be verified for transfer or archiving. Having a built-in would streamline this process.

I'm aware this could be easily done with a module, but it seems like this would be in place here too.

Suggest implementation

Add directive to publishDir to generate file.txt.md5 for file.txt
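
For illustration only, the proposed behaviour (a `file.txt.md5` written next to `file.txt`) can be sketched outside Nextflow in a few lines of Python. The helper name `write_checksum_sidecar` and its `algorithm` parameter are hypothetical and not part of Nextflow; this is a sketch of the idea, not a proposed implementation.

```python
import hashlib
from pathlib import Path


def write_checksum_sidecar(path, algorithm="md5", chunk_size=1 << 20):
    """Write `<name>.<algorithm>` next to `path` and return the sidecar path.

    Hypothetical helper for illustration; not part of Nextflow.
    """
    path = Path(path)
    digest = hashlib.new(algorithm)
    # Read in chunks so large published files are never loaded fully into memory.
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    sidecar = path.with_name(f"{path.name}.{algorithm}")
    # "<hex digest>  <file name>" is the line layout that `md5sum -c` accepts.
    sidecar.write_text(f"{digest.hexdigest()}  {path.name}\n")
    return sidecar
```

Calling `write_checksum_sidecar("results/file.txt")` would create `results/file.txt.md5`, which could then be checked from inside `results/` with `md5sum -c file.txt.md5`.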

@apeltzer
Contributor

apeltzer commented May 3, 2022

Maybe allowing both sha256sum and md5sum should be something to consider.

@matthdsm
Contributor Author

matthdsm commented May 3, 2022

Could be a flag 👍🏻

@ewels
Member

ewels commented May 3, 2022

x-ref #2491 which is relevant but different (here we are talking about generating checksums for outputs, that issue is about checking checksums for inputs).

#2676 is probably a duplicate of this issue, though it's a little unclear what the OP wanted there.

@ewels
Member

ewels commented Oct 1, 2022

Could tie this into a plug-in that outputs a JSON of all publishDir files

@matthdsm
Contributor Author

matthdsm commented Oct 1, 2022 via email

@ewels ewels added the pinned label Mar 18, 2023
@pditommaso pditommaso added this to the 23.10.0 milestone Mar 18, 2023
@bentsherman
Member

My understanding is you want an option to create a *.md5 alongside each file that is published?

For what it's worth, S3 / Google Storage / Azure Blob all automatically compute the MD5 hash of each object when it is uploaded, and it is saved to the object metadata. S3 provides it via the "ETag", can't recall how the other two provide it.

@ewels
Member

ewels commented Aug 11, 2023

Correct 👍🏻 Could also be a single file with all md5s for all files. Or a JSON file together with other provenance info as suggested above. Not super fussed where the info is, as long as it's calculated and saved.

Is the blob storage checksum calculated pre- or post-upload? For my paranoid brain it really needs to be computed locally, next to where it was generated, to be fully useful. Also, lots of people don't use those platforms.

My personal use case for this back in the day was HPC, where we had a custom script to do this post-run (in some cases, but would have been nice for all).
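
The single-file variant mentioned above (one file holding the checksums of all published files, computed locally) could look roughly like the sketch below. The manifest name `checksums.json`, its JSON layout, and the function names are assumptions made for this example only; they are not an existing Nextflow output format.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of one file, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(publish_dir, manifest_name="checksums.json", algorithm="md5"):
    """Write one JSON manifest covering every file under `publish_dir`.

    Hypothetical post-run helper; the layout is illustrative only.
    """
    publish_dir = Path(publish_dir)
    manifest = publish_dir / manifest_name
    entries = {}
    for path in sorted(publish_dir.rglob("*")):
        # Skip directories and a manifest left over from an earlier run.
        if path.is_file() and path != manifest:
            entries[path.relative_to(publish_dir).as_posix()] = file_digest(path, algorithm)
    manifest.write_text(
        json.dumps({"algorithm": algorithm, "files": entries}, indent=2) + "\n"
    )
    return manifest
```

Run after the pipeline finishes, on the machine that produced the files, this would cover the HPC case without relying on any storage platform.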

@bentsherman
Member

In that case, it should be covered by #3802 or #3849, whichever one gets merged first. Both of them depend on saving task inputs/outputs metadata to the cache, including the MD5 hash.

I think the cloud providers always compute the checksum server-side, and you can also provide the client-side checksum in the upload and they will verify it for you. Not sure if Nextflow does this currently, but I will make sure it will as part of these upcoming features around task provenance.
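
As a rough sketch of the client-side half of that exchange: the value sent with an upload is conventionally the base64 encoding of the raw MD5 digest, which is what S3 expects in its `Content-MD5` request header. The function name `content_md5` is made up for this example, and how (or whether) Nextflow would pass the value along is not something this thread settles.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the base64-encoded MD5 digest of `data`.

    This is the Content-MD5 encoding (base64 of the 16 raw digest bytes),
    not the hex string that tools like md5sum print.
    """
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
```

If the server-side hash does not match the supplied value, the upload is rejected, which gives the local-first guarantee asked for earlier in the thread. Note that for multipart uploads the S3 ETag is not a plain MD5 of the whole object, so it cannot be compared against a locally computed MD5 in that case.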

@ewels
Member

ewels commented Aug 13, 2023

Sounds great! Happy for this issue to be closed in that case, unless it's helpful to keep open as a reminder.
