Document remote file staging #5523

Merged
10 commits merged on Dec 19, 2024

Conversation

christopher-hakkaart
Contributor

To address #5493

@bentsherman - I'll need some help regarding:

  • Details of hash generation.
  • Details of caching behavior.

I've commented on where I would add these.

netlify bot commented Nov 19, 2024

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit bac899e
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/6764317548f03e0008facbb7
😎 Deploy Preview https://deploy-preview-5523--nextflow-docs-staging.netlify.app

@christopher-hakkaart
Contributor Author

@bentsherman - can you point me in the direction of the code for hash generation and caching behavior so I can finish this PR?

@bentsherman
Member

Those sections will be tricky to write. I will try to finish them this week

@christopher-hakkaart
Contributor Author

All good! Thanks!

bentsherman linked an issue on Dec 12, 2024 that may be closed by this pull request
bentsherman changed the title from "Add file download description" to "Document remote file staging" on Dec 12, 2024
Signed-off-by: Ben Sherman <[email protected]>
bentsherman marked this pull request as ready for review on December 12, 2024 16:33
bentsherman requested a review from a team as a code owner on December 12, 2024 16:33
Member

@bentsherman bentsherman left a comment

@christopher-hakkaart it's your PR but let me know if the changes look good

Signed-off-by: Ben Sherman <[email protected]>
@bentsherman
Member

@pditommaso, could you look over this for technical correctness? We are documenting remote file staging. It is also a nice place to highlight the need for Fusion.

Signed-off-by: Christopher Hakkaart <[email protected]>
@christopher-hakkaart
Contributor Author

Thanks @bentsherman

docs/working-with-files.md Outdated Show resolved Hide resolved
christopher-hakkaart and others added 4 commits December 13, 2024 13:51
Co-authored-by: Ben Sherman <[email protected]>
Signed-off-by: Chris Hakkaart <[email protected]>
Co-authored-by: Ben Sherman <[email protected]>
Signed-off-by: Chris Hakkaart <[email protected]>
Co-authored-by: Ben Sherman <[email protected]>
Signed-off-by: Chris Hakkaart <[email protected]>
Co-authored-by: Ben Sherman <[email protected]>
Signed-off-by: Chris Hakkaart <[email protected]>
Comment on lines 260 to 262
When a remote file is passed as an input to a process, Nextflow stages the file into the work directory using an appropriate Java SDK.

Remote files are staged in a subdirectory of the work directory with the form `stage-<session-id>/<hash>/<filename>`, where `<hash>` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.
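
As a rough illustration of the layout and reuse behavior described in this snippet, here is a minimal Python sketch. It is not Nextflow's implementation: the hash function (MD5 of the remote path), the `stage_path` function, and the `StagingArea` class are all assumptions made up for this example, since the thread does not say how `<hash>` is actually computed.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def stage_path(work_dir, session_id, remote_url):
    """Build a path of the form <work_dir>/stage-<session-id>/<hash>/<filename>."""
    # ASSUMPTION: MD5 of the remote path is a stand-in; the hash Nextflow
    # actually uses is not specified in this thread.
    digest = hashlib.md5(remote_url.encode("utf-8")).hexdigest()
    filename = posixpath.basename(urlparse(remote_url).path)
    return posixpath.join(work_dir, f"stage-{session_id}", digest, filename)


class StagingArea:
    """Hypothetical helper that copies each remote file at most once per session."""

    def __init__(self, work_dir, session_id, fetch):
        self.work_dir = work_dir
        self.session_id = session_id
        self.fetch = fetch      # callable(remote_url, target_path) that does the copy
        self._staged = set()    # target paths already copied in this session

    def stage(self, remote_url):
        target = stage_path(self.work_dir, self.session_id, remote_url)
        if target not in self._staged:
            self.fetch(remote_url, target)
            self._staged.add(target)
        return target


# Two tasks asking for the same remote file trigger a single copy.
downloads = []
area = StagingArea("work", "abc123", lambda url, target: downloads.append(url))
first = area.stage("https://example.com/data/sample.fastq")
second = area.stage("https://example.com/data/sample.fastq")
other = area.stage("https://example.com/data/other.fastq")
```

Because the target path depends only on the session ID and the remote path, a resumed run with the same session ID would compute the same location, which is what makes reuse possible in this sketch.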
Member

It would be better to clarify that this kind of file transfer happens every time the origin or the destination file system is different from the workflow work directory.

For example, if the input file is on the local computer or is an HTTP remote file AND the pipeline uses an S3 bucket as the work dir, then Nextflow needs to copy it into S3. The same logic applies when it needs to copy the output files.

For the same reason, it's important to advise users to keep the inputs and outputs in the same storage system, e.g. S3 or a shared file system.

Minor: it would be preferable to use "copy" or "download" for input files and "upload" for output files instead of "stage", which is too much of a slang tech term.
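
The rule in this comment (a copy is needed whenever the file lives on a different file system than the work directory) can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: file systems are told apart by URI scheme alone, with plain paths treated as local. The helper names `storage_scheme` and `needs_copy` are hypothetical and do not come from Nextflow.

```python
from urllib.parse import urlparse


def storage_scheme(location):
    """Return the URI scheme of a location, treating plain paths as 'file'."""
    # ASSUMPTION: the scheme alone identifies the storage system. Real setups
    # may need finer checks (for example, different buckets or mount points).
    return urlparse(location).scheme or "file"


def needs_copy(file_location, work_dir):
    """True when the file and the work directory are on different storage systems."""
    return storage_scheme(file_location) != storage_scheme(work_dir)


# Local or HTTP inputs with an S3 work dir must be copied into S3.
local_to_s3 = needs_copy("/home/user/reads.fastq", "s3://my-bucket/work")
http_to_s3 = needs_copy("https://example.com/reads.fastq", "s3://my-bucket/work")
# Inputs already in the same storage system as the work dir need no copy.
s3_to_s3 = needs_copy("s3://my-bucket/inputs/reads.fastq", "s3://my-bucket/work")
local_to_local = needs_copy("/data/reads.fastq", "/scratch/work")
```

Under this sketch, keeping inputs, outputs, and the work directory in one storage system makes `needs_copy` return `False` everywhere, which is the motivation for the advice above.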

Member

I still prefer to call it "remote file staging" in summary because it is concise. But I will explain it as copying

Member

I have no problem with the staging definition; however, the part to improve is the definition of a "remote" file. A local file is considered remote if the work dir is, for example, on S3.

bentsherman merged commit f935cbf into nextflow-io:master on Dec 19, 2024
11 checks passed
Successfully merging this pull request may close these issues.

Docs request: Fetching remote files
3 participants