-
Notifications
You must be signed in to change notification settings - Fork 642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Walk file tree on directory publish #3933
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Ben Sherman <[email protected]>
81f7cb7
to
8a43489
Compare
Signed-off-by: Ben Sherman <[email protected]>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
Signed-off-by: Ben Sherman <[email protected]>
This PR is ready for review. The publish dir now publishes each file in a directory instead of just the directory, which should improve detection of Reports in Seqera Platform and fix some issues with Fusion symlinks (#4725). It does not yet solve #3372 , for that we also need to verify the checksum of published files when deciding whether a file needs to be re-published. I can implement that here or in a separate PR, up to you @pditommaso . |
Looks good, thanks Ben! In the case where publish method is symlinks and we are emitting/publishing a directory, what does the published output look like? Is it a symlinked directory full of symlinks, or a regular directory full of symlinks, or something else entirely? |
It will be a regular directory of symlinks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we should do this. On some storage traversing a directory path is extremely expensive e.g. Lustre, NFS, etc.
I share your concern, although copying a directory also involves walking the directory tree at some level, so may it's not an issue. Either way, there are other reasons we should do this. In general, I think it is a bad practice to define a directory output. When I see a process with a directory output, I know nothing about the file structure within that directory. So I think we should either not support directory outputs or support them correctly. Currently we just support it incorrectly:
The performance implications you mention only happen if the user decides to use directory outputs.. I think we could simply document this point and encourage users to use flat files as much as possible. Consider also that we already walk the work directory to collect the output files: nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy Line 1654 in 279c410
which has similar performance issues when there are nested directories |
Indeed, when we publish a directory we are already traversing it:
This PR is just doing the traversal earlier |
@robsyme what do you think about what I said about publishing a symlink directory? It is certainly a change in behavior, but I'm wondering if it's more desirable to symlink all of the files rather than only symlink the directory. If not, I can modify the logic to only traverse the directory when the publish mode isn't symlink, but I feel like that would make the behavior inconsistent based on the publish mode. |
Thanks for moving this forward Ben, and apologies for my slow response times! My feeling is that we're stuck between a rock and a hard place, but inconsistencies between publishDir modes is the worse outcome. As for
I think it's inevitable that people will want to publish results that contain directories (many tools have complex/nested outputs). |
Indeed, even though a directory output is harder to reason about, it seems to be convenient for the first iteration of a Nextflow process for certain tools |
I think this is solved by #5005 |
The fusion symlinks are no longer an issue, but there are other reasons to merge this:
|
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
@bentsherman What's the status of this fix? Encountering is systematically with the https://github.com/nf-core/spatialvi pipeline. Alternatively, any known workaround? |
The PR works but I'm not sure if we still want to merge it or not. We might take a different approach. I'm not sure why your issue is happening because you don't need to create directories beforehand in S3. It could be a different bug in Nextflow. Feel free to submit an issue here if you aren't able to solve it. A workaround to publishing a directory is to have a tuple of paths for each individual file in the directory. But this gets very impractical if there are many files |
Close #3372
I just changed the publish dir to explicitly publish each file in a directory. I think it means that if the publish mode is symlink, each file in the directory will be symlinked instead of just the directory.
@robsyme do we just need to add a checksum comparison to S3 copies and uploads? I think I can just rip the code from this PR #3802