Replies: 1 comment
-
I understand this can be seen as an unnecessary limitation, however there are technical reasons why it's not supported. First of all, some context. Module binaries are an opt-in feature because they are not generally supported by all executors. Therefore we require users to enable the feature explicitly, so that they are aware of the existing limitations. Currently it's supported by 1) local and grid executors and 2) cloud executors via Wave.

The reason for this lies in the underlying storage used by each executor. In the first case, since those binaries are accessible either in the local or in a shared file system, it's enough to add their paths to the task PATH. When using a cloud executor, things are more complicated, because the lack of a shared file system would require uploading those scripts to the object storage and then downloading them in the container, somewhere accessible to the task.

You could argue that this is what's already done for workflow-level binaries. One reason we did not replicate that approach is that in the case of workflow binaries, being a single, predictable path, it's fairly simple to upload it to a bucket and then download it in the tasks. Having many modules, each of them with its own binaries, would require handling different namespaces and some synchronization strategy to avoid multiple runs of the same process uploading the same binaries multiple times, etc. Not impossible, but unneeded complexity when a better alternative exists. Another reason is that the mechanism used to share the workflow binaries with cloud tasks is extremely fragile, hard to configure, error-prone, sub-optimal in terms of performance and a hassle to maintain. Not to mention that it requires different ugly glue code, like the one you reported, for each cloud vendor.

This is the reason we decided to implement the support for module binaries for cloud executors using Wave. This approach allows us to rely on Wave's augmentation capabilities to include the task binaries directly in the container, with a number of advantages.
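For reference, a minimal sketch of the opt-in configuration this refers to (the Platform token line is optional and purely illustrative):

```groovy
// nextflow.config -- minimal sketch of the opt-in described above
nextflow.enable.moduleBinaries = true    // explicit opt-in for module binaries

// only relevant for cloud executors, where module binaries are delivered via Wave
wave.enabled = true
// tower.accessToken = '<your token>'    // optional: authenticated Wave usage
```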
Yes, it requires an extra service, however it's provided OSS and free of charge to any Platform user, and it can even be installed on-prem. I think it's a fair assumption to rely on a web service when working with cloud and containers, otherwise not even docker.io would be an option. Last but not least, we consider accessing S3 files from AWS Batch jobs via the command line (and similarly for other clouds) a legacy solution for the reasons mentioned (unreliable, hard to configure and maintain, etc.) and plan to converge more over the years to service-based solutions in which Wave and Fusion are central components to deploy pipelines at scale in the (any) cloud.
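As a hedged sketch of the kind of service-based setup referred to here, assuming AWS Batch (queue, region and bucket names are placeholders):

```groovy
// nextflow.config -- hypothetical AWS Batch deployment using Wave + Fusion
process.executor = 'awsbatch'
process.queue    = 'my-batch-queue'        // placeholder queue name
aws.region       = 'eu-west-1'             // placeholder region
workDir          = 's3://my-bucket/work'   // placeholder work bucket (must be S3 for Fusion)

wave.enabled   = true                      // container provisioning and augmentation
fusion.enabled = true                      // S3 access via the Fusion file system
```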
-
I am investigating the usage of Nextflow's module binaries to be able to bundle a per-module `bin` directory and scripts, as described here: https://www.nextflow.io/docs/latest/module.html#module-binaries

Notably, it seems like when using cloud execution such as AWS Batch, it is a requirement to use Wave containers.
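For context, this is roughly the layout I am talking about (module and script names are made up):

```groovy
// modules/local/foo/main.nf -- hypothetical module bundling its own script under
// modules/local/foo/resources/usr/bin/foo_helper.py (which must be executable)

process FOO {
    input:
    path reads

    output:
    path 'foo.out'

    script:
    """
    foo_helper.py $reads > foo.out
    """
}
```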
This seems like an arbitrary and unnecessary requirement, especially considering that the top-level pipeline `bin` directory does not have this requirement. In fact, if you inspect the Nextflow `.command.run` file you can see how this one is handled. It's clear that Nextflow already has the capability to upload the workflow `bin` dir to S3, then pull it back down inside the running Nextflow job and add it to the `$PATH` during execution. So it does not really make sense why such a heavyweight solution like Wave would be required to do the exact same thing for the module's `bin` dir.

"Just use Wave containers" is not really a good solution for this, considering that Wave is an external service dependency which was not previously needed for this same functionality elsewhere in Nextflow, and usage of Wave is not feasible if your infrastructure has other considerations that are not addressed by the implementation of Wave. So the end result is: if you cannot run Wave, or if Wave is not available, you just cannot use this Nextflow feature. This does not seem like a good position to be in, as a pipeline developer.
I think a better solution for this would be for Nextflow to just use the same behaviour for module binaries as it does for the workflow's `bin` dir: cache it in S3 during execution, pull it back down locally on the job's EC2 instance, and add it as an extra location in `PATH`. It's not really clear to me why this is not already the default behavior. Maybe we can get this behavior enabled as an option instead of forcing Wave usage?
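To illustrate, this is roughly the kind of thing one has to hand-roll today without Wave: the same hypothetical FOO module as above, but staging its own scripts from S3 inside the task (bucket and image names are placeholders, and it assumes the container has the AWS CLI and read access to the bucket):

```groovy
// Hypothetical manual workaround, not a built-in feature: the task stages its own
// helper scripts from S3 and prepends them to PATH before running.
process FOO {
    container 'my-registry/foo:1.0'    // placeholder image that includes the AWS CLI

    input:
    path reads

    output:
    path 'foo.out'

    script:
    """
    mkdir -p module-bin
    aws s3 cp s3://my-bucket/assets/modules/foo/bin module-bin/ --recursive
    chmod +x module-bin/*
    export PATH="\$PWD/module-bin:\$PATH"

    foo_helper.py $reads > foo.out
    """
}
```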