Replies: 1 comment
-
I understand this can be seen as an unnecessary limitation, however there are technical reasons why it's not supported. First of all, some context. Module binaries are an opt-in feature because they are not generally supported by all executors. Therefore we require users to enable the feature explicitly, so that they are aware of the existing limitations. Currently it's supported by 1) local and grid executors and 2) cloud executors via Wave.

The reason for this lies in the underlying storage used by each executor. In the first case, since those binaries are accessible either in the local or in a shared file system, it's enough to add their paths to the task PATH. When using a cloud executor, things are more complicated, because the lack of a shared file system would require uploading those scripts to the object storage and then downloading them in the container, somewhere accessible to the task.

You could argue that this is what's already done for workflow-level binaries. One reason we did not replicate that approach is that in the case of workflow binaries, being a single, predictable path, it's fairly simple to upload it to a bucket and then download it in the tasks. Having many modules, each of them with its own binaries, would require handling different namespaces and some synchronization strategy to avoid multiple runs of the same process uploading the same binaries multiple times, etc. Not impossible, but unneeded complexity when a better alternative exists. Another reason is that the mechanism used to share the workflow binaries with cloud tasks is extremely fragile, hard to configure, error-prone, sub-optimal in terms of performance and a hassle to maintain. Not to mention that it requires different ugly glue code, like the one you reported, for each cloud vendor.

This is the reason we decided to implement the support for module binaries for cloud executors using Wave. This approach allows us to rely on Wave's augmentation capabilities to include the task binaries directly in the container, with a number of advantages.
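For reference, a minimal sketch of the opt-in configuration this refers to (the Platform token line is optional and purely illustrative):

```groovy
// nextflow.config -- minimal sketch of the opt-in described above
nextflow.enable.moduleBinaries = true    // explicit opt-in for module binaries

// only relevant for cloud executors, where module binaries are delivered via Wave
wave.enabled = true
// tower.accessToken = '<your token>'    // optional: authenticated Wave usage
```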
Yes, it requires an extra service, however it's provided OSS and free of charge to any Platform user, and it can even be installed on-prem. I think it's a fair assumption to rely on a web service when working with cloud and containers, otherwise not even docker.io would be an option. Last but not least, we consider accessing S3 files from AWS Batch jobs via the command line (and similarly for other clouds) a legacy solution for the reasons mentioned (unreliable, hard to configure and maintain, etc.) and plan to converge more over the years to service-based solutions in which Wave and Fusion are central components to deploy pipelines at scale in the (any) cloud.
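As a hedged sketch of the kind of service-based setup referred to here, assuming AWS Batch (queue, region and bucket names are placeholders):

```groovy
// nextflow.config -- hypothetical AWS Batch deployment using Wave + Fusion
process.executor = 'awsbatch'
process.queue    = 'my-batch-queue'        // placeholder queue name
aws.region       = 'eu-west-1'             // placeholder region
workDir          = 's3://my-bucket/work'   // placeholder work bucket (must be S3 for Fusion)

wave.enabled   = true                      // container provisioning and augmentation
fusion.enabled = true                      // S3 access via the Fusion file system
```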
-
I am investigating the usage of Nextflow's module binaries to be able to bundle a per-module `bin` directory and scripts, as described here: https://www.nextflow.io/docs/latest/module.html#module-binaries

Notably, it seems like when using cloud execution such as AWS Batch, it is a requirement to use Wave containers.
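For context, this is roughly the layout I am talking about (module and script names are made up):

```groovy
// modules/local/foo/main.nf -- hypothetical module bundling its own script under
// modules/local/foo/resources/usr/bin/foo_helper.py (which must be executable)

process FOO {
    input:
    path reads

    output:
    path 'foo.out'

    script:
    """
    foo_helper.py $reads > foo.out
    """
}
```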
This seems like an arbitrary and unnecessary requirement, especially considering that the top-level pipeline `bin` directory does not have this requirement. In fact, if you inspect the Nextflow `.command.run` file you can see how this one is handled. It's clear that Nextflow already has the capability to upload the workflow `bin` dir to S3, then pull it back down inside the running Nextflow job and add it to the `$PATH` during execution. So it does not really make sense why such a heavyweight solution like Wave would be required to do the exact same thing for the module's `bin` dir.

"Just use Wave containers" is not really a good solution for this, considering that Wave is an external service dependency which was not previously needed for this same functionality elsewhere in Nextflow, and usage of Wave is not feasible if your infrastructure has other considerations that are not addressed by the implementation of Wave. So the end result is: if you cannot run Wave, or if Wave is not available, you just cannot use this Nextflow feature. This does not seem like a good position to be in, as a pipeline developer.
I think a better solution for this would be for Nextflow to just use the same behaviour for module binaries as it does for the workflow's `bin` dir: cache it in S3 during execution, pull it back down locally on the job's EC2 instance, and add it as an extra location in `PATH`. It's not really clear to me why this is not already the default behavior. Maybe we can get this behavior enabled as an option instead of forcing Wave usage?
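To illustrate, this is roughly the kind of thing one has to hand-roll today without Wave: the same hypothetical FOO module as above, but staging its own scripts from S3 inside the task (bucket and image names are placeholders, and it assumes the container has the AWS CLI and read access to the bucket):

```groovy
// Hypothetical manual workaround, not a built-in feature: the task stages its own
// helper scripts from S3 and prepends them to PATH before running.
process FOO {
    container 'my-registry/foo:1.0'    // placeholder image that includes the AWS CLI

    input:
    path reads

    output:
    path 'foo.out'

    script:
    """
    mkdir -p module-bin
    aws s3 cp s3://my-bucket/assets/modules/foo/bin module-bin/ --recursive
    chmod +x module-bin/*
    export PATH="\$PWD/module-bin:\$PATH"

    foo_helper.py $reads > foo.out
    """
}
```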