Ability to dedicate accelerator directive (GPU) over range #5570

Open
adamrtalbot opened this issue Dec 4, 2024 · 4 comments

@adamrtalbot
Collaborator

New feature

When submitting to nodes with >1 GPU, Nextflow has very limited capabilities to split the work over separate GPUs.

Usage scenario

Let's imagine running on AWS Batch. We submit multiple GPU-enabled tasks to the Batch service, and AWS allocates them to a single, large instance with multiple GPUs, as per its allocation strategy, which prioritises the cheapest price per vCPU.

On such an instance, all tasks can see and use all GPUs at the same time, leading to collisions and GPU memory issues.

We have some strategies to deal with this today (a minimal config sketch follows the list):

  • Use a specific machine size that can only fit a single GPU-enabled task
  • Use environment variables such as NVIDIA_VISIBLE_DEVICES to point each task at a specific GPU
  • Set maxForks to 1 to ensure only a single task executes at a time

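A minimal nextflow.config sketch of the last two workarounds (assuming the GPU processes carry a `gpu` label, and a hard-coded device index):

```groovy
// Sketch only: pin every 'gpu'-labelled task to device 0 and serialise
// execution so tasks never share a GPU (which leaves the other GPUs idle).
process {
    withLabel: 'gpu' {
        maxForks     = 1                                    // one GPU task at a time
        beforeScript = 'export NVIDIA_VISIBLE_DEVICES=0'    // hard-coded device index
    }
}
```
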
However, we lack a way of saying "given N available GPUs, assign each task to exactly one of them".

What I want is for each task to know which GPU it can use, and then use only that GPU.

Suggested implementation

I don't actually have a good fix here. Perhaps using a process array with an index might help? Perhaps it's specific to each executor? But I feel like Nextflow could expose a variable to help us here.
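As a rough illustration of the kind of thing I mean (not a real proposal), the closest I can get today is deriving a device index from task.index, which is only safe if exactly that many tasks of the process land on each node, and Nextflow can't guarantee that:

```groovy
// Hypothetical sketch: spread tasks over 4 GPUs by task index. Assumes
// exactly 4 GPUs per node and that concurrently running tasks never end
// up on the same device, which Nextflow cannot enforce today.
// 'my_gpu_tool' is a placeholder command.
process GPU_TASK {
    accelerator 1
    maxForks 4

    input:
    path sample

    script:
    def gpuId = (task.index - 1) % 4     // task.index starts at 1
    """
    export CUDA_VISIBLE_DEVICES=${gpuId}
    my_gpu_tool --input ${sample}
    """
}
```
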

@bentsherman
Member

I think this needs to be addressed by the executor, i.e. AWS Batch or SLURM should allow multiple tasks to request 1 GPU each, pack them onto a multi-GPU node as they would for CPUs, and set NVIDIA_VISIBLE_DEVICES for each task.

Otherwise Nextflow would basically have to become the executor by tracking the VM assignments for each task in order to figure out which GPUs are available at any given time.
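For AWS Batch, the request side of this is already roughly expressible; it's the device assignment that belongs to the executor/agent. A sketch, assuming a GPU-enabled compute environment (the queue name is a placeholder):

```groovy
// Sketch: request one GPU per task and let Batch/ECS handle device
// assignment, rather than Nextflow tracking GPUs itself.
process {
    executor = 'awsbatch'
    queue    = 'my-gpu-queue'   // placeholder queue name
    withLabel: 'gpu' {
        accelerator = 1         // requests a GPU in the Batch job definition
    }
}
```
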

@FloWuenne

@bentsherman I believe SLURM is already doing this? I did some testing yesterday, and setting the following in nextflow.config tells SLURM to request 1 GPU for each task.

```groovy
process {
    clusterOptions = '--gres=gpu:1'  // request 1 GPU per task
}
```

If submitted to a multi-GPU node, different GPUs are assigned to the different tasks, but each one is mapped to device 0 inside its own task. So I believe CUDA_VISIBLE_DEVICES is 0 inside all tasks, even when they are running in parallel on different GPUs.
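A quick way to check what each task actually sees (sketch):

```groovy
// Sketch: print the device mapping inside a SLURM-scheduled task.
process CHECK_GPU {
    clusterOptions '--gres=gpu:1'

    script:
    """
    echo "CUDA_VISIBLE_DEVICES=\$CUDA_VISIBLE_DEVICES"
    nvidia-smi -L    # lists only the device(s) visible to this job
    """
}
```
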

@bentsherman
Member

For SLURM it may depend on the individual cluster setup. The sysadmin can (probably) use cgroups to isolate GPUs just like you would for CPUs and memory, so that a job only sees the requested resources even if the underlying node has more.

Setting CUDA_VISIBLE_DEVICES is also probably something that would have to be configured by the sysadmin, but if cgroups work then you don't need to bother with that environment variable.
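Concretely, that sysadmin-side setup is usually something like the following (illustrative file contents only, not a recommendation for any particular cluster):

```
# /etc/slurm/cgroup.conf
ConstrainDevices=yes        # jobs only see the GPUs they requested

# /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```
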

@bentsherman
Member
