#include <cuda_runtime.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int myrank;
  float *val_device, *val_host;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

  /* Assumes exactly two ranks on a node with at least two GPUs. */
  int otherRank = myrank == 0 ? 1 : 0;

  /* The part in question: first make current a device this rank does not
     otherwise use, then select the device that is meant to be mapped to
     this rank for all MPI invocations. */
  cudaSetDevice(otherRank);
  cudaSetDevice(myrank);

  int num = 1000000;
  val_host = (float *)malloc(sizeof(float) * num);
  cudaMalloc((void **)&val_device, sizeof(float) * num);

  for (int i = 0; i < 1; i++) {
    *val_host = -1.0;
    if (myrank != 0) {
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and my initial value is:", *val_host);
    }
    if (myrank == 0) {
      *val_host = 42.0;
      cudaMemcpy(val_device, val_host, sizeof(float), cudaMemcpyHostToDevice);
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and will broadcast value:", *val_host);
    }
    /* CUDA-aware broadcast: the device pointer is handed to MPI directly. */
    MPI_Bcast(val_device, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (myrank != 0) {
      cudaMemcpy(val_host, val_device, sizeof(float), cudaMemcpyDeviceToHost);
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and received broadcasted value:", *val_host);
    }
  }

  cudaFree(val_device);
  free(val_host);
  MPI_Finalize();
  return 0;
}
Then execute the above using two MPI ranks on a node with at least two GPUs available.
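For concreteness, a hypothetical way to build and launch the reproducer (the file name, wrapper names, and CUDA_HOME path are my assumptions, not part of the report; adjust for your MPI installation and scheduler):

# assumed build/launch commands; adapt include/library paths as needed
mpicc reproducer.c -I"${CUDA_HOME}/include" -L"${CUDA_HOME}/lib64" -lcudart -o reproducer
mpirun -np 2 ./reproducer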
It seems that the above example leads to data leaks for all CUDA-aware MPI implementations I have tried (Cray MPI, Open MPI, and MPICH, built with UCX or, I think but am not certain, OpenFabrics), and this is not specific to MPI_Bcast; using MPI_Send etc. has the same effect.
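For illustration, a minimal sketch (not part of the original reproducer) of the point-to-point variant: the MPI_Bcast in the loop would be replaced by a send/receive pair that still passes the device pointer straight to MPI.

/* Hypothetical variant of the loop body: same device buffers, but using
   point-to-point calls instead of the broadcast. */
if (myrank == 0)
  MPI_Send(val_device, 1, MPI_FLOAT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
else
  MPI_Recv(val_device, 1, MPI_FLOAT, /*src=*/0, /*tag=*/0, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);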
A few things I have read seem to imply that standard MPI does not support mapping multiple GPUs to a single rank, and that something like MPI endpoints or NCCL (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-3-multiple-devices-per-thread) would have to be used in such a case; a rough sketch of that NCCL pattern is included below for reference.
However, I cannot find anything that states this explicitly, so it would be good if someone could confirm it.
If that turns out not to be the case, then the above reproducer is a bug report for this situation.
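For reference, my rough understanding of the "multiple devices per thread" pattern from the linked NCCL example is sketched below. This is a hedged reconstruction based on the NCCL documentation, not code from this report; error checking and data initialization are omitted.

/* Sketch of NCCL's multiple-devices-per-thread pattern: one process owns
   both GPUs and drives them through a communicator clique. */
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  const int nDev = 2;
  int devs[2] = {0, 1};
  ncclComm_t comms[2];
  cudaStream_t streams[2];
  float *buf[2];

  /* One communicator per device, all owned by this single process. */
  ncclCommInitAll(comms, nDev, devs);

  for (int i = 0; i < nDev; i++) {
    cudaSetDevice(devs[i]);
    cudaMalloc((void **)&buf[i], sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* In-place broadcast from clique rank 0 (device 0) to both devices; the
     group calls let one thread issue work for several communicators. */
  ncclGroupStart();
  for (int i = 0; i < nDev; i++)
    ncclBcast(buf[i], 1, ncclFloat, 0, comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; i++) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
    cudaFree(buf[i]);
    cudaStreamDestroy(streams[i]);
  }
  for (int i = 0; i < nDev; i++)
    ncclCommDestroy(comms[i]);
  return 0;
}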
Thanks
To be precise: is the code example above, which maps two GPUs to a single MPI rank, valid, or is it undefined behavior under standard CUDA-aware MPI?
In particular, note the part where I set a CUDA device that I do not use in that rank before setting the device that I actually want mapped to this MPI rank for MPI invocations (the two calls are quoted below). According to e.g. these docs https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#when-do-i-need-to-select-a-cuda-device
nothing is specified that tells me this is invalid.
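For clarity, these are the two calls in question, copied from the example above:

cudaSetDevice(otherRank); /* a device this rank never uses for its own buffers */
cudaSetDevice(myrank);    /* the device actually intended for this rank's MPI buffers */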