We're getting incorrect results in application code when using MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 if the buffer size MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ isn't set large enough. Setting it larger seems to work, but MPI should still give correct results (with a possible performance hit) or report an error if it is not set large enough. The full code is fairly complicated, but I have a simple reproducer which can somewhat reproduce the issue. The reproducer fails easily if the buffer size is set lower than the default, but it doesn't seem to fail with the default size on up to 8 nodes. With a buffer size of 512k it fails easily on 4 nodes, and with 256k it fails regularly on 2 nodes.
Reproducer
sendrecvgpu.cc
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <sycl/sycl.hpp>
//const int nmesg = 2;
const int nmesg = 16;
//const int nmesg = 24;
//const int nmesg = 32;
//const int nrep = 1;
const int nrep = 1000;
//const int nrep = 10000;
//const int nrep = 20000;
const int nmin = 128*1024;
//const int nmax = 128*1024;
//const int nmin = 256*1024;
const int nmax = 256*1024;
//const int nmin = 2*1024*1024;
//const int nmax = 2*1024*1024;
void sendrecv(double *dest[], double *src[], int n) {
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Request sreq[nmesg], rreq[nmesg];
  for(int i=0; i<nmesg; i++) {
    int k = 1 << i;
    int recv = (rank+k) % size;
    MPI_Irecv(dest[i], n, MPI_DOUBLE, recv, i, MPI_COMM_WORLD, &rreq[i]);
  }
  for(int i=0; i<nmesg; i++) {
    int k = 1 << i;
    int send = (rank+k*size-k) % size;
    MPI_Isend(src[i], n, MPI_DOUBLE, send, i, MPI_COMM_WORLD, &sreq[i]);
  }
  // MPI_Waitall takes an array of statuses, so the correct constant is
  // MPI_STATUSES_IGNORE (not MPI_STATUS_IGNORE)
  MPI_Waitall(nmesg, sreq, MPI_STATUSES_IGNORE);
  MPI_Waitall(nmesg, rreq, MPI_STATUSES_IGNORE);
}
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  char name[MPI_MAX_PROCESSOR_NAME];
  int namelen;
  MPI_Get_processor_name(name, &namelen);
  //sycl::queue q{sycl::gpu_selector_v};
  sycl::platform plat{sycl::gpu_selector_v};
  auto devs = plat.get_devices();
  int ndev = devs.size();
  int devid = rank % ndev;
  printf("%s rank %3i device %2i\n", name, rank, devid);
  fflush(stdout);
  MPI_Barrier(MPI_COMM_WORLD);
  sycl::queue q{devs[devid]};
  double *src[nmesg], *srcg[nmesg], *dest[nmesg], *destg[nmesg];
  for(int i=0; i<nmesg; i++) {
    src[i] = (double*)malloc(nmax*sizeof(double));
    srcg[i] = sycl::malloc_device<double>(nmax, q);
    dest[i] = (double*)malloc(nmax*sizeof(double));
    destg[i] = sycl::malloc_device<double>(nmax, q);
    #pragma omp parallel for
    for(int j=0; j<nmax; j++) {
      src[i][j] = i + j;
    }
  }
  int error = 0;
  int errort = 0;
  for(int n=nmin; n<=nmax; n*=2) {
    if(rank==0) printf("Testing n = %i ...", n);
    for(int rep=0; rep<nrep; rep++) {
      //sendrecv(dest, src, n);
      for(int i=0; i<nmesg; i++) {
        q.memcpy(srcg[i], src[i], n*sizeof(double));
        q.memset(destg[i], 0, n*sizeof(double));
      }
      q.wait();
      sendrecv(destg, srcg, n);
      for(int i=0; i<nmesg; i++) {
        q.memcpy(dest[i], destg[i], n*sizeof(double));
      }
      q.wait();
      for(int i=0; i<nmesg; i++) {
        for(int j=0; j<n; j++) {
          if (dest[i][j] != src[i][j]) {
            printf("\n error %i dest[%i][%i] = %f expected %f\n", rep, i, j, dest[i][j], src[i][j]);
            error++;
            break;
          }
        }
        if(error>0) break;
      }
      MPI_Allreduce(&error, &errort, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      if (errort>0) break;
    }
    if(errort>0) {
      if (rank==0) printf(" %i errors.\n", errort);
      break;
    } else {
      if (rank==0) printf(" done.\n");
    }
  }
  MPI_Finalize();
}
I can reproduce this on Aurora with commit d79cd238209c787bbcbbe730f9b958afe4e852ac (updated with a test from the newer commit b3480ddfec1d9e98b06783aec97c082eadeca1a7) from main as well.
Thanks for the reproducer. It appears that in GPU pipelining there are scenarios where chunks are potentially written into receive buffers out of order. I created PR #7182 to fix it.
I confirmed that the reproducer passes with module load mpich/opt/develop-git.204f8cd on Aurora (which includes PR #7182). @jcosborn, if you have a chance to test out this module, it would be appreciated!
I also confirmed this fixes the reproducer; however, I now get hangs in some specific cases when running a full application with pipelining without setting a larger buffer size. The failing cases seem to involve messages of different sizes, where some messages are much larger than the rest. I don't know the exact requirements yet and don't have a simple reproducer, but I will keep trying to produce one.
Build and run sendrecvgpu.cc:
mpicxx -fsycl -qopenmp sendrecvgpu.cc -o sendrecvgpu
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((256*1024))
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
mpiexec -np 24 --ppn 12 ./sendrecvgpu