
Failure when using more than 1 GPU in STRUMPACK MPI #126

Open
jinghu4 opened this issue Nov 28, 2024 · 7 comments



jinghu4 commented Nov 28, 2024

Hi, Dr. Ghysels,

I have seen some issues when using the multi-GPU feature of STRUMPACK to solve a sparse matrix. I built STRUMPACK successfully with support for SLATE and MAGMA.

  1. When I run the test cases in STRUMPACK with "make test", both sparse_mpi and reuse_structure_mpi fail:
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 0.178864 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
CUDA assertion failed: invalid resource handle ~/STRUMPACK-v8.0.0/STRUMPACK-8.0.0/src/dense/CUDAWrapper.cu 114
[gpu01:2817703] *** Process received signal ***

However, it passes when I run with one GPU:

OMP_NUM_THREADS=1 mpirun -n 1 test_structure_reuse_mpi pde900.mtx

  2. Random failures when solving a sparse matrix with STRUMPACK multi-GPU.
    Example: I try using 2 GPUs:
mpirun -n 2 --mca pml ucx myApplication.exe

a) sometimes it passes

OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
# DenseMPI factorization complete, GPU=1, P=2, T=10: 0.170223 seconds, 0.00550864 GFLOPS, 0.0323613 GFLOP/s,  ds=203, du=0 

(Why GPU=1 here? Does it mean it only uses one GPU, but two processes run on each of the GPUs I request?)

b) sometimes it fails with an error message:

# multifrontal factorization:
#   - estimated memory usage (exact solver) = 23.5596 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
cuSOLVER assertion failed: 6 ~/STRUMPACK-v8.0.0/STRUMPACK-8.0.0/src/dense/CUDAWrapper.cpp 614
CUSOLVER_STATUS_EXECUTION_FAILED

Do you know what could be causing these issues, and how I should resolve them?

Best,
-Jing


pghysels commented Nov 28, 2024

The GPU =1 means that GPU is enabled, otherwise it would be GPU =0.
Sorry that is confusing, I will fix that.

The OMP deprecation message is probably coming from the SLATE library.

I believe the invalid resource handle message is because multiple mpi processes are using the same GPU, and so it is using more CUDA streams than allowed per GPU.

pghysels commented

This changes the GPU =1 to GPU enabled:
115b152

pghysels commented

When you run with P mpi ranks on a machine with D GPUs, mpi rank p will use device d = p % D.


jinghu4 commented Nov 28, 2024

Yes. But what confuses me is that when I run

mpirun -n 2 myApplication

all 8 GPUs on the node show these two process IDs running.

Even when I use cudaSetDevice to assign rank 0 to GPU 0 and rank 1 to GPU 1, I can still see two processes running on both GPU 0 and GPU 1.

[screenshot attachment]

pghysels commented

Hmm, I'm not sure.
STRUMPACK calls cudaSetDevice, see here:

cudaSetDevice(rank % devs);

This is called from the SparseSolver constructor, so perhaps it overrides what you specify. But it should not use all GPUs. Maybe SLATE is doing that?

You could try to set the CUDA_VISIBLE_DEVICES environment variable. But you need to set it differently for each MPI rank.
You can do that by setting it in a small script, which you then run using mpirun, as explained here:
https://medium.com/@jeffrey_91423/binding-to-the-right-gpu-in-mpi-cuda-programs-263ac753d232


jinghu4 commented Dec 10, 2024

Hi, Dr. Ghysels,

Thank you for your reply!

The previous failure when using multi-GPU STRUMPACK with SLATE has been avoided by setting OMP_NUM_THREADS=1 as an environment variable when calling mpirun.
I am not sure if this is the right solution for the issue, but at least the failure is gone and STRUMPACK can solve Ax=b with multiple GPUs in my application.

I also used CUDA_VISIBLE_DEVICES to limit the GPU resources. Now STRUMPACK only assigns MPI tasks to those devices, e.g., with mpirun -n 2, the two MPI tasks use the two GPUs, 0 and 1.

Following are a couple of questions regarding performance:

  1. When using multiple GPUs, is the solve phase done on the CPU or the GPU? Is MAGMA still used in the multi-GPU version?
  2. StrumpackSparseMPIDist uses a block-row distributed compressed sparse row matrix as input. Does the way I partition the matrix across processes influence performance? Should I partition the matrix to make denser block rows before assigning them to each process?
  3. I have noticed weak scaling when using 2 GPUs, 4 GPUs, 8 GPUs, and so on. In your experience, does STRUMPACK run faster with more GPUs?
  4. Are there parameters that I can tune for my application to improve performance? I have noticed slow factorization and solve phases.
  5. I have noticed a lot of data movement between host and device. Are there options to decrease this data movement?

Thank you very much! Looking forward to your reply.

Best,
-Jing

pghysels commented

Are you using MPI_Init_thread with MPI_THREAD_MULTIPLE? This is required for SLATE.
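For reference, a minimal sketch of that initialization pattern (the solver setup itself is elided; this only shows the MPI_Init_thread call and the check that the requested thread level was actually granted):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    // Request full thread support, as SLATE requires: with
    // MPI_THREAD_MULTIPLE, multiple threads may make MPI calls
    // concurrently. 'provided' reports what the library granted,
    // which may be less than requested.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        std::fprintf(stderr,
                     "warning: MPI_THREAD_MULTIPLE not available\n");

    // ... construct the distributed sparse solver, factor, solve ...

    MPI_Finalize();
    return 0;
}
```

Plain MPI_Init only guarantees MPI_THREAD_SINGLE, which is not enough when a threaded library like SLATE issues MPI calls from multiple threads.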

MAGMA is still used for the factorization in the multi-GPU setting, but only for the local subtrees, while other parts of the code use SLATE.
However, the solve phase does not use MAGMA when running with multiple MPI ranks.

Indeed scaling with multiple GPUs is not very good. The problem really needs to be big enough.

You can try running with --sp_enable_METIS_NodeNDP (see also #127). This can lead to better performance, and better scaling.

There is not much to be done about the data movement for now.
