
Failure when using more than 1 GPU in STRUMPACK MPI #126

Open
jinghu4 opened this issue Nov 28, 2024 · 7 comments



jinghu4 commented Nov 28, 2024

Hi, Dr. Ghysels,

I have seen some issues when using the multi-GPU feature of STRUMPACK to solve a sparse matrix. I built STRUMPACK successfully with support for SLATE and MAGMA.

  1. When I run the test cases in STRUMPACK with "make test", both sparse_mpi and reuse_structure_mpi fail:
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 0.178864 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
CUDA assertion failed: invalid resource handle ~/STRUMPACK-v8.0.0/STRUMPACK-8.0.0/src/dense/CUDAWrapper.cu 114
[gpu01:2817703] *** Process received signal ***

However, it passes when I run with one GPU:

OMP_NUM_THREADS=1 mpirun -n 1 test_structure_reuse_mpi pde900.mtx

  2. Random failures when solving a sparse matrix with STRUMPACK multi-GPU.
    Example: I try using 2 GPUs:
mpirun -n 2 --mca pml ucx myApplication.exe

a) sometimes it passes

OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
# DenseMPI factorization complete, GPU=1, P=2, T=10: 0.170223 seconds, 0.00550864 GFLOPS, 0.0323613 GFLOP/s,  ds=203, du=0 

(Why GPU=1 here? Does it mean it only uses one GPU, but two processes run on each of the GPUs I request?)

b) sometimes it fails with an error message:

# multifrontal factorization:
#   - estimated memory usage (exact solver) = 23.5596 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
cuSOLVER assertion failed: 6 ~/STRUMPACK-v8.0.0/STRUMPACK-8.0.0/src/dense/CUDAWrapper.cpp 614
CUSOLVER_STATUS_EXECUTION_FAILED

Do you know what could be causing these issues, and how I should resolve them?

Best,
-Jing


pghysels commented Nov 28, 2024

The GPU =1 means that GPU is enabled, otherwise it would be GPU =0.
Sorry that is confusing, I will fix that.

The OMP deprecation message is probably coming from the SLATE library.

I believe the invalid resource handle message is because multiple mpi processes are using the same GPU, and so it is using more CUDA streams than allowed per GPU.

pghysels commented

This changes the GPU =1 to GPU enabled:
115b152

pghysels commented

When you run with P mpi ranks on a machine with D GPUs, mpi rank p will use device d = p % D.


jinghu4 commented Nov 28, 2024

Yes. But what confuses me is that when I run

mpirun -n 2 myApplication

all 8 GPUs on the node show these two process IDs running.

Even when I use cudaSetDevice to assign rank 0 to GPU 0 and rank 1 to GPU 1, I can still see two processes running on both GPU 0 and GPU 1.

[screenshot attachment]

pghysels commented

Hmm, I'm not sure.
STRUMPACK calls cudaSetDevice, see here:

cudaSetDevice(rank % devs);

This is called from the SparseSolver constructor, so perhaps it overrides what you specify. But it should not use all GPUs. Maybe SLATE is doing that?

You could try to set the CUDA_VISIBLE_DEVICES environment variable. But you need to set it differently for each MPI rank.
You can do that by setting it in a small script, which you then run using mpirun, as explained here:
https://medium.com/@jeffrey_91423/binding-to-the-right-gpu-in-mpi-cuda-programs-263ac753d232


jinghu4 commented Dec 10, 2024

Hi, Dr. Ghysels,

Thank you for your reply!

The previous failure when using multi-GPU STRUMPACK with SLATE has been avoided by setting OMP_NUM_THREADS=1 as an environment variable when calling mpirun.
I am not sure if this is the right solution for the issue, but at least the failure is gone and STRUMPACK can solve Ax=b with multiple GPUs in my application.

I also used CUDA_VISIBLE_DEVICES to limit the GPU resources. Now STRUMPACK only assigns MPI tasks to those devices, e.g., with mpirun -n 2, the two MPI tasks use the two GPUs, 0 and 1.

Following are a couple of questions regarding performance:

  1. When using multiple GPUs, is the solve phase done on the CPU or the GPU? Is MAGMA still used in the multi-GPU version?
  2. StrumpackSparseMPIDist uses a block-row distributed compressed sparse row matrix as input. Does the way I partition the matrix across processes influence performance? Should I partition the matrix to make denser block rows before assigning them to each process?
  3. I have noticed weak scaling when using 2 GPUs, 4 GPUs, 8 GPUs, and so on. In your experience, does STRUMPACK run faster with more GPUs?
  4. Are there parameters that I can tune for my application to improve performance? I have noticed slow factorization and solve phases.
  5. I have noticed a lot of data movement between host and device. Are there options to decrease this data movement?

Thank you very much! Looking forward to your reply.

Best,
-Jing

pghysels commented

Are you using MPI_Init_thread with MPI_THREAD_MULTIPLE? This is required for SLATE.
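For reference, a minimal sketch of that initialization pattern (the solver setup itself is elided; this only shows the MPI_Init_thread call and the check that the requested thread level was actually granted):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    // Request full thread support, as SLATE requires: with
    // MPI_THREAD_MULTIPLE, multiple threads may make MPI calls
    // concurrently. 'provided' reports what the library granted,
    // which may be less than requested.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        std::fprintf(stderr,
                     "warning: MPI_THREAD_MULTIPLE not available\n");

    // ... construct the distributed sparse solver, factor, solve ...

    MPI_Finalize();
    return 0;
}
```

Plain MPI_Init only guarantees MPI_THREAD_SINGLE, which is not enough when a threaded library like SLATE issues MPI calls from multiple threads.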

MAGMA is still used for the factorization in the multi-GPU setting, but only for the local subtrees, while other parts of the code use SLATE.
However, the solve phase does not use MAGMA when running with multiple MPI ranks.

Indeed scaling with multiple GPUs is not very good. The problem really needs to be big enough.

You can try running with --sp_enable_METIS_NodeNDP (see also #127). This can lead to better performance, and better scaling.

There is not much to be done about the data movement for now.
