hpcrun hangs, please assist #635
In general, HPCToolkit supports measuring dynamically linked MPI applications. When you say that it fails, do you mean with an application compiled with OpenMPI that self-launches when run without an MPI launcher? If so, that is a known issue; use an MPI launcher to work around it.
This should work. Are you running the 2022.10 release of HPCToolkit? Does your MPI application use GPUs?
Hi John,

$ hpcrun --version

This version of the app is CPU only (no GPUs). The MPI is HPE MPT, on NASA's Pleiades supercomputer.

Thanks for your help,
Is there anything else I can do to help figure this out? It would be great to get it working; we have a bunch of NASA users who will benefit from this. Thanks.
Can you give us a backtrace from a hanging process? Attach to one of your MPI ranks with gdb and ask for a backtrace using the backtrace command. That will give us a sense of what is happening and hopefully help us understand how to fix the problem.

You might try using a trivial MPI program instead of your real application to see whether that also causes the hang. We have some simple regression tests for this purpose:

git clone https://github.com/hpctoolkit/hpctoolkit-tests
make

If you have an mpicc in your path, this will build and attempt to run the binary. You may need to launch the binary yourself on the compute node with

mpiexec -perhost 2 hpcrun -t -e CPUTIME ./loop

If that works, you can also try

mpiexec -perhost 2 hpcrun -t -e cycles ./loop

and

mpiexec -perhost 2 hpcrun -t -e PAPI_TOT_CYC ./loop
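The attach step above can be sketched as a single non-interactive gdb invocation. This is only a sketch: `app` stands in for your MPI executable's name (an assumption), and the command must run on the compute node where the rank is stuck.

```shell
# Find the newest process whose name matches "app" (hypothetical binary name)
pid=$(pgrep -n app)

# Attach in batch mode and dump a backtrace from every thread;
# gdb detaches when the batch script finishes
gdb -p "$pid" -batch -ex "thread apply all backtrace"
```

Because batch mode detaches rather than killing the target, the hanging rank is left intact for further inspection or for repeating the backtrace later.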
I'll note that we had some trouble with HPE MPI before. That led us to write the following, https://bit.ly/glibc-ldaudit, and to engage with Red Hat to fix a Linux monitoring interface we need that had been broken for a long time. If you look at the motivation, you'll see that our introduction complains about what HPE's SGI MPI does. That may be related to your trouble.
When I run our executable, the code gets stuck at startup. Our code is dynamically linked and uses MPI. I have tried running it various ways and it still hangs; the simplest invocation is as follows:
hpcrun app inputfile
Also hanging:
hpcrun -t -e PAPI_TOT_CYC app inputfile
mpiexec -perhost 2 hpcrun -t -e PAPI_TOT_CYC app inputfile
I put a std::cout << "debug" << std::endl; as the first statement in main, and it never shows up. In another terminal, top shows the code using 100% of each core, so it seems to be doing something. Is this just a matter of us not waiting long enough? I'm not sure how long to let it run, as there is no indicator of progress.
When I run "hpcrun ls", it does not hang and seems to produce something usable.
I built HPCToolkit using Spack, following your install directions. We are on TOSS3 / RHEL7.
We are new to HPCToolkit, so we are likely making a simple mistake.
Thanks,
Mike