
hpcrun hangs, please assist #635

Open · mfbarad opened this issue Oct 31, 2022 · 5 comments

mfbarad commented Oct 31, 2022

When I run our executable under hpcrun, the code gets stuck on startup. Our code is dynamically linked and uses MPI. I have tried running it in various ways and it still hangs; the simplest version is as follows:

hpcrun app inputfile

Also hanging:
hpcrun -t -e PAPI_TOT_CYC app inputfile
mpiexec -perhost 2 hpcrun -t -e PAPI_TOT_CYC app inputfile

I put a std::cout << "debug" << std::endl as the first statement in main, and that never shows up. In another terminal, top shows the processes each using 100% of a core, so they seem to be doing something. Is this just a matter of us not waiting long enough? I am not sure how long to let it run, as there is no indicator of progress.

When I do "hpcrun ls" it does not hang and seems to produce something usable.

I built hpctoolkit using spack following your install directions. We are on TOSS3 / RHEL7.

We are new to hpctoolkit so likely we are making a simple mistake.
Thanks,
Mike

jmellorcrummey (Member) commented

In general, HPCToolkit supports measuring dynamically-linked MPI applications.

When you say that

hpcrun app inputfile

hangs, do you mean an application compiled with OpenMPI that is expected to self-launch when run without an MPI launcher? If so, that is a known issue; use an MPI launcher to work around it.

mpiexec -perhost 2 hpcrun -t -e PAPI_TOT_CYC app inputfile

This should work. Are you running the 2022.10 release of HPCToolkit? Does your MPI application use GPUs?

jmellorcrummey self-assigned this Nov 1, 2022

mfbarad commented Nov 1, 2022

Hi John,

$ hpcrun --version
hpcrun: A member of HPCToolkit, version 2022.10.01-release
git branch: unknown (not a git repo)
spack spec: hpctoolkit@2022.10.01%gcc@10.2.0~cray~cuda~debug~level_zero~mpi~opencl+papi~rocm+viewer build_system=autotools arch=linux-rhel7-x86_64/x4hfzn4
install dir: /swbuild/mbarad/LAVA_GROUP/LAVA_deps/spack/linux-rhel7-x86_64/gcc-10.2.0/hpctoolkit-2022.10.01-x4hfzn4wjy67u36chwneyciibnimekge

This version of the app is CPU only (no GPUs). The MPI is HPE MPT, on NASA's Pleiades supercomputer.

Thanks for your help,
Mike

mfbarad commented Nov 7, 2022

Is there anything else that I can do to help figure this out? It would be great to get it working. We have a bunch of NASA users who will benefit from this. Thanks

jmellorcrummey commented Nov 7, 2022

Can you give us a backtrace from a hanging process? Attach to one of your MPI ranks with gdb and then ask for a backtrace using the backtrace command.

That will give us a sense of what is happening and hopefully help us understand how to fix the problem.
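
For example, assuming <pid> is the process ID of one of the hung ranks on the compute node (find it with top or ps; <pid> here is just a placeholder), something along these lines should print the stack. Treat it as a sketch rather than an exact recipe:

gdb -batch -p <pid> -ex backtrace

or, interactively:

gdb -p <pid>
(gdb) backtrace
(gdb) detach
(gdb) quit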

You might try using a trivial MPI program instead of your real application to see if that also causes the hang.

We have some simple regression tests for this purpose.

git clone https://github.com/hpctoolkit/hpctoolkit-tests
cd hpctoolkit-tests/applications/loop-suite/5.loop-mpi-cputime

make

If you have an mpicc in your path, this will build and attempt to run the binary. You may need to launch the binary yourself on the compute node with

mpiexec -perhost 2 hpcrun -t -e CPUTIME ./loop

If that works, you can also try

mpiexec -perhost 2 hpcrun -t -e cycles ./loop

and

mpiexec -perhost 2 hpcrun -t -e PAPI_TOT_CYC ./loop
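
If the loop-suite is inconvenient to build on the compute nodes, any trivial MPI program would serve the same purpose. Here is a minimal sketch; the file name mpi_hello.c and the build/run lines are illustrative (they assume MPT's mpicc and the same mpiexec options as above), not part of our test suite:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                 /* initialize the MPI runtime */
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down MPI */
    return 0;
}

mpicc -o mpi_hello mpi_hello.c
mpiexec -perhost 2 hpcrun -t -e CPUTIME ./mpi_hello

If that minimal program also hangs under hpcrun, that would suggest the problem is in the interaction between hpcrun and the MPI runtime rather than in your application.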

jmellorcrummey commented Nov 7, 2022

I'll note that we have had trouble with HPE MPI before. That led us to write https://bit.ly/glibc-ldaudit and to engage with Red Hat to fix a Linux monitoring interface we need that had long been broken.

If you look at the motivation section, you'll see that our introduction complains about what HPE's SGI MPI does; that may be related to your trouble.
