Describe the issue
When dynamic MPI support is enabled, we build the libcorenrnmpi_.so library. If this library is built with OpenACC flags (e.g. -acc), then the program crashes in the exit handler:
salloc --account=proj16 --partition=prod_p2 --time=08:00:00 --nodes=1 --constraint=v100 --gres=gpu:4 -n 40 --mem 0 --exclusive
module purge
module load unstable nvhpc/21.2 hpe-mpi cuda cmake
git clone --depth 1 [email protected]:neuronsimulator/nrn.git
git clone --depth 1 [email protected]:BlueBrain/CoreNeuron.git
cd CoreNeuron && mkdir BUILD && cd BUILD
cmake -DCORENRN_ENABLE_DYNAMIC_MPI=ON -DCMAKE_CXX_FLAGS="-acc" -DCMAKE_C_COMPILER=nvc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc ..
./bin/nrnivmodl-core ../../nrn/test/coreneuron/mod/
srun -n 1 ./x86_64/special-core --mpi -d ../coreneuron/tests/integration/ring

.........
Solver Time : 0.0748029

 Simulation Statistics
 Number of cells: 5
 Number of compartments: 115
 Number of presyns: 28
 Number of input presyns: 0
 Number of synapses: 15
 Number of point processes: 38
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 9
 Number of spikes with non negative gid-s: 9
CoreNEURON run........

MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
        Process ID: 33265, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/x86_64/special.nrn
        MPT Version: HPE HMPT 2.22 03/31/20 16:17:35
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/33265/exe, process 33265
MPT: [New LWP 33310]
MPT: [New LWP 33309]
MPT: [New LWP 33283]
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
...
MPT: done.
MPT: 0x00002aaaad9961d9 in waitpid () from /lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: debuginfo-install bbp-nvidia-driver-470.57.02-2.x86_64 glibc-2.17-324.el7_9.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libibverbs-54mlnx1-1.54103.x86_64 libnl3-3.2.28-4.el7.x86_64 libselinux-2.5-15.el7.x86_64 nss-softokn-freebl-3.53.1-6.el7_9.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64
MPT: (gdb) #0  0x00002aaaad9961d9 in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaab216a3e6 in mpi_sgi_system (
MPT: #2  MPI_SGI_stacktraceback (
MPT:     header=header@entry=0x7fffffff67d0 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 33265, Host: ldir01u09.bbp.epfl.ch, Program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/x86_64/special.nrn\n\tMPT Version: HPE HMPT 2.22 03/31/2"...) at sig.c:340
MPT: #3  0x00002aaab216a5d8 in first_arriver_handler (signo=signo@entry=11,
MPT:     stack_trace_sem=stack_trace_sem@entry=0x2aaab33e0080) at sig.c:489
MPT: #4  0x00002aaab216a8b3 in slave_sig_handler (signo=11,
MPT:     siginfo=<optimized out>, extra=<optimized out>) at sig.c:565
MPT: #5  <signal handler called>
MPT: #6  0x00002aaaabcc2cd2 in ?? ()
MPT:    from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #7  0x00002aaaabcc6614 in ?? ()
MPT:    from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #8  0x00002aaaabcb61bc in ?? ()
MPT:    from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #9  0x00002aaaabcb7cdb in ?? ()
MPT:    from /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/cuda/11.0/lib64/libcudart.so.11.0
MPT: #10 0x00002aaaab984da7 in __pgi_uacc_cuda_unregister_fat_binary (
MPT:     pgi_cuda_loc=0x2aaaaacb5a40 <__PGI_CUDA_LOC>) at ../../src/cuda_init.c:649
MPT: #11 0x00002aaaab984d46 in __pgi_uacc_cuda_unregister_fat_binaries ()
MPT:     at ../../src/cuda_init.c:635
MPT: #12 0x00002aaaae553ce9 in __run_exit_handlers () from /lib64/libc.so.6
MPT: #13 0x00002aaaae553d37 in exit () from /lib64/libc.so.6
MPT: #14 0x00002aaaab15b264 in hoc_quit () at /root/nrn/src/oc/hoc.cpp:1177
MPT: #15 0x00002aaaab1425f4 in hoc_call () at /root/nrn/src/oc/code.cpp:1389
MPT: #16 0x00002aaab3f7747e in _INTERNAL_37__root_nrn_src_nrnpython_nrnpy_hoc_cpp_629d835d::fcall () at /root/nrn/src/nrnpython/nrnpy_hoc.cpp:692
MPT: #17 0x00002aaaab0ddf35 in OcJump::fpycall ()
MPT:     at /root/nrn/src/nrniv/../ivoc/ocjump.cpp:222
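For context: with CORENRN_ENABLE_DYNAMIC_MPI=ON the MPI layer is not linked into the executable but loaded at runtime from the libcorenrnmpi_.so library. The sketch below shows the general dlopen/dlsym pattern such loading relies on; the library name, symbol name, and signature are illustrative placeholders, not CoreNEURON's actual API. The relevant point is that an OpenACC-compiled shared object pulled in this way appears to register its CUDA fat binaries with the OpenACC runtime, which later tries to unregister them from a libc exit handler (frames #10-#12 above).

```cpp
// Illustrative sketch only: the general shape of loading an MPI backend
// library at runtime. Library and symbol names are placeholders, not the
// real CoreNEURON dynamic-MPI API.
#include <dlfcn.h>
#include <cstdio>

int main() {
    // If the .so below was compiled with -acc, the OpenACC runtime inside it
    // registers CUDA fat binaries and exit-time cleanup on first use.
    void* handle = dlopen("./libcorenrnmpi_demo.so", RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    // Resolve an entry point from the dynamically loaded backend.
    auto init_fn = reinterpret_cast<void (*)(int*, char***)>(dlsym(handle, "demo_mpi_init"));
    if (!init_fn) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return 1;
    }

    // ... the simulation would run here; the reported crash happens later,
    // inside __run_exit_handlers after hoc_quit() calls exit().
    return 0;
}
```

Note that nothing in the host program itself uses OpenACC; the problematic exit handler belongs to the runtime dragged in by the shared library.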
To Reproduce
See the instructions above.
Expected behavior
With or without the -acc flag, the shared library should work fine.
System (please complete the following information)
System/OS: BB5
Compiler: NVHPC 21.2
Version: master, with the -acc flag also added to the MPI library
Backend: GPU
Additional context
We should provide a small reproducer on the NVIDIA developer forum.
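A starting point for that reproducer could be as small as the two files sketched below: a trivial OpenACC kernel compiled into a shared library, plus a driver that loads it with dlopen and then exits. This is only a sketch under the assumption that the crash comes from the OpenACC runtime's exit handler (the __pgi_uacc_cuda_unregister_fat_binaries frame in the trace) touching state owned by the shared library; the file names, the saxpy kernel, and the exact nvc++ invocations are placeholders and have not been verified to reproduce the crash.

```cpp
// acc_lib.cpp -- hypothetical shared library built with OpenACC support:
//   nvc++ -acc -fPIC -shared -o libacc_demo.so acc_lib.cpp
extern "C" void saxpy(int n, float a, const float* x, float* y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// driver.cpp -- host program without any OpenACC, mirroring how special-core
// loads the MPI library at runtime:
//   nvc++ -o driver driver.cpp -ldl
#include <dlfcn.h>
#include <cstdio>

int main() {
    void* handle = dlopen("./libacc_demo.so", RTLD_NOW);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    auto saxpy = reinterpret_cast<void (*)(int, float, const float*, float*)>(
        dlsym(handle, "saxpy"));
    if (!saxpy) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return 1;
    }

    float x[4] = {1.f, 2.f, 3.f, 4.f};
    float y[4] = {0.f, 0.f, 0.f, 0.f};
    saxpy(4, 2.0f, x, y);
    std::printf("y[0] = %g\n", y[0]);

    // Unload the OpenACC-enabled library before exiting; if the OpenACC
    // runtime's atexit handler still references data from this library,
    // exit() may then segfault, as seen in the stack trace above.
    dlclose(handle);
    return 0;
}
```

It is probably worth posting both variants, with and without the dlclose() call, since the ordering of library unloading versus the registered exit handlers may be exactly what triggers the segfault.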