Extend POINTER transfer to any RANGE variable in a NRN_THREAD #772
Trajectory recording is after AFTER_SOLVE and, for consistency with NEURON, after BEFORE_STEP as well.
The nrn/test/coreneuron/mod/axial.inc file, which is used in axial.mod and AxialPP.mod, suffers from a race condition on GPU or in vectorized loops at the statement
where pim is a POINTER to im in some other instance of axial or AxialPP (located in the parent compartment).
which surrounds the statement with MUTEXLOCK and MUTEXUNLOCK but that is only for pthreads. Another case is #768.
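The statement itself is not reproduced in this conversation. As a rough illustration of this class of race, here is a minimal sketch in which several child instances add a contribution to a slot in their parent through an index (standing in for the POINTER). All names (`contrib`, `parent_index`, `accumulate_serial`, `accumulate_lanewise`) are hypothetical and are not taken from axial.inc or CoreNEURON. The second function emulates what a vectorized or GPU loop may effectively do: every lane reads the old value before any lane writes, so updates to a shared parent are lost.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration only: child instance i adds contrib[i] to the
// parent slot selected by parent_index[i] (the role a POINTER would play).

// Serial semantics: each update sees the result of the previous one.
std::vector<double> accumulate_serial(const std::vector<double>& contrib,
                                      const std::vector<std::size_t>& parent_index,
                                      std::size_t n_parent) {
    std::vector<double> parent(n_parent, 0.0);
    for (std::size_t i = 0; i < contrib.size(); ++i) {
        parent[parent_index[i]] += contrib[i];
    }
    return parent;
}

// Emulated lane-wise semantics: all lanes gather the old value first, then
// all lanes scatter. Two children sharing a parent overwrite each other.
std::vector<double> accumulate_lanewise(const std::vector<double>& contrib,
                                        const std::vector<std::size_t>& parent_index,
                                        std::size_t n_parent) {
    std::vector<double> parent(n_parent, 0.0);
    std::vector<double> gathered(contrib.size());
    for (std::size_t i = 0; i < contrib.size(); ++i) {
        gathered[i] = parent[parent_index[i]];  // read phase
    }
    for (std::size_t i = 0; i < contrib.size(); ++i) {
        parent[parent_index[i]] = gathered[i] + contrib[i];  // write phase
    }
    return parent;
}
```

With contributions {1, 2, 3} and parent indices {0, 0, 1}, the serial version yields {3, 3} while the lane-wise emulation yields {2, 3}: the first child's update to parent 0 is lost. A pthread mutex does not help in that setting, which is presumably why an atomic update or a reduction would be needed instead.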
Codecov Report
Coverage Diff (master vs. #772)
            master    #772      +/-
Coverage    56.01%    55.45%    -0.57%
Files       108       108
Lines       9005      9107      +102
Hits        5044      5050      +6
Misses      3961      4057      +96
I believe merging
I merged
Sorry, this needs
As mentioned in neuronsimulator/nrn#1622 (comment) I'd like to explicitly support the handling of multiple BEFORE SETUP blocks in a single mod file. Although not really relevant to the POINTER topic of this pull request, the new checkpoint test on the NEURON side is the easiest way to test multiple BEFORE SETUP support, and that support is likely to require some minor code changes on the CoreNEURON side as well. So unless we can get this PR merged into master without too much delay, I can make the changes here ...
@nrnhines : I am going to review and fix the failing tests under GitLab today. Maybe it would be better to start a new branch from this branch?
@nrnhines : GitLab CI is failing with the following error. Is this expected? Otherwise I will take a look:
That is not supposed to fail. It seems stderr and stdout need to be printed.
Ok thanks. I am also not able to reproduce it on my local machine. I will check the CI build tomorrow morning.
@pramodk My attempt to get more information with neuronsimulator/nrn@f7be90f wasn't fruitful as there was no stderr/stdout text. So all we know is that there is a segfault.
Sorry for the delay @nrnhines! I didn't get time earlier today to look into this. I haven't debugged thoroughly, but I was at least able to quickly reproduce the issue using the binaries and datasets created in CI. It seems to be related to our reportinglib library linked to CoreNEURON. Here is what I did:
# copy failed test directory
cp -r /gpfs/bbp.cscs.ch/ssd/gitlab_map_jobs/bbpcihpcproj12/P41880/J179059/spack-build/spack-stage-neuron-develop-47kwkdqbfahzhd6mgo3d7lvidjenlcwo/spack-build-47kwkdq/test/coreneuron_modtests/test_pointer_py_cpu .
cd test_pointer_py_cpu/
# for testing, allocate some cpus
$ salloc -A proj16 -N 1 --constraint=cpu -n 2 -p prod
# see segfault
kumbhar@r1i7n20:~/tmp/test_pointer_py_cpu$ ./x86_64/special-core -d coredat/
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 c4f1e5bc (2022-03-07 16:58:53 +0100)
Additional mechanisms from files
axial.mod axial_pp.mod bacur.mod banocur.mod exp2syn.mod expsyn.mod fornetcon.mod hh.mod invlfire.mod natrans.mod netmove.mod netstim.mod passive.mod pattern.mod sample.mod stim.mod svclmp.mod unitstest.mod watchrange.mod
Memory (MBs) : After mk_mech : Max 10.0000, Min 10.0000, Avg 10.0000
Memory (MBs) : After MPI_Init : Max 10.0000, Min 10.0000, Avg 10.0000
Memory (MBs) : Before nrn_setup : Max 10.1172, Min 10.1172, Avg 10.1172
Setup Done : 0.00 seconds
Model size : 35.18 kB
Memory (MBs) : After nrn_setup : Max 10.4375, Min 10.4375, Avg 10.4375
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=false
--dt=0.025
--tstop=100
GPU
--nwarp=65536
--cell-permute=0
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=coredat/
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 10.4375, Min 10.4375, Avg 10.4375
Memory (MBs) : After nrn_finitialize : Max 10.4375, Min 10.4375, Avg 10.4375
Segmentation fault
# gdb says it's report related
$ gdb --args ./x86_64/special-core -d coredat/
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /gpfs/bbp.cscs.ch/ssd/gitlab_map_jobs/bbpcihpcproj12/P41880/J179059/spack-build/spack-stage-neuron-develop-47kwkdqbfahzhd6mgo3d7lvidjenlcwo/spack-build-47kwkdq/test/nrnivmodl/f37a8662f1006c013843754879ab3cc44ed227d607809d6e2bc1806460d64447/x86_64/special-core...done.
(gdb) r
Starting program: /gpfs/bbp.cscs.ch/home/kumbhar/tmp/test_pointer_py_cpu/./x86_64/special-core -d coredat/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_applications/install_intel-2021.4.0-skylake/libsonata-report-1.1-nfrzrl/lib/libsonatareport.so]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-suikmu/lib64/libstdc++.so.6]
Dwarf Error: wrong version in compilation unit header (is 5, should be 2, 3, or 4) [in module /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-suikmu/lib64/libgcc_s.so.1]
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 c4f1e5bc (2022-03-07 16:58:53 +0100)
...
Memory (MBs) : After nrn_finitialize : Max 10.4453, Min 10.4453, Avg 10.4453
Program received signal SIGSEGV, Segmentation fault.
MPI_SGI_comm_rank (comm=1140850688) at ../../../../include/comm.h:216
216 ../../../../include/comm.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-325.el7_9.x86_64
(gdb) bt
#0 MPI_SGI_comm_rank (comm=1140850688) at ../../../../include/comm.h:216
#1 PMPI_Comm_rank (comm=1140850688, rank=0x7fffffff4240) at comm_rank.c:93
#2 0x00007fffeda707ec in AllReports::makeGlobalCommunicator (this=0x7fffedb05908 <_rtld_local+2312>) at /nvme/bbpcihpcdeploy/160693/spack-stage/spack-stage-reportinglib-2.5.6-gdhqypawxbwwjg2iq3g6gd6r6q3civat/spack-src/reportinglib/AllReports.cpp:474
#3 0x00007fffed7591c6 in coreneuron::setup_report_engine (dt_report=2147483647, mindelay=10) at ../spack-src/coreneuron/io/reports/nrnreport.cpp:57
#4 0x00007fffed6d0078 in run_solve_core (argc=3, argv=0x7fffffff45f8) at ../spack-src/coreneuron/apps/main1.cpp:609
#5 0x00007fffedadf702 in solve_core (argc=3, argv=0x7fffffff45f8) at ../../../../../../../software/install_intel-2021.4.0-skylake/coreneuron-develop-hsyhen/share/coreneuron/enginemech.cpp:49
#6 0x0000000000403293 in main (argc=3, argv=0x7fffffff45f8) at /gpfs/bbp.cscs.ch/ssd/gitlab_map_jobs/bbpcihpcproj12/P41880/software/install_intel-2021.4.0-skylake/coreneuron-develop-hsyhen/share/coreneuron/coreneuron.cpp:14
(gdb) quit
# enabling MPI doesn't solve the issue completely
$ srun -n 1 ./x86_64/special-core -d coredat/ --mpi
num_mpi=1
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 c4f1e5bc (2022-03-07 16:58:53 +0100)
Additional mechanisms from files
axial.mod axial_pp.mod bacur.mod banocur.mod exp2syn.mod expsyn.mod fornetcon.mod hh.mod invlfire.mod natrans.mod netmove.mod netstim.mod passive.mod pattern.mod sample.mod stim.mod svclmp.mod unitstest.mod watchrange.mod
Memory (MBs) : After mk_mech : Max 12.5664, Min 12.5664, Avg 12.5664
Memory (MBs) : After MPI_Init : Max 12.5664, Min 12.5664, Avg 12.5664
Memory (MBs) : Before nrn_setup : Max 12.8008, Min 12.8008, Avg 12.8008
Setup Done : 0.00 seconds
Model size : 35.18 kB
Memory (MBs) : After nrn_setup : Max 13.1133, Min 13.1133, Avg 13.1133
GENERAL PARAMETERS
--mpi=true
--mpi-lib=
--gpu=false
--dt=0.025
--tstop=100
GPU
--nwarp=65536
--cell-permute=0
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=coredat/
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 13.1133, Min 13.1133, Avg 13.1133
Memory (MBs) : After nrn_finitialize : Max 13.1133, Min 13.1133, Avg 13.1133
[REPORTS] [info] :: Initializing PARALLEL implementation...
psolve |=========================================================| t: 100.00 ETA: 0h00m01s
Solver Time : 0.127889
Simulation Statistics
Number of cells: 5
Number of compartments: 163
Number of presyns: 46
Number of input presyns: 0
Number of synapses: 0
Number of point processes: 46
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 330
Number of spikes with non negative gid-s: 330
terminate called after throwing an instance of 'std::runtime_error'
what(): Error: node_id is 0 and input data is reported as 1-based
MPT ERROR: Rank 0(g:0) received signal SIGABRT/SIGIOT(6).
Process ID: 259074, Host: r1i7n20, Program: /gpfs/bbp.cscs.ch/ssd/gitlab_map_jobs/bbpcihpcproj12/P41880/J179059/spack-build/spack-stage-neuron-develop-47kwkdqbfahzhd6mgo3d7lvidjenlcwo/spack-build-47kwkdq/test/nrnivmodl/f37a8662f1006c013843754879ab3cc44ed227d607809d6e2bc1806460d64447/x86_64/special-core
MPT Version: HPE HMPT 2.25 10/22/21 03:18:39
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/259074/exe, process 259074
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab17b1d9 in waitpid () from /lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: debuginfo-install glibc-2.17-325.el7_9.x86_64 libibverbs-54mlnx1-1.54103.x86_64 libnl3-3.2.28-4.el7.x86_64
MPT: (gdb) #0 0x00002aaaab17b1d9 in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaaab4be566 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffff2ad0 "MPT ERROR: Rank 0(g:0) received signal SIGABRT/SIGIOT(6).\n\tProcess ID: 259074, Host: r1i7n20, Program: /gpfs/bbp.cscs.ch/ssd/gitlab_map_jobs/bbpcihpcproj12/P41880/J179059/spack-build/spack-stage-neuro"...) at sig.c:340
MPT: #3 0x00002aaaab4be75f in first_arriver_handler (signo=signo@entry=6,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaab09a0080) at sig.c:489
MPT: #4 0x00002aaaab4bea33 in slave_sig_handler (signo=6, siginfo=<optimized out>,
MPT: extra=<optimized out>) at sig.c:565
MPT: #5 <signal handler called>
MPT: #6 0x00002aaaabd05387 in raise () from /lib64/libc.so.6
MPT: #7 0x00002aaaabd06a78 in abort () from /lib64/libc.so.6
MPT: #8 0x00002aaaab85e88a in __gnu_cxx::__verbose_terminate_handler() [clone .cold] ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-suikmu/lib64/libstdc++.so.6
MPT: #9 0x00002aaaab86a2fa in __cxxabiv1::__terminate(void (*)()) ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-suikmu/lib64/libstdc++.so.6
MPT: #10 0x00002aaaab86a365 in std::terminate() ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-suikmu/lib64/libstdc++.so.6
MPT: #11 0x00002aaaab86a5f9 in __cxa_throw ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_compilers/install_gcc-4.8.5-haswell/gcc-11.2.0-suikmu/lib64/libstdc++.so.6
MPT: #12 0x00002aaaaac096f1 in bbp::sonata::SonataData::convert_gids_to_sonata(std::vector<unsigned long, std::allocator<unsigned long> >&, unsigned long) ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_applications/install_intel-2021.4.0-skylake/libsonata-report-1.1-nfrzrl/lib/libsonatareport.so
MPT: #13 0x00002aaaaac09ecb in bbp::sonata::SonataData::write_spikes_header(bbp::sonata::Population&) ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_applications/install_intel-2021.4.0-skylake/libsonata-report-1.1-nfrzrl/lib/libsonatareport.so
MPT: #14 0x00002aaaaac0973a in bbp::sonata::SonataData::write_spike_populations() ()
MPT: from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_applications/install_intel-2021.4.0-skylake/libsonata-report-1.1-nfrzrl/lib/libsonatareport.so
MPT: #15 0x00002aaaaafb43b1 in _INTERNALb49cba43::coreneuron::output_spikes_parallel
MPT: (outpath=0x7fffffff43d0 ".", filename=0x2aaaab073968 "out",
MPT: population_name_offset=std::vector of length 0, capacity 0)
MPT: at ../spack-src/coreneuron/io/output_spikes.cpp:216
MPT: #16 0x00002aaaaafb4b1f in coreneuron::output_spikes (
MPT: outpath=0x7fffffff43d0 ".",
MPT: population_name_offset=std::vector of length 0, capacity 0)
MPT: at ../spack-src/coreneuron/io/output_spikes.cpp:292
MPT: #17 0x00002aaaaaf59237 in run_solve_core (argc=4, argv=0x7fffffff4608)
MPT: at ../spack-src/coreneuron/apps/main1.cpp:648
MPT: #18 0x00002aaaaaadc702 in solve_core (argc=4, argv=0x7fffffff4608)
MPT: at ../../../../../../../software/install_intel-2021.4.0-skylake/coreneuron-develop-hsyhen/share/coreneuron/enginemech.cpp:49
MPT: #19 0x0000000000403293 in main (argc=4, argv=0x7fffffff4608)
MPT: at /gpfs/bbp.cscs.ch/ssd/gitlab_map_jobs/bbpcihpcproj12/P41880/software/install_intel-2021.4.0-skylake/coreneuron-develop-hsyhen/share/coreneuron/coreneuron.cpp:14
MPT: (gdb) A debugging session is active.
We now know where to look, so tomorrow we should be able to track this down better on our side. cc: @jorblancoa @olupton
Doesn't seem like a segfault. However maybe "5" is numerologically significant as this PR bumps the write version from 1.4 to 1.5 :)
I think this warning/error is not relevant; it's complaining because the GDB version is too old and doesn't support the DWARF version used in the binary. Loading a newer GDB module removes this message.
I think this is related to neuronsimulator/nrn#1619
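The first backtrace above shows the crash inside `coreneuron::setup_report_engine`, reached even though `--report-conf=` was empty and MPI was disabled, and the commit summary for this PR lists "Initialize reporting interface only if there are reports". A minimal sketch of that kind of guard follows. The types and function names (`ReportConfig`, `ReportEngine`, `setup_reports_if_needed`) are hypothetical stand-ins, not the CoreNEURON or reportinglib API.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one parsed report request.
struct ReportConfig {
    std::string name;
};

// Hypothetical stand-in for the reporting backend; it only counts how often
// initialization was requested so the guard's effect can be observed.
struct ReportEngine {
    int init_calls = 0;
    void initialize() { ++init_calls; }
};

// Initialize the reporting backend only when at least one report was
// requested. Returns true if initialization happened.
bool setup_reports_if_needed(ReportEngine& engine,
                             const std::vector<ReportConfig>& reports) {
    if (reports.empty()) {
        return false;  // nothing to report: skip backend (and MPI) setup
    }
    engine.initialize();
    return true;
}
```

With an empty report list the backend is never touched, so a run without `--mpi` and without reports would not reach any MPI calls made during report setup.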
hooray!
LGTM. I don't have major comments, but reading the code related to permutation reminds me how critical it is to simplify the implementation by using the same data structures in NEURON and CoreNEURON. CoreNEURON should just handle the compute aspects!
After the 8.1 release, we should definitely revive our summer discussions and continue with the major refactoring we were discussing (including C++ migration PRs like neuronsimulator/nrn/pull/1597).
LGTM
Just a single suggestion
…ain/CoreNeuron#772)
* Extend POINTER from voltage to any RANGE variable.
* trajectory recording is after AFTER_SOLVE and for consistency with NEURON, after BEFORE_STEP as well.
* update coreneuron ringtest integration data to version 1.5
* Handle the checkpoint for the POINTER
* Initialize reporting interface only if there are reports
CoreNEURON Repo SHA: BlueBrain/CoreNeuron@f72026d
Prior to this, POINTER was restricted to point to voltage.
This change depends on neuronsimulator/nrn#1622
Requires bbcore_write_version 1.5
The added test on the NEURON side requires merge of #748
Many CI tests fail because file mode test data has not been updated to bbcore_write_version 1.5. I need help or instructions on how to update that test data. Edit: was updated.
See neuronsimulator/nrn for test.
CI_BRANCHES:NEURON_BRANCH=hines/POINTER-to-RANGE,