NERSC GPU hackathon (Dec 2021)
This page summarises preparations for the hackathon on the 2nd/8th/9th/10th December 2021. We will use CoreNEURON+NMODL with a few sets of test model data.
NOTE: Any issues that need to be handled after the hackathon should be created in https://github.com/neuronsimulator/gpuhackathon/issues
- Olli: Install Caliper, Ninja, NMODL and NMODL's Python dependencies on Ascent.
- Pramod: Update the Ascent instructions to enable NMODL (same as NERSC).
- Olli: The channel benchmark is failing on Ascent while loading the input dataset. Should we re-generate the data on Ascent?
- Ioannis: Generate & copy the input dataset for solver analysis and optimisation during the hackathon. Add info about it to this wiki page.
- Omar: List all OpenACC directives and APIs used in CoreNEURON + NMODL. See https://github.com/neuronsimulator/gpuhackathon/blob/main/overview.md
- Olli: Common profiling/benchmarking script that can compare & plot OpenMP vs OpenACC runtimes.
First Day Presentation Preparation
- Pramod + Olli: 3 min Introduction
- Ioannis + Christos: Solver introduction to mentors
- Pramod + Omar + Olli: Compute loop + DSL code generation introduction to mentors
- Alex: Check the neuron-gpu-nightly execution on Perlmutter. Does it work with/without MPI?
- Alex: Check the NEURON + CoreNEURON (without InterViews) and NEURON + CoreNEURON + NMODL (without InterViews) builds on Perlmutter & Ascent. Do they work?
- Alex: Should we automatically disable InterViews on Cray and IBM Power systems, detected via CMake?
- Alex: Check if any improvements are needed for building from source on the hackathon systems.
- Nico + Alex: Make the usage of OpenACC APIs (`acc_memcpy_to_device` and `acc_copyin`) portable to OpenMP as well as OpenACC. In short, https://github.com/BlueBrain/CoreNeuron/blob/master/coreneuron/gpu/nrn_acc_manager.cpp should support OpenACC as well as OpenMP (see the data-management sketch after this list).
- Nico + Alex: Test first the ringtest and then the channel-benchmark for the correctness of the migration.
- Pramod: Review OpenACC/OpenMP interoperability via API calls, see e.g. https://forums.developer.nvidia.com/t/openacc-and-openmp-data-interoperability/181866.
- Omar: Add an OpenMP offload example like https://bbpcode.epfl.ch/reps/user/kumbhar/cunrnran123/tree/mainacc.c (still using CUDA Random123); the repository has moved to GitLab: https://bbpgitlab.epfl.ch/hpc/user/cunrnran123
- Omar: Try Random123 directly in OpenMP offload regions.
- Omar: Anything to discuss with / ask the mentors about Random123 portability on different GPUs?
- Add an option in CoreNEURON that only executes the Hines solver?
- Find a way to dynamically set a good value for the `nwarp` variable (used to distribute the cells).
- Ioannis + Christos: Profile a realistic model and analyse different performance metrics with the mentors to understand the limitations.
- Ioannis + Christos: With the current algorithm, investigate possible performance-improvement opportunities
- Ioannis + Christos: Performance comparison of the OpenACC vs OpenMP vs CUDA implementations
- Ioannis + Christos: Would the solver benefit from using special memory types?
- Ioannis + Christos: Look into algorithmic improvements to expose more parallelism (if suitable)
- Olli: Update the codegen backend to emit OpenMP constructs as well as OpenACC (ongoing)
- Olli + Omar: Review performance aspects with the mentors (e.g. `present` vs `map`)
- Build system changes to enable OpenMP target offload as well as OpenACC
- Olli: Try a simple OpenMP + OpenACC offload test (e.g. with the built-in ring test)
- Olli: Update the Hines solver in CoreNEURON with OpenMP (ongoing)
- Christos + Ioannis to review: https://github.com/BlueBrain/CoreNeuron/pull/693
- Confirm whether we need an OpenMP equivalent of `#pragma acc loop seq` (see the loop sketch after this list)
- Pramod: Measure performance difference between OpenMP and OpenACC runs
- Olli: In the CoreNEURON event-communication part, start tackling OpenACC pragmas and API calls that have not yet been converted to OpenMP
- Pramod: Discuss compiler bug / error with NVIDIA colleague (Brent): https://forums.developer.nvidia.com/t/enabling-openmp-offload-breaks-openacc-code/196643/6
- Pramod: Prepare a slide describing the goals: what we currently do and what we would like to do.
- Pramod: Organise/plan a discussion with one or a few NVIDIA compiler engineers. The central question is ABI compatibility across different CUDA + OpenMP + OpenACC runtimes/compilers.
- First, check/fix https://github.com/BlueBrain/CoreNeuron/issues/594
- Re-measure the slowdown from using unified memory (ringtest + channel-benchmark)
- Identify which data structures are causing the slowdown. How to identify this easily with the tools?
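For the nrn_acc_manager.cpp item above, here is a minimal data-management sketch of how the `acc_copyin`/`acc_memcpy_to_device` calls could be expressed with the OpenMP device-memory API. Assumptions: compiled with nvc++ using `-acc` (OpenACC) or `-mp=gpu` (OpenMP target offload); the function names are hypothetical, not CoreNEURON's.

```cpp
// Minimal sketch, not CoreNEURON code: the same two operations spelled
// once with the OpenACC API and once with the OpenMP device-memory API.
#include <cstddef>
#ifdef _OPENACC
#include <openacc.h>
#else
#include <omp.h>
#endif

// Allocate a device copy of a host array and fill it (acc_copyin analogue).
double* copy_to_device(double* host, std::size_t n) {
#ifdef _OPENACC
  // OpenACC: allocate + copy in one call; returns the device pointer.
  return static_cast<double*>(acc_copyin(host, n * sizeof(double)));
#else
  // OpenMP: explicit allocation + copy via the device-memory routines.
  const int dev = omp_get_default_device();
  const int host_dev = omp_get_initial_device();
  void* d = omp_target_alloc(n * sizeof(double), dev);
  omp_target_memcpy(d, host, n * sizeof(double),
                    /*dst_offset=*/0, /*src_offset=*/0, dev, host_dev);
  return static_cast<double*>(d);
#endif
}

// Refresh the device copy after host-side changes
// (acc_memcpy_to_device analogue).
void update_device(double* dev_ptr, double* host, std::size_t n) {
#ifdef _OPENACC
  acc_memcpy_to_device(dev_ptr, host, n * sizeof(double));
#else
  omp_target_memcpy(dev_ptr, host, n * sizeof(double), 0, 0,
                    omp_get_default_device(), omp_get_initial_device());
#endif
}
```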
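For the `#pragma acc loop seq` item: OpenMP has no direct `seq` clause, but a loop inside a target region runs sequentially per thread when no worksharing directive is applied to it. A minimal loop sketch (the `row_sums` example is hypothetical, not CoreNEURON code):

```cpp
// Minimal sketch: forcing an inner loop to run sequentially per thread.
// OpenACC spells this out explicitly with `loop seq`; in OpenMP target
// offload, leaving the inner loop undecorated has the same effect.
void row_sums(const double* a, double* out, int n, int m) {
#ifdef _OPENACC
  #pragma acc parallel loop copyin(a[0:n*m]) copyout(out[0:n])
  for (int i = 0; i < n; ++i) {
    double s = 0.0;
    #pragma acc loop seq   // do not partition this loop
    for (int j = 0; j < m; ++j)
      s += a[i * m + j];
    out[i] = s;
  }
#else
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n*m]) map(from: out[0:n])
  for (int i = 0; i < n; ++i) {
    double s = 0.0;
    // no directive: each thread executes this loop sequentially
    for (int j = 0; j < m; ++j)
      s += a[i * m + j];
    out[i] = s;
  }
#endif
}
```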
These are instructions to build + run on Ascent (`login1.ascent.olcf.ornl.gov`).
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH
Note that Boost is an optional dependency, but it has an apparently significant impact on both initialisation time and profiling time when using Nsight Compute.
# clone repository
cd $HOME
git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron # or: git clone --branch hackathon_main [email protected]:BlueBrain/CoreNeuron.git
cd CoreNeuron
git checkout 96498142ce60d15edbd4c5161c7153f41eda6b20 # for the mentors
git submodule update -f --init --recursive
mkdir -p build && cd build
cmake .. -G Ninja -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON -DCMAKE_INSTALL_PREFIX=../install -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0 -gpu=cuda11.0" -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ -DCORENRN_ENABLE_UNIT_TESTS=OFF # CMAKE_CXX_FLAGS is given once: passing it twice would keep only the last value
cmake --build . --parallel 8
As $HOME is not writable from the compute nodes, create your own directory inside the project directory:
mkdir -p /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER
cd /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER
Now run a tiny, functional test on the GPU:
- Allocate a node
bsub -P GEN170 -J neuron -W 90 -nnodes 1 -alloc_flags "gpudefault" -Is $SHELL
Note: `gpumps` here would enable MPS, but that is incompatible with the CUDA 11.0 version we are using here.
- Make sure the necessary modules are loaded:
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
- Run a simple functional test on the GPU:
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi
- How to generate an Nsight Systems report with the simple ringtest:
module load nvhpc/21.9
module swap cuda cuda/11.0.2
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
module load nsight-systems/2021.5.1.117
OMP_NUM_THREADS=1 jsrun -E LD_PRELOAD=/opt/ibm/spectrum_mpi/lib/pami_471/libpami.so --gpu_per_rs 1 -n 1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0,CALI_CONFIG=nvtx,OMP_NUM_THREADS=$OMP_NUM_THREADS --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation -o ring_report $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi --cell-permute 2
- How to generate a full Nsight Compute report for a single kernel with the simple ringtest:
module load nvhpc/21.9
module swap cuda cuda/11.0.2
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
module load nsight-compute/2021.2.2
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 ncu -k <kernel_name_you_want_to_analyze> -f --profile-from-start=off --target-processes all --set full -o ring_report $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi --cell-permute 2
For the purposes of the hackathon we have provided two datasets that cover the basic use cases for the coming days.
For running small tests you should use the following data set:
/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1440-cells-1-rank
This data set has a circuit of 1440 cells and is meant to be simulated with 1 MPI rank on 1 GPU.
To run CoreNEURON with this data set you may use the following command:
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 \
jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
--datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1440-cells-1-rank \
--mpi --gpu --cell-permute=2 --tstop=5
For running the simulator to gather profiling or other performance measurements you should use the following data set:
/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks
This data set resembles a production-level simulation, with a circuit large enough to fill up most of the GPU memory (~14 GB). It is meant to be executed with 6 MPI ranks or 6 threads on 1 GPU and consists of 8800 cells. Note that MPS needs to be enabled for pure MPI execution, and MPS is not currently working on Ascent with CUDA 11.0, so we need to run this with 1 rank / 6 threads / 1 GPU.
To run CoreNEURON with this data set you may use the following command:
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=6 \
jsrun -n 1 --cpu_per_rs 6 --tasks_per_rs 1 --gpu_per_rs 1 --bind rs $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
--datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks \
--mpi --gpu --cell-permute=2 --tstop=5
Note: the datasets were generated with 1k synapses/cell and 100 stimuli/cell. The numbers of synapses and stimuli are lower than in the original model.
- The system-wide modules only go up to NVHPC 21.7, which has known issues with NEURON. We have inserted our own installation of NVHPC 21.9 (module `nvidia/21.9`) into the Cray Programming Environment setup.
- Note that NVHPC 21.9 is configured to use the system GCC 7.5 standard library, while we have built some other dependencies with GCC 9.2. This seems to be "close enough".
- `-tp haswell` or `-DR123_USE_SSE=0` is required because `nvc++` defaults to `-tp zen` on the Perlmutter nodes, which defines `__ABM__` and causes Random123 to try to include `intrin.h`, which fails (see the illustrative sketch below).
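An illustrative sketch of that failure mode (a reconstruction of the explanation above, not the exact Random123 source):

```cpp
// Illustrative reconstruction, not the exact Random123 source: with
// -tp zen the compiler predefines __ABM__, Random123's feature detection
// then tries to include an intrinsics header that nvc++ on Linux does not
// provide, and compilation fails. Passing -DR123_USE_SSE=0 (or a -tp
// target that does not define __ABM__) sidesteps the offending include.
#ifndef R123_USE_SSE
#define R123_USE_SSE 1
#endif
#if R123_USE_SSE && defined(__ABM__)
#include <intrin.h> // not available with nvc++ on Linux -> build error
#endif
```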
# clone repository
git clone --branch hackathon_main [email protected]:BlueBrain/CoreNeuron.git # or git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron.git
cd CoreNeuron
git submodule update -f --init --recursive
mkdir -p build && cd build
# allocate node
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 1 --account=ntrain9_g -n 1 --cpus-per-task 6
# Use our own hand-crafted modules for cuda 11.4 (from nvhpc/21.9) and PrgEnv-nvidia for nvhpc 21.9
module use /global/cfs/cdirs/ntrain9/neuron/modules
# Also, spack-generated modules for dependencies (caliper, ninja, py-*)
module use /global/cfs/cdirs/ntrain9/neuron/spack_modules/cray-sles15-zen2
# Load modules: prefer CUDA 11.4 from NVHPC/21.9
module load cudatoolkit/21.9_11.4
module load cmake nvidia/21.9 python caliper ninja py-pytest py-pyyaml py-jinja2 py-sympy boost/1.77.0
# Build CoreNEURON
cmake .. -G Ninja \
-DCORENRN_ENABLE_CALIPER_PROFILING=ON \
-DCORENRN_ENABLE_GPU=ON \
-DCORENRN_ENABLE_NMODL=ON \
-DCORENRN_EXTERNAL_BENCHMARK_DATA=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021 \
-DCMAKE_INSTALL_PREFIX=../install \
-DCMAKE_CUDA_COMPILER=nvcc \
-DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" \
-DCMAKE_CXX_COMPILER=CC \
-DCMAKE_CUDA_ARCHITECTURES=80 \
-DCORENRN_ENABLE_UNIT_TESTS=OFF
cmake --build . --parallel
ctest --output-on-failure -j 16 # optional
If you haven't already allocated a session:
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 1 --account=ntrain9_g -n 1 --cpus-per-task 6
Note that we set `NVCOMPILER_ACC_SYNCHRONOUS=1` below (if necessary) so that we get correct timings for individual kernels; otherwise kernel timings are incorrect because kernels are launched asynchronously.
# NVCOMPILER_ACC_SYNCHRONOUS=1  # uncomment if necessary
OMP_NUM_THREADS=6 \
CALI_CONFIG=runtime-report,calc.inclusive \
srun -n 1 benchmark/x86_64/special-core \
--datpath=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks \
--mpi --gpu --cell-permute=2 --tstop=100
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 de4e433 (2021-11-26 08:50:58 +0100)
Additional mechanisms from files
Ca.mod CaDynamics_DC0.mod CaDynamics_E2.mod Ca_HVA.mod Ca_HVA2.mod Ca_LVAst.mod CoreConfig.mod Ih.mod Im.mod K_Pst.mod K_Tst.mod KdShu2007.mod NaTa_t.mod NaTg.mod NaTs2_t.mod Nap_Et2.mod ProbAMPANMDA_EMS.mod ProbGABAAB_EMS.mod ProfileHelper.mod SK_E2.mod SKv3_1.mod TTXDynamicsSwitch.mod VecStim.mod cacumm.mod cacummb.mod cagk.mod cal2.mod can2.mod cat.mod exp2syn.mod expsyn.mod gap.mod h.mod halfgap.mod hh.mod kadist.mod kaprox.mod kca.mod kd.mod kd2.mod kdb.mod kdrbca1.mod kdrca1.mod km.mod kmb.mod na3n.mod naxn.mod netstim.mod netstim_inhpoisson.mod new_calcium_channels.mod passive.mod pattern.mod stim.mod svclmp.mod
num_mpi=6
num_omp_thread=1
....
Solver Time : 16.2698
Simulation Statistics
Number of cells: 1320
Number of compartments: 648360
Number of presyns: 3961320
Number of input presyns: 0
Number of synapses: 3960002
Number of point processes: 7921322
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 11969
Number of spikes with non negative gid-s: 11969
Path Min time/rank Max time/rank Avg time/rank Time %
main 40.376340 40.378758 40.377549 99.399072
checkpoint 0.000001 0.000002 0.000002 0.000004
output-spike 0.004200 0.004211 0.004205 0.010353
simulation 16.269795 16.269813 16.269804 40.052045
spike-exchange 0.019603 0.041487 0.030545 0.075194
spike-exchange 0.019578 0.041452 0.030515 0.075120
communication 0.000886 0.000906 0.000896 0.002206
imbalance 0.018654 0.040551 0.029603 0.072874
timestep 16.225238 16.247191 16.236214 39.969356
state-update 4.167419 4.176959 4.172189 10.270849
state-SKv3_1 0.084673 0.086011 0.085342 0.210090
state-SK_E2 0.086975 0.087625 0.087300 0.214910
state-ProbGABAAB_EMS 0.176391 0.182205 0.179298 0.441385
state-ProbAMPANMDA_EMS 0.512746 0.521091 0.516919 1.272520
state-nax 0.135361 0.136370 0.135866 0.334466
state-NaTg 0.121694 0.121713 0.121704 0.299603
state-Nap_Et2 0.091850 0.092020 0.091935 0.226320
state-na3 0.126686 0.127354 0.127020 0.312690
state-K_Tst 0.083087 0.083141 0.083114 0.204605
state-K_Pst 0.086288 0.086488 0.086388 0.212665
state-kmb 0.078860 0.079250 0.079055 0.194613
state-KdShu2007 0.077877 0.078125 0.078001 0.192018
state-kdr 0.094169 0.095289 0.094729 0.233198
state-kdrb 0.092489 0.092745 0.092617 0.227999
state-kdb 0.089317 0.089467 0.089392 0.220060
state-kd2 0.087586 0.092668 0.090127 0.221869
state-kca 0.363553 0.364731 0.364142 0.896423
state-kap 0.109070 0.109326 0.109198 0.268817
state-kad 0.112513 0.113974 0.113244 0.278776
state-Ih 0.098592 0.100036 0.099314 0.244485
state-hd 0.093146 0.094326 0.093736 0.230754
state-cat 0.121305 0.121554 0.121430 0.298928
state-can 0.121067 0.122852 0.121960 0.300233
state-Ca_LVAst 0.129896 0.130772 0.130334 0.320849
state-cal 0.108788 0.109597 0.109192 0.268804
state-Ca_HVA2 0.142881 0.143095 0.142988 0.351999
state-cagk 0.097391 0.098042 0.097716 0.240553
state-cacum 0.234768 0.235131 0.234950 0.578385
state-cacumb 0.175342 0.176663 0.176002 0.433273
state-IClamp 0.002348 0.002558 0.002453 0.006039
state-CaDynamics_DC0 0.113226 0.113724 0.113475 0.279346
state-pas 0.002896 0.003032 0.002964 0.007297
update 0.116486 0.117232 0.116859 0.287677
second-order-cur 0.001943 0.002165 0.002054 0.005056
matrix-solver 6.071176 6.123533 6.097355 15.010108
setup-tree-matrix 4.745456 4.748895 4.747175 11.686317
cur-SKv3_1 0.104343 0.105509 0.104926 0.258301
cur-SK_E2 0.097890 0.098530 0.098210 0.241768
cur-ProbGABAAB_EMS 0.252629 0.255678 0.254154 0.625660
cur-ProbAMPANMDA_EMS 0.903469 0.903876 0.903672 2.224608
cur-nax 0.113687 0.115076 0.114382 0.281578
cur-NaTg 0.113117 0.113994 0.113556 0.279544
cur-Nap_Et2 0.084684 0.084767 0.084725 0.208572
cur-na3 0.095460 0.096301 0.095880 0.236033
cur-K_Tst 0.076803 0.076846 0.076825 0.189122
cur-K_Pst 0.083091 0.083296 0.083193 0.204801
cur-kmb 0.084998 0.085193 0.085095 0.209483
cur-KdShu2007 0.079360 0.079602 0.079481 0.195662
cur-kdr 0.084818 0.085221 0.085019 0.209296
cur-kdrb 0.083571 0.083837 0.083704 0.206058
cur-kdb 0.074649 0.075234 0.074942 0.184487
cur-kd2 0.074704 0.075905 0.075305 0.185380
cur-kca 0.102095 0.102144 0.102119 0.251392
cur-kap 0.093734 0.094259 0.093996 0.231395
cur-kad 0.096892 0.096943 0.096917 0.238586
cur-Ih 0.099069 0.099392 0.099230 0.244280
cur-hd 0.079998 0.080359 0.080179 0.197379
cur-cat 0.103624 0.103911 0.103767 0.255449
cur-can 0.118858 0.119161 0.119010 0.292971
cur-Ca_LVAst 0.123353 0.126097 0.124725 0.307041
cur-cal 0.122882 0.123123 0.123003 0.302800
cur-Ca_HVA2 0.142775 0.143094 0.142934 0.351868
cur-cagk 0.122285 0.123171 0.122728 0.302125
cur-cacum 0.081801 0.082285 0.082043 0.201969
cur-cacumb 0.074288 0.075040 0.074664 0.183803
cur-IClamp 0.085479 0.088503 0.086991 0.214149
cur-CaDynamics_DC0 0.064449 0.064505 0.064477 0.158726
cur-ttx_ion 0.063089 0.063417 0.063253 0.155713
cur-ca_ion 0.100195 0.100529 0.100362 0.247065
cur-k_ion 0.063816 0.064154 0.063985 0.157515
cur-na_ion 0.066107 0.066272 0.066190 0.162941
cur-pas 0.109251 0.109293 0.109272 0.268999
deliver-events 0.982300 1.005465 0.993883 2.446681
net-receive-ProbGABAAB_EMS 0.000950 0.001073 0.001012 0.002490
net-receive-ProbAMPANMDA_EMS 0.003695 0.003919 0.003807 0.009372
net-buf-receive-ExpSyn 0.003570 0.003671 0.003621 0.008913
net-buf-receive-Exp2Syn 0.003701 0.003835 0.003768 0.009276
net-buf-receive-ProbGABAAB_EMS 0.068577 0.070341 0.069459 0.170990
net-buf-receive-ProbAMPANMDA_EMS 0.090366 0.094045 0.092206 0.226986
update-net-receive-buf 0.399891 0.406366 0.403129 0.992398
net-receive-buf-cpu2gpu 0.379896 0.385496 0.382696 0.942098
net-receive-buf-order 0.003375 0.003533 0.003454 0.008503
check-threshold 0.182114 0.184026 0.183070 0.450671
finitialize 2.094235 2.094272 2.094254 5.155510
spike-exchange 0.000035 0.105358 0.052696 0.129725
spike-exchange 0.000031 0.105354 0.052693 0.129715
communication 0.000019 0.000022 0.000020 0.000050
imbalance 0.000006 0.105331 0.052669 0.129656
cur-SKv3_1 0.000031 0.000032 0.000031 0.000078
cur-SK_E2 0.000029 0.000031 0.000030 0.000074
cur-ProbGABAAB_EMS 0.000072 0.000073 0.000073 0.000178
cur-ProbAMPANMDA_EMS 0.000265 0.000267 0.000266 0.000655
cur-nax 0.000034 0.000034 0.000034 0.000084
cur-NaTg 0.000032 0.000034 0.000033 0.000081
cur-Nap_Et2 0.000025 0.000026 0.000025 0.000063
cur-na3 0.000029 0.000030 0.000029 0.000073
cur-K_Tst 0.000023 0.000024 0.000024 0.000058
cur-K_Pst 0.000025 0.000026 0.000025 0.000063
cur-kmb 0.000026 0.000027 0.000027 0.000065
cur-KdShu2007 0.000024 0.000024 0.000024 0.000059
cur-kdr 0.000026 0.000026 0.000026 0.000064
cur-kdrb 0.000025 0.000025 0.000025 0.000062
cur-kdb 0.000023 0.000023 0.000023 0.000057
cur-kd2 0.000024 0.000024 0.000024 0.000059
cur-kca 0.000031 0.000032 0.000031 0.000078
cur-kap 0.000027 0.000028 0.000027 0.000068
cur-kad 0.000028 0.000030 0.000029 0.000071
cur-Ih 0.000029 0.000030 0.000029 0.000073
cur-hd 0.000024 0.000025 0.000024 0.000060
cur-cat 0.000030 0.000032 0.000031 0.000076
cur-can 0.000035 0.000035 0.000035 0.000086
cur-Ca_LVAst 0.000035 0.000036 0.000035 0.000087
cur-cal 0.000036 0.000037 0.000036 0.000090
cur-Ca_HVA2 0.000039 0.000041 0.000040 0.000098
cur-cagk 0.000037 0.000038 0.000037 0.000092
cur-cacum 0.000025 0.000026 0.000025 0.000063
cur-cacumb 0.000024 0.000025 0.000024 0.000060
cur-IClamp 0.000027 0.000028 0.000027 0.000068
cur-CaDynamics_DC0 0.000021 0.000022 0.000022 0.000053
cur-ttx_ion 0.000026 0.000026 0.000026 0.000064
cur-ca_ion 0.000028 0.000029 0.000029 0.000070
cur-k_ion 0.000018 0.000019 0.000018 0.000046
cur-na_ion 0.000022 0.000023 0.000022 0.000055
cur-pas 0.000034 0.000034 0.000034 0.000084
update-net-receive-buf 0.000021 0.000023 0.000022 0.000054
load-model 21.857876 21.860913 21.859395 53.812170
To produce a useful profile with Nsight Systems we need to configure Caliper to emit NVTX markers (`CALI_CONFIG=nvtx`) and tell Nsight Systems to record regions whose names are not registered strings (`NSYS_NVTX_PROFILER_REGISTER_ONLY=0`).
To avoid profiling model initialisation and setup, you may want to record only the actual simulation (`--capture-range=nvtx --nvtx-capture=simulation`).
Additionally, Nsight Systems seems to have trouble profiling multiple OpenMP host threads launching GPU kernels at once, so you may want to disable that (`OMP_NUM_THREADS=1`).
Taken together, an example prefix could be:
CALI_CONFIG=nvtx OMP_NUM_THREADS=1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0 --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation ./x86_64/special-core ...
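For context, the named regions that Nsight Systems captures (e.g. `simulation`, `timestep` in the report above) come from Caliper annotations in the code; with `CALI_CONFIG=nvtx` they are emitted as NVTX ranges. A minimal sketch of such an annotation (hypothetical example, not the actual CoreNEURON instrumentation):

```cpp
// Minimal Caliper annotation sketch (hypothetical, not CoreNEURON's
// actual instrumentation). With CALI_CONFIG=nvtx these regions become
// NVTX ranges that nsys can capture via --nvtx-capture.
#include <caliper/cali.h>

void simulate(int nsteps) {
  CALI_MARK_BEGIN("simulation");   // matches --nvtx-capture=simulation
  for (int step = 0; step < nsteps; ++step) {
    CALI_MARK_BEGIN("timestep");
    // ... advance states, solve the matrix, deliver events ...
    CALI_MARK_END("timestep");
  }
  CALI_MARK_END("simulation");
}
```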
Some other notes:
- During development, just use the simple ring test (for quick iteration).
- Install nmodl master into the project space `$CFS/ntrain9/neuron` so that mentors and other people can just use a standard version for profiling or other non-codegen-related tasks.
- Repository and branch with lots of tests of OpenACC and OpenMP interoperability: https://github.com/iomaganaris/openMP_GPU/tree/magkanar/example. See also the code mentioned in https://forums.developer.nvidia.com/t/openacc-and-openmp-data-interoperability/181866/2 (a minimal interop sketch follows at the end of this page).
- With the hackathon_main branch, if we build with `-DCMAKE_BUILD_TYPE=Debug`, `ring_permute2_TEST` and `ring_gap_permute2_TEST` fail.
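Finally, a minimal interoperability sketch in the spirit of the tests and forum thread linked above (the `scale_on_gpu` example is hypothetical; whether the OpenACC and OpenMP runtimes share device allocations like this is exactly the question under investigation):

```cpp
// Minimal interop test sketch: create a device copy with the OpenACC API,
// then hand the raw device pointer to an OpenMP target region via
// is_device_ptr. Assumes both models sit on one shared CUDA runtime,
// which is the interoperability question being discussed.
// Build e.g. with: nvc++ -acc -mp=gpu interop.cpp
#include <cstddef>
#include <openacc.h>

void scale_on_gpu(double* host, std::size_t n, double factor) {
  // OpenACC side: allocate + copy to the device.
  double* dev = static_cast<double*>(acc_copyin(host, n * sizeof(double)));

  // OpenMP side: operate on the OpenACC-managed device memory.
  #pragma omp target teams distribute parallel for is_device_ptr(dev)
  for (std::size_t i = 0; i < n; ++i)
    dev[i] *= factor;

  // OpenACC side: copy the result back and release the device memory.
  acc_copyout(host, n * sizeof(double));
}
```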