Skip to content
This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

NERSC GPU hackathon (Dec 2021)

Pramod Kumbhar edited this page Dec 10, 2021 · 84 revisions

NERSC GPU hackathon (Dec 2021)

This page summarises preparations for the hackathon on the 2nd/8th/9th/10th December 2021. We will use CoreNEURON+NMODL with a few sets of test model data.

NOTE If you see any issues that we need to be handled after hackathon should be created in https://github.com/neuronsimulator/gpuhackathon/issues

Preparatory Tasks

  • Olli - Install Caliper, Ninja, NMODL and NMODL's Python dependencies on Ascent.

  • Pramod - Update Ascent instructions to enable NMODL (same as NERSC).

  • Olli - Channel benchmark failing on Ascent during load input dataset. Should we re-generate data on Ascent?

  • Ioannis - Generate & copy input dataset for Solver analysis and optimisation during Hackathon. Add info on this wiki page for the same.

  • Omar - List of all OpenACC directives and APIs used in the CoreNEURON + NMODL. See https://github.com/neuronsimulator/gpuhackathon/blob/main/overview.md

  • Olli - Common profiling/benchmarking script which could compare & plot OpenMP vs OpenACC runtimes

  • First Day Presentation Preparation

    • Pramod + Olli: 3 min Introduction
    • Ioannis + Christos: Solver introduction to mentors
    • Pramod + Omar + Olli: Compute loop + DSL code generation introduction to mentors

Hackathon Tasks

NEURON GPU Enabled Builds on systems like Perlmutter and Ascent

  • Alex: Check the neuron-gpu-nightly execution on Perlmutter. Do they work with/without MPI?
  • Alex: NEURON + CoreNEURON - Interviews and NEURON + CoreNEURON + NMODL - Interviews build on Perlmutter & Ascent. Do they work?
    • Alex: Should we automatically disable Interviews on the Cray systems and IBM Power systems by detection via CMake?
  • Alex: Check if there is any improvement needed for building from source on Hackathon systems.

OpenMP migration of CoreNEURON NrnThread / model data transfer

Random123 and OpenMP portability

Hines Solver Analysis and Optimisation

  • Add an option in CoreNEURON that only executes Hines solver ?
  • Find out a way to dynamically set a good number for the nwarp variable (used to distribute the cells)
  • Ioannis + Christos: Profiling of a realistic model and analyse different performance metrics with mentors to understand the limitations.
  • Ioannis + Christos: With current algorithm, investigate possible performance improvement opportunities
  • Ioannis + Christos: Performance comparison of OpenACC vs OpenMP vs CUDA implementation
  • Ioannis + Christos: Would benefit from special memory types usage?
  • Ioannis + Christos: Look into algorithmic improvements to expose more parallelism (if suitable)

NMODL Code Generation & OpenMP Migration

CoreNEURON OpenMP migration

  • Build system changes to enable OpenMP target offload as well as OpenACC
  • Olli: Try simple OpenMP + OpenACC offload test (e.g. with inbuilt ring test)
  • Olli: Update Hines Solver in coreneuron with OpenMP (ongoing)
  • Pramod: Measure performance difference between OpenMP and OpenACC runs
  • Olli: Under coreneuron event communication part, start tackling OpenACC pragmas or API calls that are not converted to OpenMP

Eigen Compatibility Issues

Portable wheels

  • Pramod: Prepare a slide describing the goals - what currently we do and what we would like to do?
  • Pramod: Organise / plan a discussion with a / few nvidia compiler engineers. Central question is - ABI compatibility across different CUDA + OpenMP + OpenACC runtimes/compilers.

Unified memory performance : optional / low-priority

  • First, check/fix https://github.com/BlueBrain/CoreNeuron/issues/594
  • Re-measure the slowdown from using unified memory (ringtest + channel-benchmark)
  • Identify which data structures are causing the slowdown. How to identify this easily with the tools?

Ascent

These are instructions to build + run on Ascent (login1.ascent.olcf.ornl.gov).

module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH

Note that Boost is an optional dependency, but it has an apparently-significant impact on both initialisation time and profiling time when using NSight Compute.

# clone repository
cd $HOME
git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron # or: git clone --branch hackathon_main [email protected]:BlueBrain/CoreNeuron.git
cd CoreNeuron
git checkout 96498142ce60d15edbd4c5161c7153f41eda6b20 #for the mentors
git submodule update -f --init --recursive
mkdir -p build && cd build
cmake .. -G Ninja -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON -DCMAKE_INSTALL_PREFIX=../install  -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0"  -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ -DCMAKE_CXX_FLAGS="-gpu=cuda11.0" -DCORENRN_ENABLE_UNIT_TESTS=OFF
cmake --build . --parallel 8

Running

As $HOME is not writable, create your own directory into the project directory:

mkdir -p /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER
cd /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER

Now run a tiny, functional test on the GPU:

  • Allocate a node
bsub -P GEN170 -J neuron -W 90 -nnodes 1 -alloc_flags "gpudefault" -Is $SHELL

Note: gpumps here would enable MPS, but that is incompatible with the CUDA 11.0 version we are using here.

  • Make necessary modules are loaded:
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
  • Run simple function test on the GPU:
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi
  • How to generate NSight System report with simple ringtest:
module load nvhpc/21.9
module swap cuda cuda/11.0.2
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
module load nsight-systems/2021.5.1.117
OMP_NUM_THREADS=1 jsrun -E LD_PRELOAD=/opt/ibm/spectrum_mpi/lib/pami_471/libpami.so --gpu_per_rs 1 -n 1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0,CALI_CONFIG=nvtx,OMP_NUM_THREADS=$OMP_NUM_THREADS --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation -o ring_report $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi --cell-permute 2
  • How to generate full NSight Compute report with simple ringtest for a single kernel:
module load nvhpc/21.9
module swap cuda cuda/11.0.2
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
module load nsight-compute/2021.2.2
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 ncu -k <kernel_name_you_want_to_analyze> -f --profile-from-start=off --target-processes all --set full -o ring_report $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi --cell-permute 2

Which Input Model Datasets To Use?

For the purposes of the hackathon we have provided two datasets that cover the basic use cases for the coming days.

1. Small dataset for development & testing

For running small tests you should use the following data set:

/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1440-cells-1-rank

This data set has a circuit of 1440 cells and is meant to be simulated with 1 MPI rank on 1 GPU.

To run CoreNEURON with this data set you may use the following command:

NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 \
  jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
  --datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1440-cells-1-rank \
  --mpi --gpu --cell-permute=2 --tstop=5

2. Large dataset for profiling and benchmarking

For running the simulator to gather profiling or other performance measurements you should use the following data set:

/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks

This data set resembles a production level simulation with a large enough circuit to fill up most of the GPU memory (~14GB). It is meant to be executed on 6 MPI ranks or 6 threads on 1 GPU and consists of 8800 cells. Note that MPS needs to be enabled if you are using pure MPI execution and MPS is not currently working on Ascent with CUDA 11.0, so we need to run this with 1 rank / 6 threads / 1 GPU.

To run CoreNEURON with this data set you may use the following command:

NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=6 \
  jsrun -n 1 --cpu_per_rs 6 --tasks_per_rs 1 --gpu_per_rs 1 --bind rs $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
  --datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks \
  --mpi --gpu --cell-permute=2 --tstop=5

Note: The datasets were generated with 1k synapses/cell and 100 stimuli/cell. Number of synapses and stimuli are fewer than the original model.

Perlmutter

Building

  • The system-wide modules only go up to NVHPC 21.7, which has known issues with NEURON. We have ourselves inserted an installation of NVHPC 21.9 (module nvidia/21.9) into the Cray Programming Environment setup.
  • Note that NVHPC 21.9 is configured to use the system GCC 7.5 standard library, while we have built some other dependencies with GCC 9.2. This seems to be "close enough".
  • -tp haswell or -DR123_USE_SSE=0 is required because nvc++ defaults to -tp zen on the Perlmutter nodes, which defines __ABM__ and causes Random123 to try and include intrin.h, which fails.
# clone repository
git clone --branch hackathon_main [email protected]:BlueBrain/CoreNeuron.git # or git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron.git
cd CoreNeuron
git submodule update -f --init --recursive
mkdir -p build && cd build

# allocate node
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 1 --account=ntrain9_g -n 1 --cpus-per-task 6

# Use our own hand-crafted modules for cuda 11.4 (from nvhpc/21.9) and PrgEnv-nvidia for nvhpc 21.9
module use /global/cfs/cdirs/ntrain9/neuron/modules
# Also, spack-generated modules for dependencies (caliper, ninja, py-*)
module use /global/cfs/cdirs/ntrain9/neuron/spack_modules/cray-sles15-zen2

# Load modules: prefer CUDA 11.4 from NVHPC/21.9
module load cudatoolkit/21.9_11.4
module load cmake nvidia/21.9 python caliper ninja py-pytest py-pyyaml py-jinja2 py-sympy boost/1.77.0

# Build CoreNEURON
cmake .. -G Ninja \
  -DCORENRN_ENABLE_CALIPER_PROFILING=ON  \
  -DCORENRN_ENABLE_GPU=ON \
  -DCORENRN_ENABLE_NMODL=ON \
  -DCORENRN_EXTERNAL_BENCHMARK_DATA=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021 \
  -DCMAKE_INSTALL_PREFIX=../install \
  -DCMAKE_CUDA_COMPILER=nvcc \
  -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" \
  -DCMAKE_CXX_COMPILER=CC \
  -DCMAKE_CUDA_ARCHITECTURES=80 \
  -DCORENRN_ENABLE_UNIT_TESTS=OFF

cmake --build . --parallel
ctest --output-on-failure -j 16 # optional

Running on interactive session

If you haven't allocated a session

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 1 --account=ntrain9_g -n 1 --cpus-per-task 6

Note that we are setting NVCOMPILER_ACC_SYNCHRONOUS=1 below so that we get correct timings for individual kernels. Otherwise kernel timings are incorrect due to async launch.

Run a channel-benchmark test

# NVCOMPILER_ACC_SYNCHRONOUS=1 // if necessary

OMP_NUM_THREADS=6 \
CALI_CONFIG=runtime-report,calc.inclusive \
  srun -n 1 benchmark/x86_64/special-core \
  --datpath=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-8800-cells-6-ranks \
  --mpi --gpu --cell-permute=2 --tstop=100

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 de4e433 (2021-11-26 08:50:58 +0100)

 Additional mechanisms from files
 Ca.mod CaDynamics_DC0.mod CaDynamics_E2.mod Ca_HVA.mod Ca_HVA2.mod Ca_LVAst.mod CoreConfig.mod Ih.mod Im.mod K_Pst.mod K_Tst.mod KdShu2007.mod NaTa_t.mod NaTg.mod NaTs2_t.mod Nap_Et2.mod ProbAMPANMDA_EMS.mod ProbGABAAB_EMS.mod ProfileHelper.mod SK_E2.mod SKv3_1.mod TTXDynamicsSwitch.mod VecStim.mod cacumm.mod cacummb.mod cagk.mod cal2.mod can2.mod cat.mod exp2syn.mod expsyn.mod gap.mod h.mod halfgap.mod hh.mod kadist.mod kaprox.mod kca.mod kd.mod kd2.mod kdb.mod kdrbca1.mod kdrca1.mod km.mod kmb.mod na3n.mod naxn.mod netstim.mod netstim_inhpoisson.mod new_calcium_channels.mod passive.mod pattern.mod stim.mod svclmp.mod

 num_mpi=6
 num_omp_thread=1


....
Solver Time : 16.2698


 Simulation Statistics
 Number of cells: 1320
 Number of compartments: 648360
 Number of presyns: 3961320
 Number of input presyns: 0
 Number of synapses: 3960002
 Number of point processes: 7921322
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 11969
 Number of spikes with non negative gid-s: 11969
Path                                     Min time/rank Max time/rank Avg time/rank Time %
main                                         40.376340     40.378758     40.377549 99.399072
  checkpoint                                  0.000001      0.000002      0.000002  0.000004
  output-spike                                0.004200      0.004211      0.004205  0.010353
  simulation                                 16.269795     16.269813     16.269804 40.052045
    spike-exchange                            0.019603      0.041487      0.030545  0.075194
      spike-exchange                          0.019578      0.041452      0.030515  0.075120
        communication                         0.000886      0.000906      0.000896  0.002206
        imbalance                             0.018654      0.040551      0.029603  0.072874
    timestep                                 16.225238     16.247191     16.236214 39.969356
      state-update                            4.167419      4.176959      4.172189 10.270849
        state-SKv3_1                          0.084673      0.086011      0.085342  0.210090
        state-SK_E2                           0.086975      0.087625      0.087300  0.214910
        state-ProbGABAAB_EMS                  0.176391      0.182205      0.179298  0.441385
        state-ProbAMPANMDA_EMS                0.512746      0.521091      0.516919  1.272520
        state-nax                             0.135361      0.136370      0.135866  0.334466
        state-NaTg                            0.121694      0.121713      0.121704  0.299603
        state-Nap_Et2                         0.091850      0.092020      0.091935  0.226320
        state-na3                             0.126686      0.127354      0.127020  0.312690
        state-K_Tst                           0.083087      0.083141      0.083114  0.204605
        state-K_Pst                           0.086288      0.086488      0.086388  0.212665
        state-kmb                             0.078860      0.079250      0.079055  0.194613
        state-KdShu2007                       0.077877      0.078125      0.078001  0.192018
        state-kdr                             0.094169      0.095289      0.094729  0.233198
        state-kdrb                            0.092489      0.092745      0.092617  0.227999
        state-kdb                             0.089317      0.089467      0.089392  0.220060
        state-kd2                             0.087586      0.092668      0.090127  0.221869
        state-kca                             0.363553      0.364731      0.364142  0.896423
        state-kap                             0.109070      0.109326      0.109198  0.268817
        state-kad                             0.112513      0.113974      0.113244  0.278776
        state-Ih                              0.098592      0.100036      0.099314  0.244485
        state-hd                              0.093146      0.094326      0.093736  0.230754
        state-cat                             0.121305      0.121554      0.121430  0.298928
        state-can                             0.121067      0.122852      0.121960  0.300233
        state-Ca_LVAst                        0.129896      0.130772      0.130334  0.320849
        state-cal                             0.108788      0.109597      0.109192  0.268804
        state-Ca_HVA2                         0.142881      0.143095      0.142988  0.351999
        state-cagk                            0.097391      0.098042      0.097716  0.240553
        state-cacum                           0.234768      0.235131      0.234950  0.578385
        state-cacumb                          0.175342      0.176663      0.176002  0.433273
        state-IClamp                          0.002348      0.002558      0.002453  0.006039
        state-CaDynamics_DC0                  0.113226      0.113724      0.113475  0.279346
        state-pas                             0.002896      0.003032      0.002964  0.007297
      update                                  0.116486      0.117232      0.116859  0.287677
      second-order-cur                        0.001943      0.002165      0.002054  0.005056
      matrix-solver                           6.071176      6.123533      6.097355 15.010108
      setup-tree-matrix                       4.745456      4.748895      4.747175 11.686317
        cur-SKv3_1                            0.104343      0.105509      0.104926  0.258301
        cur-SK_E2                             0.097890      0.098530      0.098210  0.241768
        cur-ProbGABAAB_EMS                    0.252629      0.255678      0.254154  0.625660
        cur-ProbAMPANMDA_EMS                  0.903469      0.903876      0.903672  2.224608
        cur-nax                               0.113687      0.115076      0.114382  0.281578
        cur-NaTg                              0.113117      0.113994      0.113556  0.279544
        cur-Nap_Et2                           0.084684      0.084767      0.084725  0.208572
        cur-na3                               0.095460      0.096301      0.095880  0.236033
        cur-K_Tst                             0.076803      0.076846      0.076825  0.189122
        cur-K_Pst                             0.083091      0.083296      0.083193  0.204801
        cur-kmb                               0.084998      0.085193      0.085095  0.209483
        cur-KdShu2007                         0.079360      0.079602      0.079481  0.195662
        cur-kdr                               0.084818      0.085221      0.085019  0.209296
        cur-kdrb                              0.083571      0.083837      0.083704  0.206058
        cur-kdb                               0.074649      0.075234      0.074942  0.184487
        cur-kd2                               0.074704      0.075905      0.075305  0.185380
        cur-kca                               0.102095      0.102144      0.102119  0.251392
        cur-kap                               0.093734      0.094259      0.093996  0.231395
        cur-kad                               0.096892      0.096943      0.096917  0.238586
        cur-Ih                                0.099069      0.099392      0.099230  0.244280
        cur-hd                                0.079998      0.080359      0.080179  0.197379
        cur-cat                               0.103624      0.103911      0.103767  0.255449
        cur-can                               0.118858      0.119161      0.119010  0.292971
        cur-Ca_LVAst                          0.123353      0.126097      0.124725  0.307041
        cur-cal                               0.122882      0.123123      0.123003  0.302800
        cur-Ca_HVA2                           0.142775      0.143094      0.142934  0.351868
        cur-cagk                              0.122285      0.123171      0.122728  0.302125
        cur-cacum                             0.081801      0.082285      0.082043  0.201969
        cur-cacumb                            0.074288      0.075040      0.074664  0.183803
        cur-IClamp                            0.085479      0.088503      0.086991  0.214149
        cur-CaDynamics_DC0                    0.064449      0.064505      0.064477  0.158726
        cur-ttx_ion                           0.063089      0.063417      0.063253  0.155713
        cur-ca_ion                            0.100195      0.100529      0.100362  0.247065
        cur-k_ion                             0.063816      0.064154      0.063985  0.157515
        cur-na_ion                            0.066107      0.066272      0.066190  0.162941
        cur-pas                               0.109251      0.109293      0.109272  0.268999
      deliver-events                          0.982300      1.005465      0.993883  2.446681
        net-receive-ProbGABAAB_EMS            0.000950      0.001073      0.001012  0.002490
        net-receive-ProbAMPANMDA_EMS          0.003695      0.003919      0.003807  0.009372
        net-buf-receive-ExpSyn                0.003570      0.003671      0.003621  0.008913
        net-buf-receive-Exp2Syn               0.003701      0.003835      0.003768  0.009276
        net-buf-receive-ProbGABAAB_EMS        0.068577      0.070341      0.069459  0.170990
        net-buf-receive-ProbAMPANMDA_EMS      0.090366      0.094045      0.092206  0.226986
        update-net-receive-buf                0.399891      0.406366      0.403129  0.992398
          net-receive-buf-cpu2gpu             0.379896      0.385496      0.382696  0.942098
          net-receive-buf-order               0.003375      0.003533      0.003454  0.008503
        check-threshold                       0.182114      0.184026      0.183070  0.450671
  finitialize                                 2.094235      2.094272      2.094254  5.155510
    spike-exchange                            0.000035      0.105358      0.052696  0.129725
      spike-exchange                          0.000031      0.105354      0.052693  0.129715
        communication                         0.000019      0.000022      0.000020  0.000050
        imbalance                             0.000006      0.105331      0.052669  0.129656
    cur-SKv3_1                                0.000031      0.000032      0.000031  0.000078
    cur-SK_E2                                 0.000029      0.000031      0.000030  0.000074
    cur-ProbGABAAB_EMS                        0.000072      0.000073      0.000073  0.000178
    cur-ProbAMPANMDA_EMS                      0.000265      0.000267      0.000266  0.000655
    cur-nax                                   0.000034      0.000034      0.000034  0.000084
    cur-NaTg                                  0.000032      0.000034      0.000033  0.000081
    cur-Nap_Et2                               0.000025      0.000026      0.000025  0.000063
    cur-na3                                   0.000029      0.000030      0.000029  0.000073
    cur-K_Tst                                 0.000023      0.000024      0.000024  0.000058
    cur-K_Pst                                 0.000025      0.000026      0.000025  0.000063
    cur-kmb                                   0.000026      0.000027      0.000027  0.000065
    cur-KdShu2007                             0.000024      0.000024      0.000024  0.000059
    cur-kdr                                   0.000026      0.000026      0.000026  0.000064
    cur-kdrb                                  0.000025      0.000025      0.000025  0.000062
    cur-kdb                                   0.000023      0.000023      0.000023  0.000057
    cur-kd2                                   0.000024      0.000024      0.000024  0.000059
    cur-kca                                   0.000031      0.000032      0.000031  0.000078
    cur-kap                                   0.000027      0.000028      0.000027  0.000068
    cur-kad                                   0.000028      0.000030      0.000029  0.000071
    cur-Ih                                    0.000029      0.000030      0.000029  0.000073
    cur-hd                                    0.000024      0.000025      0.000024  0.000060
    cur-cat                                   0.000030      0.000032      0.000031  0.000076
    cur-can                                   0.000035      0.000035      0.000035  0.000086
    cur-Ca_LVAst                              0.000035      0.000036      0.000035  0.000087
    cur-cal                                   0.000036      0.000037      0.000036  0.000090
    cur-Ca_HVA2                               0.000039      0.000041      0.000040  0.000098
    cur-cagk                                  0.000037      0.000038      0.000037  0.000092
    cur-cacum                                 0.000025      0.000026      0.000025  0.000063
    cur-cacumb                                0.000024      0.000025      0.000024  0.000060
    cur-IClamp                                0.000027      0.000028      0.000027  0.000068
    cur-CaDynamics_DC0                        0.000021      0.000022      0.000022  0.000053
    cur-ttx_ion                               0.000026      0.000026      0.000026  0.000064
    cur-ca_ion                                0.000028      0.000029      0.000029  0.000070
    cur-k_ion                                 0.000018      0.000019      0.000018  0.000046
    cur-na_ion                                0.000022      0.000023      0.000022  0.000055
    cur-pas                                   0.000034      0.000034      0.000034  0.000084
    update-net-receive-buf                    0.000021      0.000023      0.000022  0.000054
  load-model                                 21.857876     21.860913     21.859395 53.812170

Produce an NSight Systems profile

To produce a useful profile with NSight Systems we need to configure Caliper to emit NVTX markers (CALI_CONFIG=nvtx) and tell NSight Systems to record regions with names that are not registered strings (NSYS_NVTX_PROFILER_REGISTER_ONLY=0). To avoid profiling model initialisation and setup, you may want to only record the actual simulation (--capture-range=nvtx --nvtx-capture=simulation). Additionally, NSight Systems seems to have trouble profiling multiple OpenMP host threads launching GPU kernels at once, so you may want to disable that (OMP_NUM_THREADS=1). Taken together, an example prefix could be:

CALI_CONFIG=nvtx OMP_NUM_THREADS=1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0 --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation ./x86_64/special-core ...

Some other notes:

  • during development, just use simple ring test (for quick iteration)
  • install nmodl master into project space $CFS/ntrain9/neuron so that mentors or other people can just use standard version for profiling or other non-codegen related tasks.

Additional useful repositories

Bugs

  • With the hackathon_main branch, if we build with -DCMAKE_BUILD_TYPE=Debug, ring_permute2_TEST and ring_gap_permute2_TEST are failing.