This repository contains a collection of benchmarks to evaluate hardware memory performance. Most benchmarks run stencil computations as used in numerical weather prediction and climate simulation. You may use these benchmarks for the following tasks:
- Measuring the sustainable memory bandwidth of some hardware (CPU or GPU)
- Comparing various implementation strategies for typical numerical schemes used in weather and climate simulation
- Estimating the performance of memory-bound algorithms on various hardware architectures
For most benchmarks, we provide several implementations. Each is specifically designed and optimized to run either on CPUs or on AMD and NVIDIA GPUs. All common Linux systems are supported, and the benchmarks have been tested on x86 and aarch64 platforms.
This guide walks you through the installation and introduces the command line interface with some example invocations. It demonstrates how to run benchmarks, adapt parameters to your needs, and analyze the output.
Note that all provided stencil benchmarks are based on typical codes for numerical weather and climate prediction. Most weather and climate models use very similar numerical methods and thus computational patterns. However, stencil codes used in other scientific domains employ different schemes, so you should not directly transfer performance results of the present benchmarks to stencil codes from other domains. In contrast, the provided implementations of the STREAM benchmark measure only the peak sustainable memory bandwidth, which provides an upper performance limit for any kind of bandwidth-bound algorithm.
Before installing, make sure that you have the following dependencies ready:
- A Python distribution (version ≥ 3.6)
- pip, preferably within a virtual environment
- A C++ compiler
- Optionally, only if you want to run GPU benchmarks: a CUDA or HIP compiler
Use the following command to directly install the code from the GitHub repository:
$ pip install git+https://github.com/GridTools/stencil_benchmarks.git
To verify the installation and availability of the command line interface, run the following command (which should just print a help message):
$ sbench --help
Under some circumstances, the installation finishes without error, but this last step fails. This usually means that pip could not automatically add its binary directory to your PATH. To solve this issue, update the PATH environment variable manually or install inside a virtual environment (recommended).
The STREAM benchmark is a simple benchmark for measuring sustainable memory bandwidth using basic element-wise vector operations. We offer this benchmark in the following three flavors:
- The original CPU STREAM code by John D. McCalpin
- An optimized CPU implementation with architecture-specific tuning
- A CUDA/HIP-based GPU implementation
This section shows you how to run each of these, how to use the help system, and how to analyze the results.
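For reference, the four STREAM kernels (copy, scale, add, and triad) are simple element-wise vector operations, and the reported bandwidth is the number of bytes read and written divided by the run time. The following NumPy sketch only illustrates these operations and the bandwidth formula; it is not the code that sbench runs:
import time
import numpy as np

n = 100_000_000                          # elements per array
a, b, c = (np.zeros(n) for _ in range(3))
b[:] = 2.0
c[:] = 0.5
scalar = 3.0

# (operation, bytes moved) per kernel; STREAM counts 8 B per element
kernels = {
    "copy":  (lambda: np.copyto(c, a),                2 * 8 * n),
    "scale": (lambda: np.multiply(c, scalar, out=b),  2 * 8 * n),
    "add":   (lambda: np.add(a, b, out=c),            3 * 8 * n),
    "triad": (lambda: np.add(b, scalar * c, out=a),   3 * 8 * n),
}

for name, (run, nbytes) in kernels.items():
    start = time.perf_counter()
    run()
    elapsed = time.perf_counter() - start
    # STREAM reports MB/s with MB = 10^6 B; NumPy creates temporaries
    # (e.g. for scalar * c), so treat these numbers as illustrative only.
    print(f"{name}: {nbytes / elapsed / 1e6:.1f} MB/s")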
By using the help system, you can easily find the correct command in four steps:
1. sbench --help shows that there are two subcommands: stencils and stream. Of course, you choose stream.
2. sbench stream --help shows two further subcommands: cuda-hip and mc-calpin. We don't want to run on a GPU, so let's take mc-calpin.
3. sbench stream mc-calpin --help shows another two subcommands: original and native. First, choose original, which is the unmodified STREAM benchmark code.
4. sbench stream mc-calpin original --help does not show any further subcommands, but many options. This means that you have landed at a final executable command, so you found what you were looking for: sbench stream mc-calpin original.
Optionally, you can also enable auto-completion in your shell by following the instructions in the click documentation.
Now that you have found the command, you can just execute it without specifying additional arguments. It will print a table with bandwidth and run time numbers (of course, the numbers can be very different on your platform):
$ sbench stream mc-calpin original
bandwidth avg-time time max-time ticks
name
copy 157458.6 0.001385 0.001016 0.003338 862
scale 140483.3 0.001427 0.001139 0.002405 862
triad 157804.2 0.001867 0.001521 0.002930 862
add 155320.6 0.001798 0.001545 0.002721 862
Those are results from a dual-socket Xeon Gold 6140. So we get around 150 GB/s, which is not bad. But according to Intel, the peak achievable bandwidth should be around 190 GB/s. What's missing?
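As a rough sanity check of where such a figure comes from, you can estimate the theoretical peak from the memory configuration. The sketch below assumes that all six DDR4-2666 channels per socket are populated on both sockets of the Xeon Gold 6140 system; this is an assumption about our test machine, not something sbench reports:
# Back-of-the-envelope peak bandwidth estimate for a dual-socket
# Skylake-SP system with six DDR4-2666 channels per socket (assumed).
sockets = 2
channels_per_socket = 6
transfers_per_second = 2666e6   # DDR4-2666: 2666 MT/s
bytes_per_transfer = 8          # 64-bit memory bus per channel

peak = sockets * channels_per_socket * transfers_per_second * bytes_per_transfer
print(f"theoretical peak: {peak / 1e9:.0f} GB/s")   # ~256 GB/s
# The ~190 GB/s quoted by Intel is the sustainable fraction of that peak,
# and ~150 GB/s from the unmodified STREAM code is still well below it.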
First, you should make sure that you are really measuring memory bandwidth and not some cache performance. You can do this by increasing the array size. Just multiply the default value by 10 for now:
$ sbench stream mc-calpin original --array-size 100000000
bandwidth avg-time time max-time ticks
name
copy 131937.8 0.015163 0.012127 0.122293 8550
scale 130536.6 0.013819 0.012257 0.013386 8550
add 146037.0 0.018579 0.016434 0.018062 8550
triad 146003.1 0.018432 0.016438 0.021834 8550
So the numbers got even worse in our example, which means the previous run got some help from the CPU's caches due to the too-small data size! Each array now holds 10⁸ × 8 B, so roughly 0.8 GB. This should be enough; to double-check, you can run again with an even bigger size and make sure the bandwidth numbers stay approximately the same. If your bandwidth numbers did not decrease when you increased the array size, you can of course also continue with the original size, as your CPU might just have a smaller cache.
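It also helps to compare the total working set against the available caches. A small sketch of that arithmetic (the cache size below is an assumption about our Xeon Gold 6140 test system):
array_size = 100_000_000        # elements per array (--array-size)
element_bytes = 8               # float64
arrays = 3                      # STREAM uses three arrays: a, b, c

working_set = arrays * array_size * element_bytes
llc = 2 * 25e6                  # assumed ~25 MB last-level cache per socket, two sockets

print(f"working set: {working_set / 1e9:.1f} GB")    # 2.4 GB
print(f"caches:      {llc / 1e6:.0f} MB")            # 50 MB
# The working set exceeds the caches by roughly a factor of 50,
# so the measurement is dominated by main memory traffic.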
Now you are ready to try a better, that is, an optimized STREAM implementation. This is as easy as swapping original with native in your command: sbench stream mc-calpin native --array-size 100000000. This enables a custom STREAM kernel (if you want to see the code, just pass --print-kernels). But you might be disappointed: the result after swapping the kernels should be pretty much the same. The native implementation indeed employs manual vectorization, but the STREAM code is so simple that every compiler should be able to vectorize it automatically, so there is no big difference by default. But by selecting the correct architecture, you should get much better results:
$ sbench stream mc-calpin native --array-size 100000000 --architecture x86-avx512
bandwidth avg-time time max-time ticks
name
copy 184842.4 0.009964 0.008656 0.009932 8696
scale 184011.1 0.009923 0.008695 0.010091 8696
triad 185989.9 0.014748 0.012904 0.014969 8696
add 185746.2 0.014778 0.012921 0.014376 8696
The reason for this improvement is the use of non-temporal or streaming store instructions. They avoid an unnecessary load of the output array data, which significantly reduces the required memory transfers.
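To see why this matters, consider the copy kernel: STREAM counts one read of the input array and one write of the output array, but with regular stores the hardware additionally reads the output array into the cache before overwriting it (write-allocate). A short sketch of the arithmetic:
n = 100_000_000                 # elements per array
element_bytes = 8

counted = 2 * n * element_bytes              # STREAM counts: read input + write output
with_write_allocate = 3 * n * element_bytes  # actual traffic with regular stores

# With regular stores, the reported bandwidth understates the hardware's
# capability by the ratio of actual to counted traffic:
print(with_write_allocate / counted)         # 1.5
# Non-temporal stores skip the read of the output array, so the actual
# traffic matches the counted 2*n*8 bytes and the reported number goes up.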
This is almost 190 GB/s now, but correctly setting the thread affinity and reducing the thread count (not using hyper-threading) helps a bit:
$ OMP_NUM_THREADS=36 OMP_PLACES='{0}:36' sbench stream mc-calpin native --array-size 100000000 --architecture x86-avx512
bandwidth avg-time time max-time ticks
name
copy 189129.6 0.009425 0.008460 0.008664 8516
scale 188322.9 0.009489 0.008496 0.008581 8516
add 191480.7 0.013973 0.012534 0.012644 8516
triad 191309.6 0.013987 0.012545 0.012672 8516
Things to remember:
- The STREAM array size must be large enough, otherwise you are not measuring memory bandwidth.
- Achieving peak STREAM bandwidth requires architecture-specific optimizations. On most CPUs this includes non-temporal/streaming stores.
- Setting CPU affinity helps to achieve peak bandwidth.
- Hyper-threading, or simultaneous multithreading in general, is often not required to achieve peak memory bandwidth.
The STREAM benchmark reports its results in the same way as the original STREAM code (the original CPU flavor actually runs the original STREAM code). Thus, the bandwidth results are in megabytes per second (MB/s = 10⁶ B/s). Note that some other tools report bandwidth numbers in mebibytes per second (MiB/s = 2²⁰ B/s). And some tools say they report MB/s but actually report MiB/s! So make sure to double-check your bandwidth units.
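If you need to compare against a tool that uses binary prefixes, the conversion is a factor of 10⁶/2²⁰ ≈ 0.954; a minimal sketch:
mb_per_s = 189129.6                       # copy bandwidth reported above (MB/s, 10^6 B/s)
mib_per_s = mb_per_s * 1e6 / 2**20        # the same value in MiB/s (2^20 B/s)
print(f"{mib_per_s:.1f} MiB/s")           # ~180368 MiB/s, about 4.6% lower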
Note: sbench currently supports NVIDIA GPUs using CUDA and AMD GPUs using HIP with a matching compiler (NVCC or HIPCC).
Besides the original STREAM implementation for CPUs, sbench also comes with a highly tunable implementation for GPUs. In the previous section, you learnt how to find the correct command. By applying the same strategy, you can find and run the following:
$ sbench stream cuda-hip native --compiler nvcc
WARNING: adapting array size to match block and vector sizes
bandwidth avg-time time max-time
name
copy 785206.0 0.000208 0.000204 0.000219
scale 785206.0 0.000208 0.000204 0.000219
add 811017.3 0.000300 0.000296 0.000313
triad 811017.3 0.000299 0.000296 0.000313
Pretty fast! But maybe that's not pure memory bandwidth but just cache performance? From the previous section, you know that you should increase the array size to check. You probably also noticed the warning WARNING: adapting array size to match block and vector sizes. You can avoid that warning by choosing an array size which is divisible by the block and vector sizes. You can just choose a power of two to almost certainly fix that warning, independently of further parameter changes. So let's try 2²⁷:
$ sbench stream cuda-hip native --compiler nvcc --array-size 134217728
bandwidth avg-time time max-time
name
scale 801663.6 0.002783 0.002679 0.002869
copy 796194.7 0.002786 0.002697 0.002867
add 825224.0 0.004034 0.003903 0.004131
triad 823711.4 0.004031 0.003911 0.004126
This actually increased the bandwidth, which means that either there was previously not enough data to fully occupy the device or frequency boosting plays some unhappy role. In either case, you might want to increase the size again by a factor of two:
$ sbench stream cuda-hip native --compiler nvcc --array-size 268435456
bandwidth avg-time time max-time
name
copy 803352.6 0.005589 0.005346 0.005783
scale 803352.6 0.005568 0.005346 0.005768
add 818986.7 0.008108 0.007866 0.008335
triad 818773.6 0.008080 0.007868 0.008276
On our device, an NVIDIA V100, further doubling does not change the numbers significantly, so we stay with 268435456. Of course, you might have to adapt the array size differently on a different device. After playing with the available options (use --help again), we came up with the following:
$ CUDA_AUTO_BOOST=0 sbench stream cuda-hip native --compiler nvcc --array-size 268435456 --vector-size 2
bandwidth avg-time time max-time
name
copy 823704.6 0.005493 0.005214 0.005633
scale 823704.6 0.005492 0.005214 0.005633
triad 841328.7 0.007934 0.007657 0.008161
add 840429.6 0.007953 0.007666 0.008177
On NVIDIA GPUs, you might want to set the environment variable CUDA_AUTO_BOOST=0 to minimize the influence of frequency throttling. Further, experimenting with different floating point types (e.g., --dtype float32) and vector sizes might be interesting.
On AMD GPUs, --no-explicit-vectorization in combination with --streaming-loads, --streaming-stores, and --unroll-factor might further help to achieve higher bandwidth.
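To put these numbers into perspective: assuming our device is an NVIDIA V100 with a theoretical peak memory bandwidth of roughly 900 GB/s (an assumption about the test hardware, not something sbench reports), the tuned result above corresponds to well over 90% of peak:
measured = 841328.7e6           # best triad bandwidth from above, in B/s
theoretical_peak = 900e9        # assumed V100 HBM2 peak, in B/s

print(f"{measured / theoretical_peak:.0%} of theoretical peak")   # ~93%
# Well above 90% of peak is about as good as it gets for a
# bandwidth-bound kernel on this class of hardware.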
Here, you will learn how to run parameter ranges, a very powerful feature of sbench, and how to store and analyze its output using sbench-analyze.
sbench allows you to provide parameter ranges instead of single values to most of its command line options. Ranges are identified by opening and closing brackets ([, ]). There are two kinds of ranges: comma-separated ranges and numeric ranges.
Using a comma-separated range, you can for example run using single precision (float32/float) and double precision (float64/double) numbers in one command:
$ sbench stream cuda-hip native --compiler nvcc --dtype [float32,float64]
WARNING: adapting array size to match block and vector sizes
bandwidth avg-time time max-time
dtype name
float32 copy 751230.8 0.000108 0.000106 0.000109
scale 751005.1 0.000108 0.000107 0.000109
add 786523.5 0.000154 0.000153 0.000156
triad 781280.0 0.000154 0.000154 0.000155
float64 copy 785206.0 0.000208 0.000204 0.000219
scale 781402.1 0.000208 0.000205 0.000220
add 811105.0 0.000300 0.000296 0.000313
triad 808307.8 0.000298 0.000297 0.000313
Using a numeric range, you can easily run for a large set of numbers, for example, for many array sizes using --array-size [4096-268435456:*2]. The syntax for numeric ranges is [start-stop:step], where start and stop are any numbers and step consists of an operator (+, -, *, /) and a number. start and stop are inclusive. So in this example, you would run for the array sizes 4096, 8192, …, 134217728, 268435456.
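To make the expansion concrete, here is a small sketch that mimics how a multiplicative numeric range like the one above expands. It only illustrates the syntax described here and is not the parser used by sbench:
def expand_multiplicative_range(start, stop, factor):
    """Expand a range like [start-stop:*factor]; both ends are inclusive."""
    value = start
    while value <= stop:
        yield value
        value *= factor

print(list(expand_multiplicative_range(4096, 268435456, 2)))
# [4096, 8192, ..., 134217728, 268435456] -- 17 values in total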
Note that you can use multiple range arguments. In this case, the benchmark will be run using all possible combinations of arguments. We used the following command to write some example data to example.csv:
$ sbench -o example.csv stream cuda-hip native --compiler nvcc --array-size [1024-268435456:*2] --dtype float32 --vector-size [1,2,4] --block-size [32-1024:*2]
The presented range arguments can often be enough to run all combinations you need. But if you need more flexibility, it's also possible to combine the output of multiple runs using sbench-analyze merge.
Our example output is included in the repository and can be downloaded directly from here to follow this example analysis easily.
Start by looking at the values that haven't changed across all runs. This can be done with the following command:
$ sbench-analyze print --common example.csv
axis x
benchmark-name stencil_benchmarks.benchmarks_collection.stream.cuda_hip.Native
compiler nvcc
compiler-flags
dtype float32
explicit-vectorization True
index-type std::size_t
launch-bounds True
ntimes 10
print-code False
sbench-version 0.11.0
streaming-loads False
streaming-stores False
unroll-factor 1
verify True
If you remove the flag --common, you will get a huge table output:
$ sbench-analyze print example.csv
name bandwidth avg-time time max-time array-size block-size vector-size
0 copy 8000.0 0.000005 0.000004 0.000006 4096 32 1
1 scale 8000.0 0.000005 0.000004 0.000005 4096 32 1
2 add 12000.0 0.000005 0.000004 0.000005 4096 32 1
3 triad 9600.0 0.000005 0.000005 0.000005 4096 32 1
4 copy 6400.0 0.000005 0.000005 0.000005 4096 32 1
5 scale 6400.0 0.000005 0.000005 0.000006 4096 32 1
6 add 9660.4 0.000005 0.000005 0.000005 4096 32 1
7 triad 9660.4 0.000005 0.000005 0.000005 4096 32 1
8 copy 8000.0 0.000005 0.000004 0.000005 4096 32 1
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
Each row corresponds to the data collected from one benchmark run. Together with the common values from above, this covers all command line inputs and results. If you additionally store the system information (which you have to do separately), it should thus be straightforward to reproduce the results.
You can of course read the CSV data into your favorite spreadsheet editor, but you can also extract the most useful information with sbench-analyze.
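Since the output is a plain CSV file, you can also inspect it directly with pandas. A minimal sketch, assuming example.csv in the working directory and the column names shown above (the aggregation over repeated runs is our choice and may differ from what sbench-analyze does):
import pandas as pd

df = pd.read_csv("example.csv")

# Best block/vector size for the largest array size, analogous to the
# sbench-analyze print example shown below.
largest = df[df["array-size"] == 268435456]
table = largest.pivot_table(index="vector-size",
                            columns="block-size",
                            values="bandwidth",
                            aggfunc="median")
print(table)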
As a first example, you might be interested in the best block and vector size for the largest array size. For this, several arguments are required:
- --filter '`array-size` == 268435456' to filter the data by the array size.
- --select bandwidth to select the bandwidth as the main quantity to display.
- --group vector-size --group block-size to select the output row and column quantities.
- --unstack to get a well-readable 2D table by using the last group (here block-size) as the column axis. Try without this option to see the difference.
The full command and output are as follows (using short options where possible):
$ sbench-analyze print -g vector-size -g block-size -u -s bandwidth -f '`array-size` == 268435456' example.csv
block-size 32 64 128 256 512 1024
vector-size
1 205043.50 407019.70 720400.80 791182.30 788848.5 784519.75
2 409830.60 736322.25 813471.05 812840.45 810799.9 808631.25
4 736322.25 827528.10 827850.90 828122.70 828453.7 827850.90
So it looks like a vector size of 4 is required to get full bandwidth. Also, the block size should probably be at least 64. To verify this, you can plot a graph with the array size on the horizontal axis, the bandwidth on the vertical axis, and a curve for each block size. This can be done using sbench-analyze plot and the following options:
- --filter '`vector-size` == 4' to only use data for the optimal vector size.
- --select bandwidth to select the bandwidth as the main quantity to display.
- --group block-size --group array-size to select the output row and column quantities.
- --uniform to evenly space the array size values on the x axis.
- --title 'Block Size Analysis' to set the plot title.
- --output example_block_size.svg to save the output as an SVG image (you can choose any image format supported by Matplotlib).
So the full command is:
$ sbench-analyze plot -g block-size -g array-size -s bandwidth -f '`vector-size` == 4' --uniform -o example_block_size.svg -t 'Block Size Analysis' example.csv
And the output, which confirms our assumption that the block size has to be at least 128:
To conclude, you can finally plot the peak sustainable memory bandwidth as a function of the array size using the following command:
$ sbench-analyze plot -g name -g array-size -s bandwidth -f '`vector-size` == 4 and `block-size` == 128' --uniform -o example_achievable_bandwidth.svg -t 'Achievable Bandwidth' example.csv
Which gives the result:
Besides the STREAM benchmark, sbench includes a set of stencil-like computations on structured Cartesian 3D domains. They can be run with sbench stencils BACKEND STENCIL IMPLEMENTATION, where BACKEND, STENCIL, and IMPLEMENTATION are placeholders. For BACKEND, you can choose between:
- openmp: generated OpenMP code with many options.
- openmp-blocked: generated OpenMP code using a blocked storage layout.
- cuda-hip: generated CUDA/HIP code for NVIDIA/AMD GPUs.
- jax: experimental Jax implementations.
- numpy: simple Numpy implementations.
- numba-cpu: incomplete Numba implementations.
For STENCIL, you can choose between basic, horizontal-diffusion, and vertical-advection for most backends.
When choosing basic, most backends provide implementations for the following stencils (a Numpy sketch of two of them follows below the list):
- Copy stencil: $out_{i, j, k} = in_{i, j, k}$. The most trivial stencil; similar to STREAM copy, but on a 3D domain.
- One-sided average: $out_{i, j, k} = {1 \over 2}(in_{i, j, k} + in_{i + 1, j, k})$. Useful to find problems with large strides.
- Symmetric average: $out_{i, j, k} = {1 \over 2}(in_{i - 1, j, k} + in_{i + 1, j, k})$.
- Standard 7-point Laplacian.
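For illustration, here is a minimal plain-NumPy sketch of the copy stencil and a standard 7-point Laplacian on a 3D array. It only mirrors the mathematical definitions above and is not the code of the numpy backend; sign conventions and boundary handling in the benchmarks may differ:
import numpy as np

def copy_stencil(inp):
    # out[i, j, k] = in[i, j, k]
    return inp.copy()

def laplacian_7pt(inp):
    # Standard 7-point Laplacian on the interior points of a 3D array
    # (sign convention chosen for illustration).
    out = np.zeros_like(inp)
    out[1:-1, 1:-1, 1:-1] = (
        -6.0 * inp[1:-1, 1:-1, 1:-1]
        + inp[2:, 1:-1, 1:-1] + inp[:-2, 1:-1, 1:-1]
        + inp[1:-1, 2:, 1:-1] + inp[1:-1, :-2, 1:-1]
        + inp[1:-1, 1:-1, 2:] + inp[1:-1, 1:-1, :-2]
    )
    return out

inp = np.random.rand(64, 64, 64)
print(laplacian_7pt(inp).shape)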
horizontal-diffusion is a composed stencil, and vertical-advection is a vertical tridiagonal solver. Both are very similar to the computations used in the COSMO weather model. Some backends provide many different implementations for these computations.
If you have problems using the software provided in this repository, please open a GitHub issue.