Tutorial: put a function on the roofline model #28

Open
antoine-levitt opened this issue May 6, 2022 · 23 comments
Labels: discussion, documentation


@antoine-levitt

Great package! As far as I understand, it provides the information needed to locate an application on the roofline model, but finding that info without screwing up feels a bit scary to me. Could an example be put in a tutorial or in the library, so that I could do something like roofline(f) and it'd plot the roofline model and locate f() on it?

(this is for teaching basic performance notions)

antoine-levitt changed the title from "Tutorial, put a function on the roofline model" to "Tutorial: put a function on the roofline model" May 6, 2022
@carstenbauer
Member

Hey Antoine,

That's a great idea and something I thought about as well at some point. Would be really nice to have. LIKWID itself has an "Empirical Roofline Model" tutorial and I think we should add a similar tutorial to the LIKWID.jl documentation.
Automating it fully (as in a roofline(f) function) is a bit tricky though. A few (potentially obvious) things that come to mind:

  • How to reliably obtain the maximal performance and data bandwidth of the system? Typically, one needs to either put in these values by hand (e.g. derived from data sheets of the CPU) or perform benchmarks to empirically estimate these quantities. But that can be hard and require some fine-tuning, in particular for the peak performance (for the max bandwidth we could use STREAMBenchmark.jl). See the rough sketch after this list.
  • Thread pinning: How should we pin Julia threads (i.e. for optimal performance of a general f)? Threads need to be pinned properly to monitor performance with hardware counters. (Could restrict usage to single-threaded functions?)
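
For illustration, a minimal sketch of the two ceilings, using the gemm-based LinearAlgebra.peakflops() and STREAMBenchmark.jl as rough stand-ins for proper likwid-bench runs (this assumes STREAMBenchmark's memory_bandwidth(), which reports MByte/s; both numbers are estimates, not true hardware maxima):

using LinearAlgebra, STREAMBenchmark

peak_gflops = LinearAlgebra.peakflops() / 1e9   # gemm-based FLOP/s estimate
peak_bw_gbs = memory_bandwidth().median / 1e3   # median STREAM bandwidth, MByte/s -> GByte/s

# attainable performance at operational intensity I (FLOP/Byte)
roofline(I) = min(peak_gflops, peak_bw_gbs * I)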

Having said that, it's still worth trying! I'll happily support any efforts in this direction, so feel free to draft a PR. Unfortunately, I probably won't have much free bandwidth for playing around with this any time soon. In any case, I'd start with the manual tutorial first and then try to make it automatic.

Best,
Carsten

(@ranocha Based on our previous discussions, e.g. https://gist.github.com/ranocha/0ad5716e77e55b2c61cbde10ad4f210c, I think you're also interested in this.)
(There has been a related effort by @vchuravy over at https://github.com/JuliaPerf/Roofline.jl based on linux perf event.)

@antoine-levitt
Author

Automating it fully (as in a roofline(f) function) is a bit tricky though

Yeah, it would make sense as a tutorial, since the roofline thing is not really meant as a black box but more as an explanatory thing (at least for me), so you want to see what goes into it.

How to reliably obtain the maximal performance and data bandwidth of the system? Typically, one needs to either put in these values by hand (e.g. derived from data sheets of the CPU) or perform benchmarks to empirically estimate these quantities. But that can be hard and require some fine tuning, in particular for the peak performance (for the max bandwidth we could use STREAMBenchmark.jl).

What's wrong with peakflops()?

Thread pinning

Single thread would be a good start.

Having said that, it's still worth trying! I'll happily support any efforts in this direction, so, feel free to draft a PR. Unfortunately, I probably won't have much free bandwidth for playing around with this any time soon. In any case, I'd start with the manual tutorial first and then try to make it automatic.

OK, I'll try to come up with something and you can correct me when I screw up then!

@antoine-levitt
Author

Oh, I just noticed that I have to start julia in a separate process under likwid and then kill julia to get the output. That's pretty annoying (e.g. I can't put it in a notebook); is there a way around it?

@carstenbauer
Member

carstenbauer commented May 6, 2022

Oh, I just noticed that I have to start julia in a separate process under likwid and then kill julia to get the output. That's pretty annoying (e.g. I can't put it in a notebook); is there a way around it?

Yes, see https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ (although this tutorial may not be in the best shape). Essentially, likwid-perfctr is just a Lua script that sets a bunch of environment variables, then calls into the C API, and eventually prints stuff. All of that can be done within a Julia session.

(We should probably also introduce some higher-level API to automate this "measure perf counters within a Julia session" workflow. That hasn't been my focus so far.)
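
For what it's worth, a rough sketch of what such a helper could look like, built only on the PerfMon calls that also appear later in this thread (the helper name is made up, not an existing API; single-threaded, pinned to core 0):

using LIKWID

# hypothetical helper: measure one performance group around f() on core 0
function measure_on_core0(f; group = "MEM_DP")
    LIKWID.pinthread(0)
    LIKWID.PerfMon.init(0)
    gid = LIKWID.PerfMon.add_event_set(group)
    LIKWID.PerfMon.setup_counters(gid)
    LIKWID.PerfMon.start_counters()
    f()
    LIKWID.PerfMon.stop_counters()
    metrics = LIKWID.PerfMon.get_metric_results(gid, 0)
    LIKWID.PerfMon.finalize()
    return metrics
end

# usage: measure_on_core0(() -> my_single_threaded_kernel())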

@ranocha
Contributor

ranocha commented May 6, 2022

Thanks for pinging me, @carstenbauer. Yes, I also think it's a nice idea. I just followed the LIKWID tutorial to get an empirical roofline model while working on Trixi.jl and our paper on performance stuff.

Here is some code I used for my experiments. Feel free to use it in a PR for the new tutorial (maybe adding me as a co-author on GitHub, if you like, by adding the line Co-authored-by: Hendrik Ranocha <[email protected]> to a commit message):

## gather data for the empirical roofline model
using LIKWID

# measure optimistic peakflops (AVX2 FMA or AVX512 FMA if available);
# the L1 cache size is used as the working-set size for likwid-bench below
L1_cache_size = LIKWID.get_cpu_topology().cacheLevels[1].size ÷ 1024 # in kB

cpuinfo = LIKWID.get_cpu_info()
if occursin("AVX512", cpuinfo.features)
  likwid_bench_kernel = "peakflops_avx512_fma"
elseif occursin("AVX2", cpuinfo.features)
  likwid_bench_kernel = "peakflops_avx_fma"
else
  likwid_bench_kernel = "peakflops_sse_fma"
end

max_flops_string = read(`likwid-bench -t $likwid_bench_kernel -W N:$(L1_cache_size)kB:1`, String)
max_flops = parse(Float64, match(r"(MFlops/s:\s+)(\d+\.\d+)", max_flops_string).captures[2]) / 1024 # MFlop/s -> GFlop/s

# measure optimistic memory bandwidth using reads
if occursin("AVX512", cpuinfo.features)
  likwid_bench_kernel = "load_avx512"
elseif occursin("AVX2", cpuinfo.features)
  likwid_bench_kernel = "load_avx"
else
  likwid_bench_kernel = "load_sse"
end

max_bandwidth_string = read(`likwid-bench -t $likwid_bench_kernel -W N:2GB:1`, String)
max_bandwidth = parse(Float64, match(r"(MByte/s:\s+)(\d+\.\d+)", max_bandwidth_string).captures[2])

## gather data for volume terms implemented in Trixi.jl
measured_string = read(`likwid-perfctr -C 0 -g MEM_DP -m $(Base.julia_cmd()) --check-bounds=no --threads=1 $(joinpath(@__DIR__, "measure_volume_terms.jl"))`, String)

# You can combine different measurements by setting appropriate region names, e.g.,
# NAME_OF_THE_REGION_YOU_USED
offset = findfirst("Region NAME_OF_THE_REGION_YOU_USED", measured_string) |> last
m = match(r"(DP \[MFLOP/s\]\s+\|\s+)(\d+\.\d+)", measured_string, offset)
flops_NAME_OF_THE_REGION_YOU_USED = parse(Float64, m.captures[2]) / 1024
m = match(r"(Operational intensity\s+\|\s+)(\d+\.\d+)", measured_string, offset)
intensity_NAME_OF_THE_REGION_YOU_USED = parse(Float64, m.captures[2])
@info "NAME_OF_THE_REGION_YOU_USED" intensity_NAME_OF_THE_REGION_YOU_USED flops_NAME_OF_THE_REGION_YOU_USED

The file measure_volume_terms.jl contains basic setup code such as

using LIKWID

Marker.init()

# compile and cool down
compute_a_lot_of_stuff()
sleep(1.0)

# measure
@region "NAME_OF_THE_REGION_YOU_USED" begin
  compute_a_lot_of_stuff()
end

Marker.close()

in a function called once.
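
To then put the measured points on the roofline, something along these lines could work (a sketch, assuming Plots.jl and the variables from the snippet above, i.e. max_flops and flops_* in GFLOP/s and max_bandwidth in MByte/s):

using Plots

I = 10 .^ range(-2, 2; length = 200)                # operational intensity in FLOP/Byte
roof = min.(max_flops, max_bandwidth .* I ./ 1024)  # MByte/s * FLOP/Byte -> GFLOP/s

plot(I, roof; xscale = :log10, yscale = :log10,
     xlabel = "operational intensity [FLOP/Byte]", ylabel = "performance [GFLOP/s]",
     label = "roofline", legend = :bottomright)
scatter!([intensity_NAME_OF_THE_REGION_YOU_USED], [flops_NAME_OF_THE_REGION_YOU_USED];
         label = "NAME_OF_THE_REGION_YOU_USED")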

@carstenbauer
Member

carstenbauer commented May 6, 2022

What's wrong with peakflops()?

Nothing's fundamentally wrong with it; it's a fine starting point / a rough estimate. But it should be clear that it doesn't give you the true maximal achievable performance (running such a benchmark takes some care: which kernel to use? which parameters, e.g. problem size, to use? etc.). That's why LIKWID has likwid-bench and lots of kernels (see e.g. here) for different CPU types, often defined directly in machine instructions. Instead of gemm, I think a "pure FMA" kernel would be better suited for estimating the peak flops (LIKWID uses peakflops_avx_fma, at least on x86_64).

(Note that we also use pure FMA / WMMA kernels in GPUInspector.jl to measure the peakflops of NVIDIA GPUs.)
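
To illustrate the "which parameters?" point: peakflops() is gemm-based and its result can depend noticeably on the matrix size you pass. A quick illustration (nothing more than that):

using LinearAlgebra

for n in (256, 1024, 4096)
    gflops = LinearAlgebra.peakflops(n) / 1e9  # gemm-based estimate for an n x n matrix
    println("n = $n: ", round(gflops; digits = 1), " GFLOP/s")
end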

@antoine-levitt
Author

OK, https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ looks good. How do I get the operational intensity? It doesn't appear in the table at the end.

@TomTheBear
Collaborator

The operational intensity is only provided by the MEM_SP and MEM_DP groups directly (the example code uses FLOPS_DP and L2). If you want it for caches, you have to calculate it yourself using the FLOP/s count and the bandwidth value.
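
For example (a sketch; the dictionary names are hypothetical and the exact metric strings should be taken from the printed results of the two groups):

# FLOP/s from the FLOPS_DP group, bandwidth from the L2 group,
# both measured over the same region/code
dp_mflops = flops_metrics["DP [MFLOP/s]"]
l2_mbytes_per_s = l2_metrics["L2 bandwidth [MBytes/s]"]
intensity_L2 = dp_mflops / l2_mbytes_per_s  # FLOP per Byte of L2 traffic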

@carstenbauer
Member

Was about to write the same but Thomas beat me to it :)

@antoine-levitt
Author

Does this mean that I should do LIKWID.LIKWID_EVENTS("FLOPS_DP|MEM_DP")? It doesn't appear to change anything if I do.

@carstenbauer
Member

You can drop the FLOPS_DP part and only use MEM_DP (as they do in the empirical roofline tutorial). If the printed table then doesn't include it, maybe it's just the printing that is flawed?

@antoine-levitt
Author

Ooh, OK, not sure what happened but it works with only MEM_DP, thanks.

@TomTheBear
Collaborator

TomTheBear commented May 6, 2022

As far as I can see, there is no group switch in the example code at https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/; that's why only the first group gets measured (| is the group delimiter). You would need to run your benchmark code twice: once for the first group, then "switch groups" and run your benchmark again for the second group measurement.
The two groups have different gids, so when reading the results/metrics, you have to specify the right gid.
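
A hedged sketch of that workflow with the in-session PerfMon API (using only calls shown elsewhere in this thread; my_benchmark is a placeholder, and I'm assuming setup_counters can be called again to select the second group):

using LIKWID

LIKWID.pinthread(0)
LIKWID.PerfMon.init(0)
gid_flops = LIKWID.PerfMon.add_event_set("FLOPS_DP")
gid_mem   = LIKWID.PerfMon.add_event_set("MEM_DP")

# run the benchmark once per group; setup_counters selects the active group
for gid in (gid_flops, gid_mem)
    LIKWID.PerfMon.setup_counters(gid)
    LIKWID.PerfMon.start_counters()
    my_benchmark()                       # placeholder for the code under test
    LIKWID.PerfMon.stop_counters()
end

flops_results = LIKWID.PerfMon.get_metric_results(gid_flops, 0)
mem_results   = LIKWID.PerfMon.get_metric_results(gid_mem, 0)
LIKWID.PerfMon.finalize()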

@carstenbauer
Member

As far as I can see, there is no group switch in the example code

Good catch, I will extend the example in this direction when I find time for it.

@carstenbauer
Member

carstenbauer commented May 6, 2022

@antoine-levitt As an alternative to the approach in https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ (environment variables), you may take a look at https://github.com/JuliaPerf/LIKWID.jl/blob/main/examples/perfmon/perfmon.jl (using LIKWID.PerfMon). This also doesn't seem to suffer from issue #29.

@antoine-levitt
Author

That indeed seems nicer. But if I do that with MEM_DP I get

  "DP [MFLOP/s]"                      => 0.0
  "AVX DP [MFLOP/s]"                  => 0.0
  "Packed [MUOPS/s]"                  => 0.0
  "Scalar [MUOPS/s]"                  => 0.0
  "Memory load bandwidth [MBytes/s]"  => 1164.26
  "Memory load data volume [GBytes]"  => 0.139599
  "Memory evict bandwidth [MBytes/s]" => 324.848
  "Memory evict data volume [GBytes]" => 0.0389503
  "Memory bandwidth [MBytes/s]"       => 1489.11
  "Memory data volume [GBytes]"       => 0.178549
  "Operational intensity"             => 0.0

(in case it was not obvious, I have absolutely no idea what's going on with these codes)

@TomTheBear
Collaborator

That might be a valid measurement if the functions don't do any double-precision floating-point operations. For single precision, use MEM_SP. Integer operations cannot be counted with hardware performance counters on almost any platform.

@antoine-levitt
Author

No, this is a benchmark of double-precision matrix multiplication. It might be related to #29, so I'll wait for a fix there before investigating further.

@carstenbauer
Member

carstenbauer commented May 11, 2022

Why do you think it's related to #29? I would guess that it's wrong thread pinning (or no thread pinning at all). Example (for FLOPS_DP):

With LIKWID.pinthread(0) and LIKWID.PerfMon.init(0) (RIGHT):

➜  bauerc@n2lcn0166 LIKWID.jl git:(main) julia example.jl
OrderedCollections.OrderedDict{String, Float64} with 5 entries:
  "Runtime (RDTSC) [s]" => 0.634277
  "Runtime unhalted [s]" => 0.91243
  "Clock [MHz]" => 3524.29
  "CPI" => 0.520886
  "DP [MFLOP/s]" => 333.282
OrderedCollections.OrderedDict{String, Float64} with 6 entries:
  "ACTUAL_CPU_CLOCK" => 2.23547e9
  "MAX_CPU_CLOCK" => 1.55405e9
  "RETIRED_INSTRUCTIONS" => 4.24566e9
  "CPU_CLOCKS_UNHALTED" => 2.2115e9
  "RETIRED_SSE_AVX_FLOPS_ALL" => 2.11393e8
  "MERGE" => 0.0

With LIKWID.pinthread(1) and LIKWID.PerfMon.init(0) (WRONG):

➜  bauerc@n2lcn0166 LIKWID.jl git:(main) julia example.jl
OrderedCollections.OrderedDict{String, Float64} with 5 entries:
  "Runtime (RDTSC) [s]" => 0.628969
  "Runtime unhalted [s]" => 0.00110777
  "Clock [MHz]" => 2920.05
  "CPI" => NaN
  "DP [MFLOP/s]" => 0.0
OrderedCollections.OrderedDict{String, Float64} with 6 entries:
  "ACTUAL_CPU_CLOCK" => 2.71398e6
  "MAX_CPU_CLOCK" => 2.27706e6
  "RETIRED_INSTRUCTIONS" => 0.0
  "CPU_CLOCKS_UNHALTED" => 0.0
  "RETIRED_SSE_AVX_FLOPS_ALL" => 0.0
  "MERGE" => 0.0

Also note that I have set OPENBLAS_NUM_THREADS=1 here (so that OpenBLAS uses the available Julia thread(s) for linalg operations). If you set OPENBLAS_NUM_THREADS to something > 1 you must make sure to pin the OpenBLAS threads to specific cores and measure the performance counters on these cores as well.

(FYI: I'm on vacation this week and very unresponsive.)

@carstenbauer
Member

Just to be a bit more explicit about the multiple threads / cores case: you can, e.g., use LIKWID.PerfMon.init(0:9) to init monitoring on the first 10 cores and then later use LIKWID.PerfMon.get_metric_results(groupid, nth) to query the results for the nth of these cores.

@carstenbauer
Member

carstenbauer commented May 11, 2022

Hm, maybe I'm wrong / missing something about the OpenBLAS threads story:

# example.jl
using LIKWID
using LinearAlgebra
nblasthreads = BLAS.get_num_threads()
@show BLAS.get_num_threads()

A = rand(1000, 1000)
B = rand(1000, 1000)
C = zeros(1000, 1000)

LIKWID.pinthread(nblasthreads) # pin Julia thread to first core not occupied by BLAS threads
println("OMP threads on cores 0:$(nblasthreads-1), Julia thread on core $(nblasthreads)")
LIKWID.PerfMon.init(0:3)
groupid = LIKWID.PerfMon.add_event_set("FLOPS_DP")
LIKWID.PerfMon.setup_counters(groupid)

LIKWID.PerfMon.start_counters()
for _ in 1:10
    mul!(C, A, B)
end
LIKWID.PerfMon.stop_counters()

str = "DP [MFLOP/s]"
for i in 0:3
    mdict = LIKWID.PerfMon.get_metric_results(groupid, i)
    println(str, " (core $i): ", mdict[str])
end

LIKWID.PerfMon.finalize()

Output:

➜  bauerc@n2lcn0166 LIKWID.jl git:(main)  OMP_NUM_THREADS=3 OMP_PLACES=cores OMP_PROC_BIND=close julia --project example.jl 
BLAS.get_num_threads() = 3
OMP threads on cores 0:2, Julia thread on core 3
DP [MFLOP/s] (core 0): 0.0
DP [MFLOP/s] (core 1): 0.0
DP [MFLOP/s] (core 2): 0.0
DP [MFLOP/s] (core 3): 8173.052486987061

UPDATE:

Ok, must really be something with OpenBLAS. It works with Octavian.jl (which uses Julia threads):

# octavian.jl
using LIKWID
using Octavian
using ThreadPinning

A = rand(1000, 1000)
B = rand(1000, 1000)
C = zeros(1000, 1000)

LIKWID.pinthreads(0:3)
threadinfo(; color=false)
println()
LIKWID.PerfMon.init(0:3)
groupid = LIKWID.PerfMon.add_event_set("FLOPS_DP")
LIKWID.PerfMon.setup_counters(groupid)

LIKWID.PerfMon.start_counters()
for _ in 1:10
    matmul!(C, A, B)
end
LIKWID.PerfMon.stop_counters()

str = "DP [MFLOP/s]"
for i in 0:3
    mdict = LIKWID.PerfMon.get_metric_results(groupid, i)
    println(str, " (core $i): ", mdict[str])
end

LIKWID.PerfMon.finalize()

Output:

➜  bauerc@n2lcn0166 LIKWID.jl git:(main)  julia --project -t 4 octavian.jl 

| 0,1,2,3,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ |
| _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,
  _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_ |

# = Julia thread, | = Socket seperator

Julia threads: 4
├ Occupied CPU-threads: 4
└ Mapping (Thread => CPUID): 1 => 0, 2 => 1, 3 => 2, 4 => 3,


DP [MFLOP/s] (core 0): 366.4515143259297
DP [MFLOP/s] (core 1): 372.2449620791664
DP [MFLOP/s] (core 2): 372.24495182826337
DP [MFLOP/s] (core 3): 366.3363018460671

@antoine-levitt
Author

Good catch on the thread stuff, I'll take a closer look. This was for benchmarking non-OpenBLAS, single-threaded code.

@carstenbauer
Member

FYI: #31

carstenbauer added the documentation and discussion labels Jun 15, 2022