Tutorial: put a function on the roofline model #28
Hey Antoine, That's a great idea and something I thought about as well at some point. Would be really nice to have. LIKWID itself has an "Empirical Roofline Model" tutorial and I think we should add a similar tutorial to the LIKWID.jl documentation.
Having said that, it's still worth trying! I'll happily support any efforts in this direction, so feel free to draft a PR. Unfortunately, I probably won't have much free bandwidth for playing around with this any time soon. In any case, I'd start with the manual tutorial first and then try to make it automatic. Best, (@ranocha Based on our previous discussions, e.g. https://gist.github.com/ranocha/0ad5716e77e55b2c61cbde10ad4f210c, I think you're also interested in this.)
Yeah, it would make sense as a tutorial, since the roofline thing is not really meant as a black box but more as an explanatory thing (at least for me) so you want to see what goes into it.
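For reference, what goes into it is small: the attainable performance at a given operational intensity is just the minimum of a compute ceiling and a memory ceiling. A minimal sketch (`P_peak` and `B_peak` are placeholders for the measured peak flop rate and bandwidth):

```julia
# Roofline model: attainable performance [GFLOP/s] at operational intensity
# I [FLOP/Byte], given peak compute P_peak [GFLOP/s] and peak memory
# bandwidth B_peak [GByte/s].
roofline_bound(I; P_peak, B_peak) = min(P_peak, I * B_peak)

# e.g. a machine with 50 GFLOP/s and 20 GByte/s is memory-bound below
# I = 2.5 FLOP/Byte:
roofline_bound(1.0; P_peak = 50.0, B_peak = 20.0)   # 20.0
roofline_bound(4.0; P_peak = 50.0, B_peak = 20.0)   # 50.0
```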
What's wrong with
single thread would be a good start.
OK, I'll try to come up with something and you can correct me when I screw up then!
Oh, I just noticed that I have to start julia in a separate process under likwid and then kill julia to get the output. That's pretty annoying (e.g., I can't put it in a notebook); is there a way around it?
Yes, see https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ (although this tutorial may not be in the best shape). (We should probably also introduce some higher-level API to automate this "measure perf counters within a julia session" workflow. It hasn't been my focus so far.)
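As a sketch of what such a higher-level helper could look like (hypothetical, not part of LIKWID.jl; it just chains the `PerfMon` calls that show up further down in this thread, and the group and metric names follow the snippets elsewhere in the discussion):

```julia
using LIKWID

# Hypothetical convenience wrapper around the PerfMon workflow used later in
# this thread: measures `f()` under one performance group on core 0 and
# returns the metric dictionary.
function measure_group(f, group::AbstractString)
    LIKWID.pinthread(0)                     # pin the (single) Julia thread to core 0
    LIKWID.PerfMon.init(0:0)                # set up counters for core 0 only
    groupid = LIKWID.PerfMon.add_event_set(group)
    LIKWID.PerfMon.setup_counters(groupid)
    LIKWID.PerfMon.start_counters()
    f()                                     # the code to measure
    LIKWID.PerfMon.stop_counters()
    metrics = LIKWID.PerfMon.get_metric_results(groupid, 0)
    LIKWID.PerfMon.finalize()
    return metrics
end

# usage (hypothetical):
# metrics = measure_group("MEM_DP") do
#     my_kernel()                           # code you want to place on the roofline
# end
# metrics["DP [MFLOP/s]"], metrics["Operational intensity"]
```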
Thanks for pinging me, @carstenbauer. Yes, I also think it's a nice idea. I just followed the LIKWID tutorial to get an empirical roofline model while working on Trixi.jl and our paper on performance stuff. Here is some code I used for my experiments. Feel free to use it in a PR for the new tutorial (maybe adding me as co-author on GitHub if you like, by adding the corresponding `Co-authored-by:` line).

```julia
## gather data for the empirical roofline model
# measure optimistic peakflops (AVX2 FMA or AVX512 FMA if available)
L1_cache_size = LIKWID.get_cpu_topology().cacheLevels[1].size ÷ 1024 # in kB
cpuinfo = LIKWID.get_cpu_info()
if occursin("AVX512", cpuinfo.features)
likwid_bench_kernel = "peakflops_avx512_fma"
elseif occursin("AVX2", cpuinfo.features)
likwid_bench_kernel = "peakflops_avx_fma"
else
likwid_bench_kernel = "peakflops_sse_fma"
end
max_flops_string = read(`likwid-bench -t $likwid_bench_kernel -W N:$(L1_cache_size)kB:1`, String) # working set of L1 size, one thread
max_flops = parse(Float64, match(r"(MFlops/s:\s+)(\d+\.\d+)", max_flops_string).captures[2]) / 1024 # MFlops/s -> GFlops/s
# measure optimistic memory bandwidth using reads
if occursin("AVX512", cpuinfo.features)
likwid_bench_kernel = "load_avx512"
elseif occursin("AVX2", cpuinfo.features)
likwid_bench_kernel = "load_avx"
else
likwid_bench_kernel = "load_sse"
end
max_bandwidth_string = read(`likwid-bench -t $likwid_bench_kernel -W N:2GB:1`, String) # 2 GB working set (much larger than the caches), one thread
max_bandwidth = parse(Float64, match(r"(MByte/s:\s+)(\d+\.\d+)", max_bandwidth_string).captures[2]) # in MByte/s
## gather data for volume terms implemented in Trixi.jl
measured_string = read(`likwid-perfctr -C 0 -g MEM_DP -m $(Base.julia_cmd()) --check-bounds=no --threads=1 $(joinpath(@__DIR__, "measure_volume_terms.jl"))`, String)
# You can combine different measurements by setting appropriate region names, e.g.,
# NAME_OF_THE_REGION_YOU_USED
offset = findfirst("Region NAME_OF_THE_REGION_YOU_USED", measured_string) |> last
m = match(r"(DP \[MFLOP/s\]\s+\|\s+)(\d+\.\d+)", measured_string, offset)
flops_NAME_OF_THE_REGION_YOU_USED = parse(Float64, m.captures[2]) / 1024
m = match(r"(Operational intensity\s+\|\s+)(\d+\.\d+)", measured_string, offset)
intensity_NAME_OF_THE_REGION_YOU_USED = parse(Float64, m.captures[2])
@info "NAME_OF_THE_REGION_YOU_USED" intensity_NAME_OF_THE_REGION_YOU_USED flops_NAME_OF_THE_REGION_YOU_USED
```

The file `measure_volume_terms.jl` uses something like

```julia
Marker.init()
# compile and cool down
compute_a_lot_of_stuff()
sleep(1.0)
# measure
@region "NAME_OF_THE_REGION_YOU_USED" begin
compute_a_lot_of_stuff()
end
Marker.close()
```

in a function called once.
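To actually put the region on the roofline with these numbers, something along the following lines should work (a sketch, not from the original gist; it assumes the variables defined in the snippets above and uses Plots.jl, which is just my choice of plotting package here):

```julia
using Plots

# Convert the peak bandwidth to the same (binary-prefix) GByte/s convention
# that the /1024 above uses for the flop rates.
max_bandwidth_gb = max_bandwidth / 1024

# Roofline: attainable GFLOP/s as a function of operational intensity I [FLOP/Byte].
intensities = 10 .^ range(-2, 2; length = 200)
roof = min.(max_flops, intensities .* max_bandwidth_gb)

plot(intensities, roof;
     xscale = :log10, yscale = :log10,
     xlabel = "operational intensity [FLOP/Byte]",
     ylabel = "performance [GFLOP/s]",
     label = "roofline", legend = :bottomright)

# Place the measured region on the plot.
scatter!([intensity_NAME_OF_THE_REGION_YOU_USED],
         [flops_NAME_OF_THE_REGION_YOU_USED];
         label = "NAME_OF_THE_REGION_YOU_USED")
```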
Nothing's fundamentally wrong with it; it's a fine starting point / an estimate. But it should be clear that it doesn't give you the true maximal achievable performance (running such a benchmark takes some care: which kernel to use? which parameters, e.g. problem size, to use? etc.). That's why LIKWID has `likwid-bench` with its dedicated microbenchmark kernels. (Note that we also use pure FMA / WMMA kernels in GPUInspector.jl to measure the peakflops of NVIDIA GPUs.)
OK, https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ looks good. How do I get the operational intensity? It doesn't appear in the table at the end. |
The operational intensity is only provided by the `MEM_DP` group.
Was about to write the same but Thomas beat me to it :)
Does this mean that I should do
You can drop the other group and just use `MEM_DP`.
Ooh, OK, not sure what happened but it works with only MEM_DP, thanks.
As far as I can see, there is no group switch in the example code at https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/; that's why only the first group gets measured.
Good catch, I will extend the example in this direction when I find time for it.
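In the meantime, a rough sketch of how a group switch could look with the marker API when several groups are passed on the command line (hypothetical: it assumes LIKWID.jl exposes the marker-API group switch as `Marker.nextgroup`, so check the actual API before relying on this):

```julia
using LIKWID

# Run e.g. with: likwid-perfctr -C 0 -g FLOPS_DP -g MEM_DP -m julia script.jl
Marker.init()
for _ in 1:2                        # one pass per group given via -g
    @region "mykernel" begin
        compute_a_lot_of_stuff()    # placeholder for the code to measure
    end
    Marker.nextgroup()              # assumed wrapper for LIKWID_MARKER_SWITCH
end
Marker.close()
```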
@antoine-levitt As an alternative to the approach in https://juliaperf.github.io/LIKWID.jl/dev/examples/perfmon/ (environment variables) you may take a look at https://github.com/JuliaPerf/LIKWID.jl/blob/main/examples/perfmon/perfmon.jl (using the `PerfMon` module directly).
That indeed seems nicer. But if I do that with MEM_DP I get
(in case it was not obvious, I have absolutely no idea what's going on with these codes)
That might be a valid measurement if the functions don't do any double precision floating-point operations. For single precision, use the corresponding single-precision group.
No this is a benchmark of double precision matrix multiplication. It might be related to #29 so I'll wait for a fix there before investigating further |
Why do you think it's related to #29? I would guess that it's wrong thread pinning (or no thread pinning at all); compare the measurements with and without pinning. (FYI: I'm on vacation this week and very unresponsive.)
Just to be a bit more explicit about the multiple threads / cores case: you can, e.g., use `LIKWID.pinthreads` to pin the Julia threads to specific cores and then query those cores via `PerfMon` (see the examples below).
Hm, maybe I'm wrong / missing something about the OpenBLAS threads story:

```julia
# example.jl
using LIKWID
using LinearAlgebra
nblasthreads = BLAS.get_num_threads()
@show BLAS.get_num_threads()
A = rand(1000, 1000)
B = rand(1000, 1000)
C = zeros(1000, 1000)
LIKWID.pinthread(nblasthreads) # pin Julia thread to first core not occupied by BLAS threads
println("OMP threads on cores 0:$(nblasthreads-1), Julia thread on core $(nblasthreads)")
LIKWID.PerfMon.init(0:3)
groupid = LIKWID.PerfMon.add_event_set("FLOPS_DP")
LIKWID.PerfMon.setup_counters(groupid)
LIKWID.PerfMon.start_counters()
for _ in 1:10
mul!(C, A, B)
end
LIKWID.PerfMon.stop_counters()
str = "DP [MFLOP/s]"
for i in 0:3
mdict = LIKWID.PerfMon.get_metric_results(groupid, i)
println(str, " (core $i): ", mdict[str])
end
LIKWID.PerfMon.finalize()
```

Output:
UPDATE: Ok, must really be something with OpenBLAS. It works with Octavian.jl (which uses Julia threads):

```julia
# octavian.jl
using LIKWID
using Octavian
using ThreadPinning
A = rand(1000, 1000)
B = rand(1000, 1000)
C = zeros(1000, 1000)
LIKWID.pinthreads(0:3)
threadinfo(; color=false)
println()
LIKWID.PerfMon.init(0:3)
groupid = LIKWID.PerfMon.add_event_set("FLOPS_DP")
LIKWID.PerfMon.setup_counters(groupid)
LIKWID.PerfMon.start_counters()
for _ in 1:10
matmul!(C, A, B)
end
LIKWID.PerfMon.stop_counters()
str = "DP [MFLOP/s]"
for i in 0:3
mdict = LIKWID.PerfMon.get_metric_results(groupid, i)
println(str, " (core $i): ", mdict[str])
end
LIKWID.PerfMon.finalize()
```

Output:
Good catch on the thread stuff, I'll take a closer look - this was for benchmarking non-OpenBLAS, single-threaded code.
FYI: #31 |
Great package! As far as I understand, it provides the information needed to locate an application on the roofline model, but finding the info without screwing up feels a bit scary to me. Could an example be put in a tutorial or in the library, so that I could do something like

```julia
roofline(f)
```

and it'd plot the roofline model and locate `f()` on it? (This is for teaching basic performance notions.)