CPU `__thread_run` could loop over CartesianIndices? #448
Comments
Might be interesting, I haven't looked into the execution there too closely. Do you have a benchmark?
Just some Stencils.jl profiles on another machine. But I can write up a PR and we can benchmark it.
If you could contribute it here, that would be nice: https://github.com/JuliaGPU/KernelAbstractions.jl/tree/main/benchmark
Seems it's because my workgroup size was 4; I guess you're expecting much larger workgroups on the CPU? I never totally got my head around what workgroup size means on the CPU when the work is divvied up before the workgroup anyway. I was guessing it didn't make much difference what the workgroup size was, but this is a case where it does (very small workloads).
I guess it's kind of academic if you can get around it with large workgroups. But comparing workgroup sizes of 1 and 64:

```julia
using KernelAbstractions

kernel1! = copy_kernel!(CPU(), 1)
kernel64! = copy_kernel!(CPU(), 64)
A = rand(16, 16, 16, 16)
B = rand(16, 16, 16, 16)
```

Benchmarks:

```julia
julia> @btime kernel1!(A, B; ndrange=size(A))
  1.799 ms (99 allocations: 6.80 KiB)

julia> @btime kernel64!(A, B; ndrange=size(A))
  439.169 μs (99 allocations: 6.80 KiB)
```

And you can see that the difference in the profile for 1 vs 64 (left vs right) is all integer division from the linear-to-cartesian conversion:

```julia
using ProfileView

@profview for i in 1:100 kernel1!(A, B; ndrange=size(A)) end
@profview for i in 1:100 kernel64!(A, B; ndrange=size(A)) end
```
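(The thread doesn't show the `copy_kernel!` definition used in the benchmark above; a minimal element-wise copy kernel along those lines, given here only as an assumption, would be:)

```julia
using KernelAbstractions

# Assumed definition, not shown in the thread: a simple element-wise copy kernel.
@kernel function copy_kernel!(A, @Const(B))
    I = @index(Global)          # global index of this work item
    @inbounds A[I] = B[I]
end
```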
Yeah, for the CPU I often use a workgroupsize of 1024.
I've been wondering if the CPU workgroup size should mean "how much we unroll" |
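(As a rough illustration of that idea, and not anything from KernelAbstractions itself: on the CPU the workgroup size could be treated as the width of an inner block that the compiler unrolls or vectorises, along the lines of the hypothetical sketch below.)

```julia
# Hypothetical sketch: interpret the CPU "workgroup size" as an inner block
# width that gets unrolled/SIMD-vectorised. Purely illustrative, not KA internals.
function blocked_copy!(A::AbstractVector, B::AbstractVector, blocksize::Int)
    n = length(A)
    for base in 1:blocksize:n
        stop = min(base + blocksize - 1, n)
        @inbounds @simd for i in base:stop   # inner "workgroup" loop
            A[i] = B[i]
        end
    end
    return A
end

# Usage: blocked_copy!(zeros(1024), rand(1024), 64)
```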
I noticed in Stencils.jl that when I'm using a fast stencil (e.g. a 3x3 window summing over a `Matrix{Bool}`), the indexing in `__thread_run` takes longer than actually reading and summing the stencil! It seems to be because the conversion from linear back to cartesian indices is pretty slow: I'm getting 4 ns for N=2, 7 ns for N=3 and 11 ns for N=4 on my laptop, so there is also a penalty for adding dimensions.

Could we switch the loop to iterating over CartesianIndices directly?

I guess it will make dividing up the array a little messier, and it might be slower for really large workloads where an even split of tasks is more important than 7 ns per operation. It could have a keyword to choose between the behaviours.
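(To make the comparison concrete, here is a minimal standalone sketch of the two iteration strategies, not the actual `__thread_run` code: looping over linear indices and converting each one back to a cartesian index, versus iterating `CartesianIndices` directly.)

```julia
using BenchmarkTools

A = rand(Bool, 200, 200)

# Current-style loop: iterate linear indices and convert each one back to a
# cartesian index; the conversion does an integer division per dimension.
function linear_then_convert(A)
    cart = CartesianIndices(A)
    s = 0
    for i in 1:length(A)
        I = cart[i]           # linear -> cartesian conversion
        s += A[I]
    end
    return s
end

# Proposed-style loop: iterate CartesianIndices directly, carrying the
# cartesian state along, so no per-element division is needed.
function cartesian_direct(A)
    s = 0
    for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end

@btime linear_then_convert($A)
@btime cartesian_direct($A)
```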