CPU `__thread_run` could loop over CartesianIndices? #448
Comments
Might be interesting, I haven't looked into the execution there too closely. Do you have a benchmark?
Just some Stencils.jl profiles on another machine. But I can write up a PR and we can benchmark it.
If you could contribute it here, that would be nice: https://github.com/JuliaGPU/KernelAbstractions.jl/tree/main/benchmark
Seems it's because my workgroup size was 4; I guess you're expecting much larger workgroups on the CPU? I never totally got my head around what workgroup size means on the CPU when the work is divvied up before the workgroup anyway. I was guessing it didn't make much difference what the workgroup size was, but this is a case where it does (very small workloads).
I guess it's kind of academic if you can get around it with large workgroups. But comparing workgroup sizes of 1 and 64:

```julia
using KernelAbstractions

kernel1! = copy_kernel!(CPU(), 1)
kernel64! = copy_kernel!(CPU(), 64)
A = rand(16, 16, 16, 16)
B = rand(16, 16, 16, 16)
```

Benchmarks:

```julia
julia> @btime kernel1!(A, B; ndrange=size(A))
  1.799 ms (99 allocations: 6.80 KiB)

julia> @btime kernel64!(A, B; ndrange=size(A))
  439.169 μs (99 allocations: 6.80 KiB)
```

And you can see that the difference in the profile for 1 vs 64 (left vs right) is all integer division from the linear-to-cartesian conversion:

```julia
using ProfileView

@profview for i in 1:100 kernel1!(A, B; ndrange=size(A)) end
@profview for i in 1:100 kernel64!(A, B; ndrange=size(A)) end
```
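(The thread doesn't show the `copy_kernel!` definition used in the benchmark above; a minimal element-wise copy kernel along those lines, given here only as an assumption, would be:)

```julia
using KernelAbstractions

# Assumed definition, not shown in the thread: a simple element-wise copy kernel.
@kernel function copy_kernel!(A, @Const(B))
    I = @index(Global)          # global index of this work item
    @inbounds A[I] = B[I]
end
```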
Yeah, for the CPU I often use a workgroupsize of 1024.
I've been wondering if the CPU workgroup size should mean "how much we unroll" |
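(As a rough illustration of that idea, and not anything from KernelAbstractions itself: on the CPU the workgroup size could be treated as the width of an inner block that the compiler unrolls or vectorises, along the lines of the hypothetical sketch below.)

```julia
# Hypothetical sketch: interpret the CPU "workgroup size" as an inner block
# width that gets unrolled/SIMD-vectorised. Purely illustrative, not KA internals.
function blocked_copy!(A::AbstractVector, B::AbstractVector, blocksize::Int)
    n = length(A)
    for base in 1:blocksize:n
        stop = min(base + blocksize - 1, n)
        @inbounds @simd for i in base:stop   # inner "workgroup" loop
            A[i] = B[i]
        end
    end
    return A
end

# Usage: blocked_copy!(zeros(1024), rand(1024), 64)
```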
I noticed in Stencils.jl that when I'm using a fast stencil (e.g. a 3x3 window summing over a `Matrix{Bool}`), the indexing in `__thread_run` takes longer than actually reading and summing the stencil! It seems to be because the conversion from linear back to cartesian indices is pretty slow: I'm getting 4 ns for N=2, 7 ns for N=3 and 11 ns for N=4 on my laptop, so there is also a penalty for adding dimensions.

Could we switch the loop to iterating over CartesianIndices directly?

I guess it will make dividing up the array a little messier, and it might be slower for really large workloads where an even split of tasks is more important than 7 ns per operation. It could have a keyword to choose between the behaviours.
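(To make the comparison concrete, here is a minimal standalone sketch of the two iteration strategies, not the actual `__thread_run` code: looping over linear indices and converting each one back to a cartesian index, versus iterating `CartesianIndices` directly.)

```julia
using BenchmarkTools

A = rand(Bool, 200, 200)

# Current-style loop: iterate linear indices and convert each one back to a
# cartesian index; the conversion does an integer division per dimension.
function linear_then_convert(A)
    cart = CartesianIndices(A)
    s = 0
    for i in 1:length(A)
        I = cart[i]           # linear -> cartesian conversion
        s += A[I]
    end
    return s
end

# Proposed-style loop: iterate CartesianIndices directly, carrying the
# cartesian state along, so no per-element division is needed.
function cartesian_direct(A)
    s = 0
    for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end

@btime linear_then_convert($A)
@btime cartesian_direct($A)
```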