Support for `dims` kwarg #6
Comments
Hi Anton, sorry for the late reply - I've been thinking about the best parallelisation approach for n-dimensional reductions. The lazy way would be to simply run the same reduction steps over each index permutation bar the reduced `dims` one; instead, I am trying to implement a parallelisation approach where each reduction operation done by each thread runs over all index permutations bar the reduced dimension. Still, the 1D and n-dimensional cases would follow separate codepaths for maximum performance - the n-dimensional case simply needs more scaffolding / memory / indices than the 1D one, and no device-to-host copying (which is needed when no `dims` is specified). If you want to, you can forward the call for the 1D case to AK while I wrestle with shared memory. I will make the functions accept N-dimensional arrays which will be operated over linear indices, like Julia Base reductions without a specified `dims`.
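For illustration, a minimal sketch of the "one thread per output element, reducing along `dims`" idea - a naive KernelAbstractions example under made-up names, not the actual AcceleratedKernels implementation:

```julia
using KernelAbstractions

# Naive n-dimensional reduction: one thread per element of `dst`, each thread
# walks the reduced axis of `src`. `dst` must have size 1 along `dims`.
@kernel function naive_reduce_dims!(op, init, dst, src, ::Val{dims}) where {dims}
    i = @index(Global, Cartesian)                  # index into `dst`
    acc = init
    for k in axes(src, dims)
        idx = Base.setindex(Tuple(i), k, dims)     # replace the reduced coordinate
        @inbounds acc = op(acc, src[CartesianIndex(idx)])
    end
    @inbounds dst[i] = acc
end

# Usage sketch:
#   dst = similar(src, ntuple(d -> d == dims ? 1 : size(src, d), ndims(src)))
#   naive_reduce_dims!(get_backend(src), 256)(+, zero(eltype(src)), dst, src,
#                                             Val(dims), ndrange=size(dst))
```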
Thanks for working on this! If you want to use runtime values, you can wrap them in `Val`:

```julia
@kernel function ker(x, ::Val{runtime_length}) where {runtime_length}
    m = @private Int (runtime_length,)
    ...
end

ker(ROCBackend())(x, Val(size(x, 2)))
```

However, every time you pass a different value it will recompile the kernel, so it is probably not a good option if the values change a lot.
Hi, I think I cracked it! I wrote two parallelisation approaches, chosen depending on whether the reduced dimension has fewer elements than the other dimensions combined (e.g. `dims=1` vs `dims=2` for the 10×100_000 array below). On my Mac M3 I get the following benchmark results:

```julia
# N-dimensional reduction benchmark against Base
using Metal
using KernelAbstractions
import AcceleratedKernels as AK
using BenchmarkTools

using Random
Random.seed!(0)

function sum_base(s; dims)
    d = reduce(+, s; init=zero(eltype(s)), dims=dims)
    KernelAbstractions.synchronize(get_backend(s))
    d
end

function sum_ak(s; dims)
    d = AK.reduce(+, s; init=zero(eltype(s)), dims=dims)
    KernelAbstractions.synchronize(get_backend(s))
    d
end

# Make array with highly unequal per-axis sizes
s = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))

# Correctness
@assert sum_base(s, dims=1) == sum_ak(s, dims=1)
@assert sum_base(s, dims=2) == sum_ak(s, dims=2)

# Benchmarks
println("\nReduction over small axis - AK vs Base")
display(@benchmark sum_ak($s, dims=1))
display(@benchmark sum_base($s, dims=1))

println("\nReduction over long axis - AK vs Base")
display(@benchmark sum_ak($s, dims=2))
display(@benchmark sum_base($s, dims=2))
```

AK gets slightly better if we tune the launch parameters. I'd be curious to see benchmarks on a ROCm machine if you have one on hand.
Great work! I just tested it on an AMD GPU with your script; results below.

```
Reduction over small axis - AK vs Base
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 47.157 μs … 864.506 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 56.875 μs ┊ GC (median): 0.00%
Time (mean ± σ): 71.476 μs ± 31.530 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▄▆██▆▄▃▃▂▁ ▁▁▃▄▄▄▄▃▂▁▁ ▂
█████████████▇▇▇▇▅▅▅▅▅▆▄▅▄▅▅▅▃▅▆▅▃▅▆▇▇████████████▇▆▆▆▅▃▆▄▄▄ █
47.2 μs Histogram: log(frequency) by time 154 μs <
Memory estimate: 4.12 KiB, allocs estimate: 145.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 99.655 μs … 677.307 ms ┊ GC (min … max): 0.00% … 1.66%
Time (median): 111.988 μs ┊ GC (median): 0.00%
Time (mean ± σ): 193.832 μs ± 6.772 ms ┊ GC (mean ± σ): 0.58% ± 0.02%
▃▅██▇▆▄▂ ▁▂▃▄▄▄▃▂▂▂▂▁▁ ▂
▄▇██████████▆▆▆▆▇▇▆▇▅▆▅▆▅▅▄▄▇▄▄▄▂▄▃▅▅▇█████████████▇▇▇▇▆▅▅▄▅▅ █
99.7 μs Histogram: log(frequency) by time 206 μs <
Memory estimate: 4.06 KiB, allocs estimate: 148.
Reduction over long axis - AK vs Base
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 87.613 μs … 474.281 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 90.619 μs ┊ GC (median): 0.00%
Time (mean ± σ): 91.768 μs ± 7.733 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▂
▂▂▃▅███▆▆▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
87.6 μs Histogram: frequency by time 118 μs <
Memory estimate: 4.03 KiB, allocs estimate: 139.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 76.562 μs … 469.373 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 79.528 μs ┊ GC (median): 0.00%
Time (mean ± σ): 80.609 μs ± 7.350 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▅▁
▂▂▄█████▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
76.6 μs Histogram: frequency by time 106 μs <
Memory estimate: 8.67 KiB, allocs estimate: 221.
```
We can use the occupancy API for picking the optimal `block_size`.
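As a point of reference, here is roughly what the occupancy-API approach looks like with CUDA.jl (purely illustrative - `saxpy!` and the array sizes are made up, this is not tied to AK's API, and other backends expose similar queries):

```julia
using CUDA

# Simple element-wise kernel used only to demonstrate the occupancy query.
function saxpy!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 1_000_000)
y = CUDA.rand(Float32, 1_000_000)

# Compile without launching, ask the occupancy API for a good configuration,
# then launch with those parameters.
kernel = @cuda launch=false saxpy!(y, 2f0, x)
config = launch_configuration(kernel.fun)
threads = min(length(y), config.threads)
blocks = cld(length(y), threads)
kernel(y, 2f0, x; threads, blocks)
```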
That's great news, thank you for testing it on AMD. In terms of optimal kernel launches, I still like the generality and simplicity in something like:

```julia
popt = AK.@tune reduce(f, src, init=init, block_size=$block_size) block_size=(64, 128, 256, 512, 1024)
```

and

```julia
# Not tied to any specific backend, algorithm or API
popt = AK.@tune begin
    reduce(f, src, init=init,
           block_size=$block_size,
           switch_below=$switch_below)
    block_size=(64, 128, 256, 512, 1024)
    switch_below=(1, 10, 100, 1000, 10000)
end
```

which could try all permutations of the given parameter values.
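The `@tune` macro above is only a proposal; as a rough sketch of what it could expand to, a plain grid search over those parameters might look like the following (the helper name is hypothetical, and it assumes `AK.reduce` accepts `block_size` and `switch_below` as in the snippet above):

```julia
using BenchmarkTools
import AcceleratedKernels as AK

# Brute-force grid search: benchmark every (block_size, switch_below) pair
# and keep the fastest one.
function tune_reduce(f, src; init)
    best = (time=Inf, block_size=256, switch_below=1)
    for block_size in (64, 128, 256, 512, 1024),
        switch_below in (1, 10, 100, 1000, 10000)

        t = @belapsed AK.reduce($f, $src; init=$init,
                                block_size=$block_size,
                                switch_below=$switch_below)
        if t < best.time
            best = (time=t, block_size=block_size, switch_below=switch_below)
        end
    end
    return best
end

# e.g. popt = tune_reduce(+, src; init=zero(eltype(src)))
```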
N-dimensional `reduce` is now in; keeping this issue open until we add an N-dimensional `mapreduce` as well.
Hello, I wonder if it would be possible to have a `foreachaxes` thing that would iterate over a specific axis, similar to

```julia
for i in axes(A, 3)
    # ...
end
```

something like

```julia
foreachaxes(A, 3) do i
    # ...
end
```

Maybe this would just mean making something like:

```julia
function _foreachaxes_gpu(
    f,
    itr,
    n::Int,
    backend::GPU;
    block_size::Int=256,
)
    # GPU implementation
    @argcheck block_size > 0
    _foreachindex_global!(backend, block_size)(f, axes(itr, n), ndrange=size(itr, n))
    nothing
end
```

and the other related ones too, or will there be an issue with the closure `f`?
Hi @yolhan83, it was quite simple to add and it is available now. However, do keep in mind that on GPUs it is better to have more threads doing less work each (ideally uniformly) than fewer threads doing more work. So if you can process your elements independently, even if you're using a multidimensional array, it's better to use `foreachindex`.
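For example, a minimal element-wise pattern along those lines (the array and the doubling operation are just placeholders; the two-argument `AK.foreachindex(itr, backend)` form follows the usage shown later in this thread):

```julia
using Metal
using KernelAbstractions: get_backend
import AcceleratedKernels as AK

A = MtlArray(rand(Float32, 10, 1000))

# One thread per element, even though `A` is 2D: iterate its linear indices.
AK.foreachindex(1:length(A), get_backend(A)) do i
    @inbounds A[i] = 2 * A[i]
end
```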
Yeah, the main purpose I see for adding this was for when most axes are set to some constant and you want to loop over the last ones without making a view - but yes, better not to try to do work on the resulting (ndim-1)-dimensional array inside this loop, for sure. Thank you.
Maybe, if it leads to people doing too much work per thread, this could be changed so that when looping over a multidimensional array you get one thread per index, but with an easy way to retrieve the dimensions of the original array - a CartesianIndex way of thinking. I don't know if I'm clear enough here, sorry 😅. Maybe a foreachCartesianIndex? The one you added could still stay, since it serves a different purpose.
That makes perfect sense.
I think you can use CartesianIndex inside a GPU kernel; you can also recover the per-axis indices manually from a linear index, e.g.:

```julia
# GPU version with two nested `for` loops linearised
AK.foreachindex(1:nrows * ncols, backend) do i
    irow, icol = divrem(i - 1, ncols)
    # ... do work with `irow` and `icol`
end
```
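Along the same lines, here is a sketch of the "CartesianIndex way of thinking" mentioned above - capturing a `CartesianIndices` object and indexing it with the linear index (`A`, `backend`, and the loop body are placeholders):

```julia
# `CartesianIndices` is an isbits object, so it can be captured by the GPU
# closure and indexed with the linear thread index.
ci = CartesianIndices(A)
AK.foreachindex(1:length(A), backend) do i
    I = ci[i]             # CartesianIndex with one coordinate per axis of `A`
    i1, i2 = Tuple(I)     # e.g. unpack for a 2D array
    # ... do work with `i1` and `i2`
end
```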
Base methods, such as `accumulate!` and `mapreduce`, have support for the `dims` kwarg. Is there a plan for adding such support here?
We can then replace other kernels from AMDGPU/CUDA with the AK implementation.
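For reference, a minimal illustration of the Base behaviour being requested (array sizes are arbitrary):

```julia
A = rand(Float32, 10, 1000)

# Reduce along a chosen dimension; the result keeps a size-1 axis there.
mapreduce(abs2, +, A; dims=2)    # 10×1 matrix: one sum of squares per row

# In-place accumulation along a dimension.
B = similar(A)
accumulate!(+, B, A; dims=1)     # running sums down each column
```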