
Support for dims kwarg #6

Open
pxl-th opened this issue Oct 16, 2024 · 12 comments
Labels
enhancement New feature or request

Comments

@pxl-th
Member

pxl-th commented Oct 16, 2024

Base methods such as accumulate! and mapreduce support the dims kwarg.
Is there a plan to add such support here?
We could then replace other kernels in AMDGPU/CUDA with the AK implementation.

@anicusan
Member

anicusan commented Oct 18, 2024

Hi Anton, sorry for the late reply - I've been thinking about the best parallelisation approach for n-dimensional reductions; the lazy way would be to simply run the same reduction steps over each index permutation bar the dims one over which we're running the reduction - while using the same amount of shared memory as the 1D case, its synchronisation overhead increases with the number of elements. Not great, and it would not be performant for big arrays where the reduced dimension is small.

I am trying to implement a parallelisation approach where each reduction operation done by each thread runs over all index permutations bar dims (at the expense of more shared memory used), but this requires shared memory whose size is dependent on the specific dimension-wise sizes of the input array - which are runtime values, and hence cannot be used to define @private Int (runtime_length,). I need to think a bit more about shared memory sizes...

Still, the 1D and n-dimensional cases would follow separate codepaths for maximum performance - the n-dimensional case simply needs more scaffolding / memory / indices than the 1D one, and no device-to-host copying (which is needed if dims is not specified).

If you want to, you can forward the call for the 1D case to AK while I wrestle with shared memory. I will make the functions accept N-dimensional arrays, which will be reduced over linear indices, like Julia Base reductions without a specified dims.
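For the forwarding, a minimal sketch of what I mean (the wrapper name and the generic fallback are placeholders, not AMDGPU's actual internals):

import AcceleratedKernels as AK

# Hypothetical wrapper: send the dims-less case to AK, keep the existing
# backend kernels for the dims case until AK supports it
function gpu_reduce(op, src::AbstractArray; init, dims=nothing)
    if dims === nothing
        # AK reduces over linear indices, like Base reductions without `dims`
        return AK.reduce(op, src; init)
    else
        # fall back to the current AMDGPU/CUDA kernels for now
        return Base.reduce(op, src; init, dims)
    end
end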

@anicusan anicusan added the enhancement New feature or request label Oct 18, 2024
@pxl-th
Member Author

pxl-th commented Oct 18, 2024

Thanks for working on this!

If you want to use runtime values, you can wrap them in Val before passing to the kernel:

@kernel function ker(x, ::Val{runtime_length}) where {runtime_length}
    # runtime_length is a compile-time constant inside the kernel,
    # so it can size private/shared memory
    m = @private Int (runtime_length,)
    ...
end

ker(ROCBackend())(x, Val(size(x, 2)))

However, every time you pass a different value the kernel is recompiled, so this is probably not a good fit if the sizes change a lot.

@anicusan
Member

anicusan commented Nov 4, 2024

Hi, I think I cracked it! I wrote two parallelisation approaches depending on whether the reduced dimension has fewer elements than the other dimensions combined (e.g. reduce(+, rand(3, 1000), dims=1) has only 3 elements in the reduced axis).
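Roughly, the dispatch heuristic is (a schematic sketch - reduce_small_axis and reduce_large_axis are illustrative names, not the actual internals):

function reduce_nd(op, src::AbstractArray; init, dims::Int, block_size::Int=256)
    reduced_len = size(src, dims)
    other_len = length(src) ÷ reduced_len
    if reduced_len < other_len
        # Few elements along `dims`: one thread per output element, each
        # looping sequentially over the short reduced axis
        return reduce_small_axis(op, src; init, dims, block_size)
    else
        # Long reduced axis: block-wise shared-memory tree reduction along
        # `dims`, reusing the 1D-style reduction per output slice
        return reduce_large_axis(op, src; init, dims, block_size)
    end
end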

On my Mac M3 I get the following benchmark results:

# N-dimensional reduction benchmark against Base
using Metal
using KernelAbstractions
import AcceleratedKernels as AK

using BenchmarkTools
using Random
Random.seed!(0)


function sum_base(s; dims)
    d = reduce(+, s; init=zero(eltype(s)), dims=dims)
    KernelAbstractions.synchronize(get_backend(s))
    d
end


function sum_ak(s; dims)
    d = AK.reduce(+, s; init=zero(eltype(s)), dims=dims)
    KernelAbstractions.synchronize(get_backend(s))
    d
end


# Make array with highly unequal per-axis sizes
s = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))

# Correctness
@assert sum_base(s, dims=1) == sum_ak(s, dims=1)
@assert sum_base(s, dims=2) == sum_ak(s, dims=2)

# Benchmarks
println("\nReduction over small axis - AK vs Base")
display(@benchmark sum_ak($s, dims=1))
display(@benchmark sum_base($s, dims=1))

println("\nReduction over long axis - AK vs Base")
display(@benchmark sum_ak($s, dims=2))
display(@benchmark sum_base($s, dims=2))

Reduction over small axis - AK vs Base
BenchmarkTools.Trial: 5711 samples with 1 evaluation.
 Range (min … max):  237.292 μs … 83.431 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     444.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   875.866 μs ±  3.540 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▂                                                          ▁
  ███▆▅▅▃▃▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▁▁▄▁▁▅▃▁▃▃▄▄▄▃ █
  237 μs        Histogram: log(frequency) by time      23.1 ms <

 Memory estimate: 6.87 KiB, allocs estimate: 230.
BenchmarkTools.Trial: 5296 samples with 1 evaluation.
 Range (min … max):  445.750 μs … 39.186 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     877.438 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   947.043 μs ±  1.070 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▂▃▃▆██▆▆▂                                   
  ▁▁▂▂▂▂▃▄▃▂▂▁▂▂▂▂▂▆█████████▆▅▃▂▂▂▂▂▂▂▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁ ▃
  446 μs          Histogram: frequency by time         1.55 ms <

 Memory estimate: 6.01 KiB, allocs estimate: 209.

Reduction over long axis - AK vs Base
BenchmarkTools.Trial: 4739 samples with 1 evaluation.
 Range (min … max):  493.708 μs … 80.576 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     822.292 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.049 ms ±  2.197 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▂                                                           
  ▄██▄▂▂▂▂▂▂▁▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▂▂▂▂▁▁▂▂▂▁▂▁▁▂▁▁▂▂ ▂
  494 μs          Histogram: frequency by time         11.3 ms <

 Memory estimate: 6.53 KiB, allocs estimate: 228.
BenchmarkTools.Trial: 3242 samples with 1 evaluation.
 Range (min … max):  328.209 μs … 167.401 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     699.604 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.535 ms ±   7.122 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅                                                             
  ███▆▅▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▃▁▃▁▃▁▁▁▁▃▁▁▃▁▃▃▄▄▄▃▁▃▃▃▄▁▁▁▃ █
  328 μs        Histogram: log(frequency) by time       33.7 ms <

 Memory estimate: 11.57 KiB, allocs estimate: 407.

AK gets slightly better if we set block_size=512, but some machines (e.g. Intel Graphics) only allow block_size <= 256, so I left the conservative defaults in.

I'd be curious to see benchmarks on a ROCm machine if you have one on hand.

@pxl-th
Member Author

pxl-th commented Nov 7, 2024

Great work! I just tested it on an AMD GPU with your script; with block_size=512, performance is definitely better:

Reduction over small axis - AK vs Base
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  47.157 μs … 864.506 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.875 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   71.476 μs ±  31.530 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▄▆██▆▄▃▃▂▁                          ▁▁▃▄▄▄▄▃▂▁▁            ▂
  █████████████▇▇▇▇▅▅▅▅▅▆▄▅▄▅▅▅▃▅▆▅▃▅▆▇▇████████████▇▆▆▆▅▃▆▄▄▄ █
  47.2 μs       Histogram: log(frequency) by time       154 μs <

 Memory estimate: 4.12 KiB, allocs estimate: 145.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   99.655 μs … 677.307 ms  ┊ GC (min … max): 0.00% … 1.66%
 Time  (median):     111.988 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   193.832 μs ±   6.772 ms  ┊ GC (mean ± σ):  0.58% ± 0.02%

     ▃▅██▇▆▄▂                           ▁▂▃▄▄▄▃▂▂▂▂▁▁           ▂
  ▄▇██████████▆▆▆▆▇▇▆▇▅▆▅▆▅▅▄▄▇▄▄▄▂▄▃▅▅▇█████████████▇▇▇▇▆▅▅▄▅▅ █
  99.7 μs       Histogram: log(frequency) by time        206 μs <

 Memory estimate: 4.06 KiB, allocs estimate: 148.

Reduction over long axis - AK vs Base
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  87.613 μs … 474.281 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     90.619 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   91.768 μs ±   7.733 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃█▂                                                       
  ▂▂▃▅███▆▆▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  87.6 μs         Histogram: frequency by time          118 μs <

 Memory estimate: 4.03 KiB, allocs estimate: 139.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  76.562 μs … 469.373 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     79.528 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   80.609 μs ±   7.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▅█▅▁                                                      
  ▂▂▄█████▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  76.6 μs         Histogram: frequency by time          106 μs <

 Memory estimate: 8.67 KiB, allocs estimate: 221.

@pxl-th
Member Author

pxl-th commented Nov 7, 2024

We could use the occupancy API to pick an optimal block_size before launching the kernel.
But I think KernelAbstractions.jl needs some work to expose this API.
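For reference, this is roughly what the occupancy-driven launch looks like at the CUDA.jl level today (a hedged sketch; vadd! and the arrays are just placeholders) - KernelAbstractions would need to expose something equivalent for its own kernel objects:

using CUDA

function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(1_000_000)
b = CUDA.rand(1_000_000)
c = similar(a)

# Compile without launching, then ask the occupancy API for a launch shape
kernel = @cuda launch=false vadd!(c, a, b)
config = launch_configuration(kernel.fun)   # suggested threads per block
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(c, a, b; threads, blocks)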

@anicusan
Member

anicusan commented Nov 8, 2024

That's great news, thank you for testing it on AMD. In terms of optimal kernel launches, I still like the generality and simplicity of something like:

popt = AK.@tune reduce(f, src, init=init, block_size=$block_size) block_size=(64, 128, 256, 512, 1024)

and

# Not tied to any specific backend, algorithm or API
popt = AK.@tune begin
    reduce(f, src, init=init,
           block_size=$block_size,
           switch_below=$switch_below)
    block_size=(64, 128, 256, 512, 1024)
    switch_below=(1, 10, 100, 1000, 10000)
end

This could try all combinations of the $ arguments and return the optimal one. One of these days I should learn Julia macros properly...
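In the meantime, a plain brute-force tuner would already do the job (a hypothetical sketch, not part of AK):

using BenchmarkTools
import AcceleratedKernels as AK

# Try each block size and keep the fastest; AK.reduce without `dims` returns
# a host value, which forces synchronisation before timing stops
function tune_block_size(f, src; init, block_sizes=(64, 128, 256, 512, 1024))
    best_bs, best_t = first(block_sizes), Inf
    for bs in block_sizes
        t = @belapsed AK.reduce($f, $src; init=$init, block_size=$bs)
        if t < best_t
            best_bs, best_t = bs, t
        end
    end
    best_bs
end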

@anicusan
Member

N-dimensional reduce and mapreduce implemented and tested.

Keeping this issue open until we add an N-dimensional accumulate too.

@yolhan83

yolhan83 commented Nov 13, 2024

Hello, I wonder if it would be possible to have a foreachaxes thing that would iterate over a specific axis, similar to

for i in axes(A,3)
end

something like

foreachaxes(A,3) do i
end

Maybe this would just mean making

function _foreachaxes_gpu(
    f,
    itr,
    n::Int,
    backend::GPU;
    block_size::Int=256,
)
    # GPU implementation: one thread per index along dimension `n`
    @argcheck block_size > 0
    _foreachindex_global!(backend, block_size)(f, axes(itr, n), ndrange=size(itr, n))
    nothing
end

and the other related methods too, or would there be an issue with the closure f?
Thank you a lot for this package btw.

@anicusan
Member

Hi @yolhan83, it was quite simple to add - it is available on the AcceleratedKernels#main version on the repo now :)

However, do keep in mind that on GPUs it is better to have more threads doing less work each (ideally uniformly), than fewer threads doing more work. So if you can process your elements independently, even if you're using a multidimensional array, it's better to use foreachindex (which just indexes into each element linearly).
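For example, doubling every element of a 2D array can stay fully linear (a small sketch; the ROCArray is just an example backend array):

import AcceleratedKernels as AK
using AMDGPU

A = ROCArray(rand(Float32, 256, 256))

# One light iteration per element, indexed linearly - no per-axis bookkeeping
AK.foreachindex(A) do i
    @inbounds A[i] = 2f0 * A[i]
end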

@yolhan83

Yeah, the main purpose I see for adding this is when most axes are set to some constant and you want to loop over the last ones without making a view. But yes, better not to try to do things on the resulting ndim-1 array inside this loop, for sure. Thank you ☺️

@yolhan83

yolhan83 commented Nov 13, 2024

Maybe, if it leads to people doing too much work per thread, this could be changed so that when looping over a multidimensional array you have one thread per index but an easy way to retrieve the dimensions of the original array, like a CartesianIndex way of thinking. Don't know if I'm clear enough here, sorry 😅 Maybe a foreachCartesianIndex? The one you added could still stay since it serves a different purpose.

@anicusan
Member

anicusan commented Nov 13, 2024

you want to loop over the last ones without making a view

That makes perfect sense.

have one thread per index but an easy way to retrieve the dimensions of the original array

I think you can use CartesianIndex inside a GPU kernel; you can also access size and strides if you need to do fancy index trickery manually (I did that in the N-dimensional reduce if you're curious). Or simply, for a 2D array you can do:

# GPU version with two nested `for` loops linearised
AK.foreachindex(1:nrows * ncols, backend) do i
    irow, icol = divrem(i - 1, ncols)
    # ... do work with `irow` and `icol`
end
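And for the CartesianIndex route, something like this should work inside the closure too (a sketch; A and backend are placeholders):

# One thread per element; recover the N-dimensional position from the linear index
ci = CartesianIndices(A)
AK.foreachindex(A, backend) do i
    irow, icol = Tuple(ci[i])
    # ... do work with `irow` and `icol`
end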
