groupreduction and subgroupreduction #421

brabreda · 2023-09-09T12:00:25Z

I am unsure why my previous PR closed but here are the changes.

I added docs
I added tests

It was my first time writing tests, and they passed. How are these tested on GPUs, and do I need more tests?

removed config include removed using GPUArrays versions

vchuravy · 2023-09-09T19:48:05Z

Thank you! As you found the notion of a subgroup is not necessarily consistent across backends. So I am wondering if we need to expose that as feature or if backends could use them to implement reduction.

So as an example could we "just" add rval= @reduce(op, val) to perform a workgroup level reduction?

What is rval across all members of a workgroup. Is it the same or may it divergent and only the first workitem will receive a correct answer?

Do we need a broadcast/shuffle operation?

I don't have the answers to those questions.

I think adding @reduce is definitely something we want to do, but subgroup ops may need more thought.

brabreda · 2023-09-09T21:34:47Z

rval=@reduce(op, val) is possible. In this case we only use localmem and @synchronize for the reduction.
at the moment only the first thread of the group returns an item, but I think it would be possible to change this.
I Think providing a neutral value will have better performance, but I am not sure. I Will check this.

I will fix this, meanwhile we can think about the wraps. If we will introduce it and in what way.

vchuravy · 2023-09-09T21:38:44Z

Yeah requiring a neutral value sounds good.

Could not each backend choose to use warp level ops, iff available? Instead of having the user make this decision.

brabreda · 2023-09-10T01:44:53Z

The problem with that approach is that warp intrinsics don’t work for every type.

We then would need something like this:

__reduce(op, val::union{type1, …},) for wraps
__reduce(op, val,) for everything else

In this case you don’t override the function. I am also not sure if this kind of type specialisation works where it takes the first function when it is one of the types. You probs have more expertise on this.

would it be prefered that every thread returns the right rval? I do think in that case I Will have to rethink the warp reduction bases groupreduce.

vchuravy · 2023-09-10T02:08:11Z

would it be prefered that every thread returns the right rval? I do think in that case I Will have to rethink the warp reduction bases groupreduce.

No it just needs to be documented :) and we might want to add a @broadcast.

So it looks like CUDA.jl makes this decision based on the element type?
https://github.com/JuliaGPU/CUDA.jl/blob/d57e02018ccf00ceda7672f5f4d98e2ceef9106d/src/mapreduce.jl#L177-L181

vchuravy · 2023-09-10T02:08:33Z

src/KernelAbstractions.jl

+"""
+    @subgroupsize()
+
+    returns the GPUs subgroupsize.
+"""
+macro subgroupsize()
+    quote
+        $__subgroupsize()
+    end
+end


Suggested change

"""

@subgroupsize()

returns the GPUs subgroupsize.

"""

macro subgroupsize()

quote

$__subgroupsize()

end

end

vchuravy · 2023-09-10T02:09:41Z

src/reduce.jl

+    idx_in_group = @index(Local)
+    groupsize = @groupsize()[1]
+
+    localmem = @localmem(T, groupsize)


I see if we can do a subgroupreduce the memory we need here is much reduced.

vchuravy · 2023-09-10T02:17:51Z

src/reduce.jl

+- `op`: the operator of the reduction
+- `val`: value that each thread contibutes to the values that need to be reduced
+- `netral`: value of the operator, so that `op(netural, neutral) = neutral``
+- `use_subgroups`: make use of the subgroupreduction of the groupreduction


I see the value of having a common implementation.

So I would define:

@reduce(op, val, neutral) __reduce(op, val, neutral)

And then maybe:

__subgroupreduce & __subgroupsize (No @ version) __can_subgroup_reduce(T) = false

And then we could define:

__reduce(__ctx___, op, val, neutral, ::Val{true}) __reduce(__ctx___, op, val, neutral, ::Val{false})

as you have here.

function __reduce(op, val::T, neutral::T) where T __reduce(op, val, neutral, Val(__can_subgroup_reduce(T))) end

brabreda · 2023-09-10T22:01:59Z

Yeah so CUDA(and i think most GPU backends make the decision based on type). I was thinking if something like this would be possible:

For the implementation without warps:

@reduce(op,val,neutral)
__reduce(op,val,neutral)

And every back-end like CUDAKernels just implements something like this:

__reduce(op,val::T,neutral) where {T <: Union{Float32, Float64, etc...}}

It would allow for 1 common implementation and a few specializations that are done by the backends themselves. But,
I tried It today and I get a lot of dynamic invocation errors. But I am not sure what's at fault.

vchuravy · 2023-09-11T21:06:35Z

I recommed using Cthulhu, using CUDA and then using CUDA.device_code_typed interactive=true

…lAbstractions.jl into feature/reduction

brabreda · 2023-09-14T20:31:30Z

Only the group reduction without subgroups remains. It is ready to merge, once merged I will add support for Subgroups to the CUDAkernels, MetalKernels, and OneAPI.

vchuravy · 2023-09-19T14:29:30Z

Okay so this is API neutral? Pure addition and we don't need anything new?
Great! Otherwise we could have wrapped that in with #422

vchuravy · 2023-09-19T14:35:17Z

src/reduce.jl

+    # perform the reduction
+    d = 1
+    while d < groupsize
+        @synchronize()


Workaround?

…lAbstractions.jl into feature/reduction

brabreda added 2 commits August 14, 2023 13:12

kernel config struct

cc510ac

group- and warpreduce

1c1e459

maleadt marked this pull request as draft September 9, 2023 13:37

fixes

dd3a0ca

vchuravy mentioned this pull request Sep 9, 2023

exposing warp-level semantics #420

Open

brabreda and others added 9 commits September 9, 2023 17:38

add warpsize to config

546e8c9

adding & subgroupreduce

3602808

deps

42a7960

added docs & removed part for GPUArrays

c96a24a

added docs & tests

d2d65be

Remove deps for PR

128a5f0

Ensure NTuple index functions are inlined (#414)

b899685

manifest.toml, reset project.toml

1cdb6d6

removed config include removed using GPUArrays versions

Merge branch 'JuliaGPU:main' into feature/reduction

1fea4cc

vchuravy reviewed Sep 10, 2023

View reviewed changes

brabreda added 2 commits September 14, 2023 22:18

move groupreduce with subgroups to backends

41356d3

Merge branch 'feature/reduction' of https://github.com/brabreda/Kerne…

88662f8

…lAbstractions.jl into feature/reduction

brabreda marked this pull request as ready for review September 14, 2023 20:25

vchuravy reviewed Sep 19, 2023

View reviewed changes

cleanup and wire-up tests

45844ce

This comment was marked as outdated.

Sign in to view

brabreda and others added 3 commits January 11, 2024 21:21

fixup use of groupsize

c5dc356

Merge branch 'feature/reduction' of https://github.com/brabreda/Kerne…

e2c8f84

…lAbstractions.jl into feature/reduction

Merge branch 'JuliaGPU:main' into feature/reduction

700d5f2

brabreda closed this by deleting the head repository Jul 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

groupreduction and subgroupreduction #421

groupreduction and subgroupreduction #421

brabreda commented Sep 9, 2023

vchuravy commented Sep 9, 2023

brabreda commented Sep 9, 2023

vchuravy commented Sep 9, 2023

brabreda commented Sep 10, 2023 •

edited

Loading

vchuravy commented Sep 10, 2023

vchuravy Sep 10, 2023

vchuravy Sep 10, 2023

vchuravy Sep 10, 2023

brabreda commented Sep 10, 2023 •

edited

Loading

vchuravy commented Sep 11, 2023

brabreda commented Sep 14, 2023

vchuravy commented Sep 19, 2023

vchuravy Sep 19, 2023

maleadt Sep 28, 2023

This comment was marked as outdated.

groupreduction and subgroupreduction #421

groupreduction and subgroupreduction #421

Conversation

brabreda commented Sep 9, 2023

vchuravy commented Sep 9, 2023

brabreda commented Sep 9, 2023

vchuravy commented Sep 9, 2023

brabreda commented Sep 10, 2023 • edited Loading

vchuravy commented Sep 10, 2023

vchuravy Sep 10, 2023

Choose a reason for hiding this comment

vchuravy Sep 10, 2023

Choose a reason for hiding this comment

vchuravy Sep 10, 2023

Choose a reason for hiding this comment

brabreda commented Sep 10, 2023 • edited Loading

vchuravy commented Sep 11, 2023

brabreda commented Sep 14, 2023

vchuravy commented Sep 19, 2023

vchuravy Sep 19, 2023

Choose a reason for hiding this comment

maleadt Sep 28, 2023

Choose a reason for hiding this comment

This comment was marked as outdated.

brabreda commented Sep 10, 2023 •

edited

Loading

brabreda commented Sep 10, 2023 •

edited

Loading