
groupreduction and subgroupreduction #421

Closed
wants to merge 18 commits into from

Conversation

@brabreda commented Sep 9, 2023

I am unsure why my previous PR was closed, but here are the changes.

  • I added docs
  • I added tests

It was my first time writing tests, and they passed. How are these tested on GPUs, and do I need more tests?

@maleadt marked this pull request as draft September 9, 2023 13:37
@vchuravy (Member) commented Sep 9, 2023

Thank you! As you found, the notion of a subgroup is not necessarily consistent across backends. So I am wondering whether we need to expose that as a feature, or whether backends could just use it internally to implement the reduction.

So, as an example, could we "just" add `rval = @reduce(op, val)` to perform a workgroup-level reduction?

What is `rval` across all members of a workgroup? Is it the same, or may it diverge so that only the first workitem receives a correct answer?

Do we need a broadcast/shuffle operation?

I don't have the answers to those questions.

I think adding @reduce is definitely something we want to do, but subgroup ops may need more thought.
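
For concreteness, here is a minimal sketch of how such a macro might be used inside a KernelAbstractions kernel. Note that `@reduce` is the API being proposed in this thread, not an existing macro, and the kernel name and output layout are made up for illustration:

```julia
using KernelAbstractions

# Hypothetical use of the proposed workgroup-level reduction.
@kernel function block_sum!(partial, a)
    i = @index(Global)
    # Every workitem contributes one element; `rval` would be the
    # workgroup-wide sum. Whether all workitems or only the first one
    # see the correct `rval` is exactly the open question above.
    rval = @reduce(+, a[i])
    if @index(Local) == 1
        partial[@index(Group)] = rval
    end
end
```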

@brabreda (Author) commented Sep 9, 2023

  • rval = @reduce(op, val) is possible. In this case we only use localmem and @synchronize for the reduction (a sketch of that variant follows below).

  • At the moment only the first thread of the group returns a result, but I think it would be possible to change this.

  • I think providing a neutral value will give better performance, but I am not sure. I will check this.

I will fix this; meanwhile, we can think about the warps: whether we introduce them and in what way.
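
For reference, the "only localmem and @synchronize" variant mentioned in the first bullet is essentially a shared-memory tree reduction. A rough, self-contained sketch follows; the kernel name, the hardcoded `Float32`/`+`, and the fixed 256-slot scratch buffer are my simplifications, not code from this PR:

```julia
using KernelAbstractions

@kernel function groupsum_localmem!(out, a)
    i   = @index(Global)
    lid = @index(Local)
    gs  = @groupsize()[1]

    # One scratch slot per workitem; assumes a groupsize of at most 256.
    shared = @localmem Float32 (256,)
    shared[lid] = i <= length(a) ? a[i] : 0.0f0   # 0 is the neutral element of +

    # Interleaved tree reduction: the stride doubles every step.
    s = 1
    while s < gs
        @synchronize()
        idx = 2 * s * (lid - 1) + 1
        if idx + s <= gs
            shared[idx] += shared[idx + s]
        end
        s *= 2
    end
    @synchronize()

    # Only the first workitem of the group holds the final value,
    # matching the behaviour described above.
    if lid == 1
        out[@index(Group)] = shared[1]
    end
end
```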

@vchuravy (Member) commented Sep 9, 2023

Yeah, requiring a neutral value sounds good.

Couldn't each backend choose to use warp-level ops, if available, instead of having the user make this decision?

@brabreda (Author) commented Sep 10, 2023

The problem with that approach is that warp intrinsics don’t work for every type.

We then would need something like this:

__reduce(op, val::Union{Type1, …})   # for warps
__reduce(op, val)                    # for everything else

In this case you don't override the function. I am also not sure whether this kind of type specialisation works, i.e. that the more specific method is taken when the value is one of those types. You probably have more expertise on this.
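
On the dispatch question: plain Julia already does what is needed here, since the most specific applicable method wins, so the Union-typed method is picked automatically for those types and the generic one otherwise. A tiny CPU-only illustration (nothing GPU-specific; the function name is chosen just for the demo):

```julia
# Generic fallback: used for any type not covered by a more specific method.
__reduce_demo(op, val) = "generic localmem path"

# More specific method: automatically preferred for the listed types.
__reduce_demo(op, val::Union{Float32, Float64}) = "warp/subgroup path"

__reduce_demo(+, 1)      # -> "generic localmem path"  (Int is not in the Union)
__reduce_demo(+, 1.0f0)  # -> "warp/subgroup path"     (Float32 is)
```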

Would it be preferred that every thread returns the right rval? I do think in that case I will have to rethink the warp-reduction-based groupreduce.

@vchuravy (Member)

> Would it be preferred that every thread returns the right rval? I do think in that case I will have to rethink the warp-reduction-based groupreduce.

No, it just needs to be documented :) and we might want to add a @broadcast.

So it looks like CUDA.jl makes this decision based on the element type?
https://github.com/JuliaGPU/CUDA.jl/blob/d57e02018ccf00ceda7672f5f4d98e2ceef9106d/src/mapreduce.jl#L177-L181

Comment on lines 145 to 154

"""
    @subgroupsize()

Returns the GPU's subgroup size.
"""
macro subgroupsize()
    quote
        $__subgroupsize()
    end
end
Member:

Suggested change: remove these lines.

src/reduce.jl Outdated
idx_in_group = @index(Local)
groupsize = @groupsize()[1]

localmem = @localmem(T, groupsize)
Member:

I see: if we can do a subgroupreduce, the memory we need here is much reduced.

src/reduce.jl Outdated
- `op`: the operator of the reduction
- `val`: value that each thread contributes to the values that need to be reduced
- `neutral`: neutral value of the operator, so that `op(neutral, neutral) = neutral`
- `use_subgroups`: make use of the subgroup reduction in the group reduction
Member:

I see the value of having a common implementation.

So I would define:

@reduce(op, val, neutral)
__reduce(op, val, neutral)

And then maybe:

__subgroupreduce & __subgroupsize (No @ version)
__can_subgroup_reduce(T) = false

And then we could define:

__reduce(__ctx__, op, val, neutral, ::Val{true})
__reduce(__ctx__, op, val, neutral, ::Val{false})

as you have here.

function __reduce(op, val::T, neutral::T) where T
    __reduce(op, val, neutral, Val(__can_subgroup_reduce(T)))
end
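
Pulling those pieces together, the dispatch wiring could look roughly like this. This is a sketch only: the hidden `__ctx__` argument is omitted for brevity, the method bodies are placeholders, and the final opt-in line is an example of what a backend such as CUDAKernels might add, not code from this PR:

```julia
# Capability trait: backends override this for element types that have
# a native subgroup/warp reduction.
__can_subgroup_reduce(::Type) = false

# Entry point: choose the implementation at compile time via Val.
function __reduce(op, val::T, neutral::T) where {T}
    __reduce(op, val, neutral, Val(__can_subgroup_reduce(T)))
end

# Subgroup-accelerated path; a backend would supply __subgroupreduce
# and __subgroupsize.
function __reduce(op, val, neutral, ::Val{true})
    # placeholder: subgroup reduction, then combine one partial per subgroup
end

# Portable path using only local memory and @synchronize.
function __reduce(op, val, neutral, ::Val{false})
    # placeholder: tree reduction over local memory, as sketched earlier
end

# Example backend opt-in (hypothetical):
# __can_subgroup_reduce(::Type{T}) where {T <: Union{Float32, Float64}} = true
```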

@brabreda (Author) commented Sep 10, 2023

Yeah, so CUDA (and, I think, most GPU backends) makes the decision based on type. I was wondering whether something like this would be possible:

For the implementation without warps:

@reduce(op, val, neutral)
__reduce(op, val, neutral)

And every backend, like CUDAKernels, just implements something like this:

__reduce(op, val::T, neutral) where {T <: Union{Float32, Float64, etc...}}

It would allow for one common implementation and a few specializations done by the backends themselves. But I tried it today and I got a lot of dynamic invocation errors, and I am not sure what's at fault.

@vchuravy (Member)

I recommend using Cthulhu: `using CUDA` and then `CUDA.@device_code_typed interactive=true`.
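
A sketch of that workflow, using a trivial stand-in kernel (`my_kernel` and its launch configuration are placeholders, not code from this PR); `interactive=true` hands the typed IR to Cthulhu so the dynamic invocations can be tracked down:

```julia
using CUDA, Cthulhu

# Trivial kernel, just to have something to inspect.
function my_kernel(out, a)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)
        CUDA.@atomic out[1] += a[i]
    end
    return
end

a = CUDA.rand(Float32, 1024)
out = CUDA.zeros(Float32, 1)

# Descend interactively into the typed IR of the kernel launched inside
# the wrapped expression, looking for dynamic (untyped) calls.
CUDA.@device_code_typed interactive=true @cuda threads=256 blocks=4 my_kernel(out, a)
```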

@brabreda brabreda marked this pull request as ready for review September 14, 2023 20:25
@brabreda (Author)

Only the group reduction without subgroups remains. It is ready to merge; once merged, I will add support for subgroups in CUDAKernels, MetalKernels, and oneAPI.

@vchuravy (Member)

Okay, so this is API-neutral? A pure addition, and we don't need anything new?
Great! Otherwise we could have wrapped that in with #422.

# perform the reduction
d = 1
while d < groupsize
    @synchronize()
Member:

Workaround?

@brabreda

This comment was marked as outdated.

@brabreda closed this by deleting the head repository on Jul 10, 2024