Introducing @reduce for group level reduction #379
Conversation
lib/CUDAKernels/src/CUDAKernels.jl
Outdated
threadIdx = KernelAbstractions.@index(Local)

# shared mem for a complete reduction
shared = KernelAbstractions.@localmem(T, 1024)
Maybe this is the moment we need dynamic shared memory support?
x-ref: #11
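For context, a minimal sketch of what dynamic shared memory looks like on the CUDA.jl side, where the buffer size is supplied at launch instead of being hard-coded to 1024 elements. The kernel and names below are illustrative only, not part of this PR:

using CUDA

# Illustrative only: a dynamically sized shared buffer in place of a fixed
# KernelAbstractions.@localmem(T, 1024). The size comes from the launch.
function shared_copy_kernel!(out, x)
    tid = threadIdx().x
    shared = CuDynamicSharedArray(eltype(x), blockDim().x)  # sized via the `shmem` launch keyword
    shared[tid] = x[tid]
    sync_threads()
    out[tid] = shared[tid]
    return
end

# x = CUDA.rand(Float32, 256); out = similar(x)
# @cuda threads=length(x) shmem=length(x)*sizeof(Float32) shared_copy_kernel!(out, x)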
lib/CUDAKernels/src/CUDAKernels.jl
Outdated
# perform the reduction
d = 1
while d < threads
    KernelAbstractions.@synchronize()
You are inside CUDAKernels here, and as such you can use CUDA.jl functionality directly.
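For illustration, a hedged sketch of what that could look like with CUDA.jl intrinsics used directly; the names shared, threads, threadIdx, and op are taken from the surrounding snippet and assumed to be in scope:

# Sketch only: the same interleaved tree reduction, using CUDA.sync_threads()
# directly instead of KernelAbstractions.@synchronize().
d = 1
while d < threads
    sync_threads()                       # CUDA.jl barrier, usable inside CUDAKernels
    index = 2 * d * (threadIdx - 1) + 1  # left element of each pair at this stride
    if index + d <= threads
        shared[index] = op(shared[index], shared[index + d])
    end
    d *= 2
end
sync_threads()
# shared[1] now holds the group-level result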
That's correct! But an implementation with KA.jl macros would allow for a single implementation that can run on all supported back-ends. Because of this, I am not sure what the best place for this implementation's code is.
Also, the main difference between the back-ends would be the size of local memory, but using dynamic memory would be a solution to this.
Looks like a great start! Will have to add it to
To make a more generalized @reduce operation, I would work with a Config struct. An example of this can be found in the GemmKernels.jl Config. Based on this struct, the reduction could use atomics and lane/warp reductions.
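A rough sketch of what such a configuration could look like; the field names and defaults below are assumptions for illustration, not an existing API in this PR or in GemmKernels.jl:

# Hypothetical configuration for a generalized group-level reduction.
Base.@kwdef struct ReduceConfig
    groupsize::Int = 256            # work-items per group
    items_per_workitem::Int = 1     # elements each work-item reduces serially first
    use_warp_shuffle::Bool = false  # use lane/warp-level reductions where the back-end supports them
    use_atomics::Bool = false       # combine per-group partials with atomics instead of a second pass
end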
The @reduce macro performs a group-level reduction.
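A hedged usage sketch, assuming a call form like @reduce(op, val); the exact macro signature and the kernel below are illustrative, not taken from this PR:

using KernelAbstractions

# Illustrative kernel: reduce one value per work-item down to one value per group.
# Assumes @reduce(op, val) returns the group-level result; the real signature may differ.
@kernel function group_sum!(out, @Const(x))
    i = @index(Global)
    val = x[i]
    total = @reduce(+, val)
    if @index(Local) == 1
        out[@index(Group)] = total  # one partial result per work-group
    end
end

# Possible launch on the CUDA back-end of that era (CUDAKernels.jl):
# kernel = group_sum!(CUDADevice(), 256)
# wait(kernel(out, x; ndrange=length(x)))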
TODOs: