
Introducing @reduce for group level reduction #379

Closed · wants to merge 6 commits into from

Conversation

@brabreda commented Apr 5, 2023

The @reduce macro performs a group level reduction.

TODOs:

  • Figure out a place for the implementation.
  • Add a lane level reduction.
  • Create a more advanced group level reduction that can utilize platform-dependent features such as lane reduction and atomics.
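
For orientation, a minimal sketch (not the code in this PR) of the kind of group-level tree reduction such a macro could expand to, written against the public KernelAbstractions macros; the kernel name, the fixed 1024-element scratch buffer, and the explicit op argument are illustrative assumptions:

using KernelAbstractions

# Illustrative sketch only: a plain tree reduction over one workgroup.
# The 1024-element buffer assumes groups of at most 1024 work-items.
@kernel function groupreduce_sketch!(out, a, op)
    tid = @index(Local)                 # index within the workgroup
    gid = @index(Global)                # index into the input array
    N   = prod(@groupsize())

    shared = @localmem eltype(a) 1024   # statically sized local memory

    shared[tid] = a[gid]

    # pairwise tree reduction: halve the number of active items each step
    d = 1
    while d < N
        @synchronize()
        if (tid - 1) % (2d) == 0 && tid + d <= N
            shared[tid] = op(shared[tid], shared[tid + d])
        end
        d *= 2
    end

    # work-item 1 holds this group's partial result
    if tid == 1
        out[@index(Group)] = shared[1]
    end
end

The hard-coded 1024 is exactly the limitation the dynamic shared memory discussion below is about.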

threadIdx = KernelAbstractions.@index(Local)

# shared mem for a complete reduction
shared = KernelAbstractions.@localmem(T, 1024)

Member: Maybe this is the moment we need dynamic shared memory support?

Member: x-ref: #11
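
For reference, the feature being discussed already exists on the CUDA.jl side, where the shared allocation size is supplied at launch time rather than hard-coded in the kernel. A minimal sketch (hypothetical kernel, not KernelAbstractions API):

using CUDA

# CUDA.jl dynamic shared memory: the size is chosen per launch via `shmem`,
# instead of being fixed in the kernel like a static @localmem allocation.
function dyn_shmem_copy!(out, a)
    shared = CuDynamicSharedArray(eltype(a), blockDim().x)
    i = threadIdx().x
    shared[i] = a[i]
    sync_threads()
    out[i] = shared[i]
    return
end

n = 256
a = CUDA.rand(Float32, n)
out = similar(a)
@cuda threads=n shmem=n * sizeof(Float32) dyn_shmem_copy!(out, a)

A KernelAbstractions-level equivalent would let @reduce size its scratch buffer from the launch configuration instead of a fixed 1024 elements.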

# perform the reduction
d = 1
while d < threads
    KernelAbstractions.@synchronize()

Member: You are inside CUDAKernels here and as such you can use CUDA.jl functionality directly.

Author: That's correct! But an implementation with KA.jl macros would allow a single implementation that can run on all supported back-ends. Because of this, I am not sure where the code for this implementation best belongs.

Also, the main difference between back-ends would be the size of local memory, but the use of dynamic memory would be a solution to this.
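
To make the "platform-dependent features such as lane reduction" from the TODO list concrete: on the CUDA back-end this would boil down to something like the following warp-shuffle reduction, written directly against CUDA.jl intrinsics (a hypothetical device-side helper, not code from this PR):

using CUDA

# Hypothetical helper: reduce a value across one warp with shuffle intrinsics,
# needing no shared memory or synchronization. Only usable when the kernel is
# compiled for the CUDA back-end, which is exactly the portability trade-off above.
@inline function warp_reduce(val, op)
    mask = 0xffffffff                      # all 32 lanes participate
    offset = CUDA.warpsize() ÷ 2
    while offset > 0
        val = op(val, CUDA.shfl_down_sync(mask, val, offset))
        offset ÷= 2
    end
    return val                             # lane 1 ends up with the combined value
end

Deciding when to take this path instead of the portable @localmem tree is the kind of choice the Config idea below could encode.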

@vchuravy (Member) commented Apr 6, 2023

Looks like a great start! Will have to add it to 0.9 but that can happen after you are happy with the initial implementation.

@brabreda (Author) commented Apr 6, 2023

To make a more generalized @reduce operation, I would work with a Config struct. An example of this can be found in the GemmKernels.jl Config.

Based on this struct, the reduction could use atomics and lane/warp reductions.
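
A hypothetical shape for such a struct (all names and fields are illustrative, loosely following the compile-time Config style of GemmKernels.jl):

# Illustrative only: one possible configuration for a generalized @reduce.
Base.@kwdef struct ReduceConfig
    groupsize::Int        = 1024   # work-items per group, and size of the scratch buffer
    use_warp_reduce::Bool = false  # use lane-level shuffles where the back-end supports them
    use_atomics::Bool     = false  # combine per-group results with atomics instead of a second pass
end

Specializing the reduction on such a config (or on a type-level equivalent, as GemmKernels.jl does) would let a single front-end macro pick between the shared-memory tree, warp shuffles, and atomics per back-end.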

@brabreda closed this by deleting the head repository on Apr 11, 2023