ALPAKA support #562
Comments
Fantastic idea! Let's ping @psychocoderHPC @BenjaminW3 @j-stephan @bernhardmgruber @sbastrakov @bussmann. Background: The Parallel Research Kernels (ParRes Kernels) are a set of simple programs that can be used to explore the features of a parallel platform: https://github.com/ParRes/Kernels
The context here is that we support a wide range of modern C++ parallel models, including Kokkos, TBB, OpenMP and C++17 Parallel STL, so adding an Alpaka port means people can compare a lot of things at once using tests that were created by people who are relatively objective. https://youtu.be/bXeDfA21-VA shows some examples of things that have been done with them before. I also suspect that the total porting time is less than a day, since the total amount of code that needs porting is very small (nstream is ~3 lines, transpose is ~4 lines, stencil is ~4 lines plus code generation, etc).
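To give a sense of how small the core of nstream is, here is a minimal sketch in plain sequential C++. It assumes the usual PRK triad form `A[i] += B[i] + scalar * C[i]`; the function name `nstream_step` is hypothetical and this is not Alpaka code, only the loop body a port would have to express.

```cpp
#include <cstddef>
#include <vector>

// One nstream iteration (assumed triad form): A[i] += B[i] + scalar * C[i].
// nstream_step is a hypothetical name used only for this sketch.
void nstream_step(std::vector<double>& A,
                  const std::vector<double>& B,
                  const std::vector<double>& C,
                  double scalar)
{
    const std::size_t n = A.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Every element is independent, so this loop maps directly
        // onto a one-dimensional parallel kernel.
        A[i] += B[i] + scalar * C[i];
    }
}
```

In a port to a parallel model, only the loop is replaced by that model's parallel-for or kernel launch; the arithmetic stays the same.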
Great idea. Will do our best to support this. Happy Easter holidays from Germany!
Porting guide

Do it in this order:

Look at the Python implementations if you want the easiest-to-read code as a reference. Or look at whatever language you like best. The simplest implementation will be named "kernel.suffix".

Detail

transpose

It must use a standard row or column major storage. In distributed memory, you must decompose in only one dimension so the communication is all-to-all. Blocking for cache/TLB is useful on CPUs. GPU optimizations are tricky. The CUDA implementation is not optimal. It will be fixed eventually.

stencil

Figure out one pattern (e.g. star with radius=2) and then tell me so I can roll it into the code generator.

dgemm

Read https://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf and implement that if you can but I've never done this and won't judge you at all for just writing triple loops and calling it good.

p2p

Look at slides 30-37 of https://drive.google.com/file/d/1yNQiG-wjBI4Iu6yDPV6WcQL-r8Yt9RSV/view if it helps to understand the design space. Hyperplane method is probably best on GPU unless you use cooperative groups or do other tricky stuff.
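For readers unfamiliar with the hyperplane method mentioned for p2p, here is a rough sketch in plain sequential C++. It assumes the p2p update is `g(i,j) = g(i-1,j) + g(i,j-1) - g(i-1,j-1)` on a row-major n x n grid whose first row and column are initialized to their index. The names `Grid`, `make_grid`, `sweep_sequential` and `sweep_hyperplane` are hypothetical, and nothing here uses the Alpaka API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Row-major n x n grid (hypothetical helper type for this sketch).
using Grid = std::vector<double>;

// Assumed initialization: first row and first column hold their index.
Grid make_grid(std::size_t n)
{
    Grid g(n * n, 0.0);
    for (std::size_t k = 0; k < n; ++k) {
        g[k] = static_cast<double>(k);      // first row
        g[k * n] = static_cast<double>(k);  // first column
    }
    return g;
}

// Reference: row-by-row sweep, inherently sequential in both loops.
void sweep_sequential(Grid& g, std::size_t n)
{
    for (std::size_t i = 1; i < n; ++i) {
        for (std::size_t j = 1; j < n; ++j) {
            g[i * n + j] = g[(i - 1) * n + j] + g[i * n + (j - 1)]
                         - g[(i - 1) * n + (j - 1)];
        }
    }
}

// Hyperplane: visit anti-diagonals d = i + j in increasing order.
// Cells on one diagonal only read diagonals d-1 and d-2, so the inner
// loop has no dependencies and could be one parallel kernel launch.
void sweep_hyperplane(Grid& g, std::size_t n)
{
    if (n < 2) return;  // nothing to update, and avoids n - 1 underflow
    for (std::size_t d = 2; d <= 2 * (n - 1); ++d) {
        const std::size_t i_lo = (d > n - 1) ? d - (n - 1) : 1;
        const std::size_t i_hi = std::min(d - 1, n - 1);
        for (std::size_t i = i_lo; i <= i_hi; ++i) {
            const std::size_t j = d - i;
            g[i * n + j] = g[(i - 1) * n + j] + g[i * n + (j - 1)]
                         - g[(i - 1) * n + (j - 1)];
        }
    }
}
```

The trade-off is that the outer loop over diagonals remains sequential, which on a GPU means roughly 2n small launches (or one launch with grid-wide synchronization), and the diagonals near the corners expose very little parallelism.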
Hi @jeffhammond - I'm helping with some benchmarks for PIConGPU and Alpaka, so I'll take a look at these kernel ports in more detail.
btw @jyoung3131 if you want to be a PIC boss, you'll see there is a PIC PRK with a limited number of implementations. @hattom added SOA and AOS versions in Fortran that would be great targets to study with the C++ stuff.
LOVE the PIC PRK stuff, @jeffhammond! If one dares to use some more 'experimental' work I recommend looking into combining Alpaka with Llama to tackle SoA/AoS and other data layout decisions with a single source code. Thanks for looking into this, @jyoung3131, please coordinate with the Alpaka team, we'll be glad to support this.
@bussmann you forgot to link llama documentation + llama github
Add support for https://github.com/alpaka-group/alpaka because we want to support all the C++ programming models.
@ax3l know anybody who can help here? 😉