
This commit is a GPU port of module_bl_mynn.F90. OpenACC was used for… #1005

Open · wants to merge 1 commit into main

Conversation

middlecoff

OpenACC was used for the port. The code was run with IM, the number of columns, equal to 10240. For 128 levels the GPU is 19X faster than one CPU core; for 256 levels it is 26X faster.

An OpenACC directive was added to bl_mynn_common.f90.
Although OpenACC directives are ignored by CPU compilations, extensive changes to module_bl_mynn.F90 were required to optimize for the GPU. Consequently, the GPU port of module_bl_mynn.F90, while producing bit-for-bit CPU results, runs 20% slower on the CPU. The GPU run produces results that are within roundoff of the original CPU results.
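The directive itself is not shown here; a common pattern for making module data usable inside device kernels looks like the sketch below (module and variable names are illustrative, not necessarily what bl_mynn_common.f90 actually uses):

```fortran
! Illustrative only -- not the actual contents of bl_mynn_common.f90.
module bl_mynn_common_sketch
  implicit none
  real :: gtr   ! example module variable referenced inside GPU kernels
  ! Create a device copy of the module variable; CPU-only compilations
  ! treat the directive as an ordinary comment and ignore it.
!$acc declare create(gtr)
end module bl_mynn_common_sketch
```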
The porting method was to create a stand-alone driver for testing on the GPU. A kernels directive was applied to the outer I loop over columns so that iterations of the outer loop are processed simultaneously. Inner loops are vectorized where possible.
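A minimal sketch of that loop structure, with illustrative variable names and physics rather than the actual MYNN arrays:

```fortran
subroutine column_kernel_sketch(im, kte, exner, th, qc, thl)
  implicit none
  integer, intent(in)  :: im, kte
  real,    intent(in)  :: exner(im,kte), th(im,kte), qc(im,kte)
  real,    intent(out) :: thl(im,kte)
  real,    parameter   :: xlvcp = 2.5e6/1004.7   ! Lv/cp, illustrative value
  integer :: i, k

!$acc kernels
!$acc loop independent gang
  do i = 1, im                 ! outer column loop: iterations run simultaneously
!$acc loop vector
    do k = 1, kte              ! independent inner loop: vectorized
      thl(i,k) = th(i,k) - xlvcp*qc(i,k)/exner(i,k)
    end do
  end do
!$acc end kernels
end subroutine column_kernel_sketch
```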
Some of the GPU optimizations were:
Allocation is slow on the GPU, and automatic arrays are allocated upon subroutine entry, so they are costly there. Consequently, automatic arrays were changed to arrays passed in as arguments and promoted to arrays indexed by the outer I loop, so allocation happens only once.
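A sketch of that transformation, with hypothetical routine and array names (not the actual MYNN code):

```fortran
! Before: the scratch array is automatic, so it is allocated at every call.
subroutine moments_auto(kte, qin, qout)
  implicit none
  integer, intent(in)  :: kte
  real,    intent(in)  :: qin(kte)
  real,    intent(out) :: qout(kte)
  real :: work(kte)             ! automatic array: allocated on subroutine entry
  work = 0.5*qin
  qout = work + qin
end subroutine moments_auto

! After: the scratch array is allocated once by the caller, promoted with the
! column index i, and passed in as an argument.
subroutine moments_promoted(i, im, kte, work, qin, qout)
  implicit none
  integer, intent(in)    :: i, im, kte
  real,    intent(inout) :: work(im,kte)    ! allocated once, outside the I loop
  real,    intent(in)    :: qin(kte)
  real,    intent(out)   :: qout(kte)
  integer :: k
  do k = 1, kte
    work(i,k) = 0.5*qin(k)
    qout(k)   = work(i,k) + qin(k)
  end do
end subroutine moments_promoted
```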
Variables in vector loops must be private to prevent conflicts, which normally means allocation at the beginning of the kernel. To prevent allocation each time the I loop runs, large private arrays were promoted to arrays indexed by the outer I loop, so allocation happens only once, outside the kernel.
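A sketch under the same assumptions: without the promotion, the scratch array would need a private() clause on the i loop and would be re-created for every iteration; with it, the array is sized (im,kte), allocated once before the kernel, and simply indexed by i inside it.

```fortran
subroutine promoted_scratch_sketch(im, kte, a, b, tmp)
  implicit none
  integer, intent(in)    :: im, kte
  real,    intent(in)    :: a(im,kte)
  real,    intent(out)   :: b(im,kte)
  real,    intent(inout) :: tmp(im,kte)   ! promoted scratch, allocated once by the caller
  integer :: i, k

!$acc kernels
!$acc loop independent gang
  do i = 1, im
!$acc loop vector
    do k = 1, kte
      tmp(i,k) = 2.0*a(i,k)              ! indexed by i: no private(tmp) needed
    end do
!$acc loop vector
    do k = 1, kte
      b(i,k) = tmp(i,k) + a(i,k)
    end do
  end do
!$acc end kernels
end subroutine promoted_scratch_sketch
```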
Speedup is limited by DO loops containing dependencies, which cannot be vectorized and instead run on a single GPU thread. The predominant dependency type is the loop-carried dependency, which occurs when an iteration depends on values calculated in an earlier iteration. Many of these loops search for a value and then exit. There are also many calls to tridiagonal solvers, which have loop-carried dependencies; after the other optimizations, the tridiagonal solvers use 29% of the total GPU runtime.
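For reference, a generic Thomas-algorithm tridiagonal solve of the kind called once per column (this is an illustration, not the actual MYNN solver): both sweeps reference the previous level, so neither loop vectorizes and each column's solve runs on a single GPU thread.

```fortran
subroutine thomas_sketch(kte, a, b, c, d, x)
  implicit none
  integer, intent(in)  :: kte
  real,    intent(in)  :: a(kte), b(kte), c(kte), d(kte)  ! sub-, main-, super-diagonal, RHS
  real,    intent(out) :: x(kte)
  real    :: cp(kte), dp(kte), m
  integer :: k

  ! Forward sweep: each level uses cp(k-1) and dp(k-1) -> loop-carried dependency.
  cp(1) = c(1)/b(1)
  dp(1) = d(1)/b(1)
  do k = 2, kte
    m     = b(k) - a(k)*cp(k-1)
    cp(k) = c(k)/m
    dp(k) = (d(k) - a(k)*dp(k-1))/m
  end do

  ! Back substitution: each level uses x(k+1) -> also loop-carried.
  x(kte) = dp(kte)
  do k = kte-1, 1, -1
    x(k) = dp(k) - cp(k)*x(k+1)
  end do
end subroutine thomas_sketch
```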
Some value-searching loops were rearranged to allow vectorization. Further speedup could be achieved by restructuring more of the value-searching loops so they vectorize. Parallel tridiagonal solvers exist, but they would not be bit-for-bit with the current solvers, so they should be implemented in cooperation with a physics expert.
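One way such a search can be rearranged, shown as a sketch with illustrative names (the real loops in module_bl_mynn.F90 differ): the early-exit search is replaced by a min-reduction over the index, which returns the same first matching level but can be vectorized.

```fortran
subroutine find_first_level_sketch(kte, thetav, thresh, kfound)
!$acc routine vector
  implicit none
  integer, intent(in)  :: kte
  real,    intent(in)  :: thetav(kte), thresh
  integer, intent(out) :: kfound
  integer :: k

  ! Original pattern (serial): exit at the first level meeting the condition.
  !   kfound = kte
  !   do k = 1, kte
  !     if (thetav(k) > thresh) then
  !       kfound = k
  !       exit
  !     end if
  !   end do

  ! Rearranged pattern: a min-reduction over the index gives the same result
  ! (kte if no level qualifies) and vectorizes.
  kfound = kte
!$acc loop vector reduction(min:kfound)
  do k = 1, kte
    if (thetav(k) > thresh) kfound = min(kfound, k)
  end do
end subroutine find_first_level_sketch
```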
As currently implemented, module_bl_mynn.F90 does not appear to be a good candidate for a single version that runs efficiently on both the GPU and the CPU.
The files changed are module_bl_mynn.F90 and bl_mynn_common.f90. The stand-alone driver is not included.
@joeolson42
Collaborator

This is fascinating but the timing is bad. The MYNN is (probably) within hours of being updated. We'll need to work on merging these changes into the updated version. Also, I'm a bit worried about the slowdown for CPUs if I read that right.

@dustinswales
Collaborator

> This is fascinating but the timing is bad. The MYNN is (probably) within hours of being updated. We'll need to work on merging these changes into the updated version. Also, I'm a bit worried about the slowdown for CPUs if I read that right.

@joeolson42 I didn't want to say this, but yeah, this will need some updating after the MYNN stuff is merged into the NCAR authoritative repository, which will happen after the MYNN changes are merged into the UWM fork.

@yangfanglin
Collaborator

I wish someone could provide some background information about this work and describe the overall strategy for converting the entire CCPP physics package to run on GPUs. Is this project only for making MYNN EDMF GPU compliant, or is it part of a bigger project for GPU applications?

@joeolson42
Collaborator

@yangfanglin, I think this was funded by a GSL DDRF (Director's Something Research Funding) project, way back when we still had Dom. It's only a small pot of money, around $100K, for small self-contained projects. As far as I know, there is no funding for this kind of work for all of CCPP, which highlights how NOAA's patchwork funding leaves us scrambling for crumbs.

@yangfanglin
Collaborator

@joeolson42 thanks. I agree that NOAA needs to invest more in NWP model code development for GPU applications.

@middlecoff
Author

middlecoff commented Mar 24, 2023 via email

@isidorajankov

@yangfanglin I can provide a little bit of background on this work. Jacques' porting of the MYNN PBL to GPUs is related to a larger effort funded by the NOAA Software Environments for Novel Architectures (SENA) program. As part of this effort, the Thompson microphysics, the GF convective scheme, and the MYNN surface layer scheme have been ported to GPUs as well. These three schemes showed notable improvement in performance on GPUs without degradation in performance on CPUs. Basically, we are targeting a full physics suite port to GPUs. This is also a collaborative project with the CCPP team, which is working on making CCPP GPU compliant to allow for comprehensive testing of a "GPU physics suite". Based on Jacques' results, work on GPU-izing the MYNN PBL scheme will have to be further evaluated, but I also think it is important to document the progress. I hope this helps.

@ligiabernardet
Collaborator

Just a clarification: while the CCPP team thinks it is important to evolve the CCPP Framework to be able to distribute physics to both CPUs and GPUs, we do not currently have a project/funding to work on this. Depending on what priorities emerge from the upcoming CCPP Visioning Workshop, we may be able to pursue this actively.

@joeolson42
Collaborator

Sorry @yangfanglin , I guess I was way off on my guess of the funding source. Clearly, I have not been involved in this process.

@yangfanglin
Collaborator

yangfanglin commented Mar 24, 2023

Good to know all the facts and activities, but this discussion about GPUs probably needs to move to a different place. A more coordinated effort would benefit all parties who are involved in developing and/or using the CCPP. GFDL, NASA/GSFC, and DOE/E3SM are also working on converting their codes but are taking different approaches.
