This commit is a GPU port of module_bl_mynn.F90. OpenACC was used for… #1005

Open
wants to merge 1 commit into
base: main

Commits on Mar 23, 2023

  1. This commit is a GPU port of module_bl_mynn.F90. OpenACC was used for…

    … the port. The code was run with IM, the number of columns, equal to 10240. For 128 levels the GPU is 19X faster than one CPU core. For 256 levels the GPU is 26X faster than one CPU core.
    
    An OpenACC directive was added to bl_mynn_common.f90.
    While OpenACC directives are ignored by CPU compilations, extensive changes to module_bl_mynn.F90 were required to optimize for the GPU. Consequently, the GPU port of module_bl_mynn.F90 reproduces the original CPU results bit-for-bit when run on the CPU, but runs 20% slower there. The GPU run produces results that are within roundoff of the original CPU result.
    The porting method was to create a standalone driver for testing on the GPU. A kernels directive was applied to the outer I loop over columns so that iterations of the outer loop are processed simultaneously. Inner loops are vectorized where possible.
    Some of the GPU optimizations were:
    Allocation is slow on the GPU. Automatic arrays are allocated upon subroutine entry, so they are costly on the GPU. Consequently, automatic arrays were changed to arrays passed in as arguments and given an extra dimension indexed by the outer I loop, so allocation happens only once.
    Variables in vector loops must be private to prevent conflicts, which means they are allocated at the beginning of the kernel. To prevent allocation each time the I loop runs, large private arrays were also promoted to arrays indexed by the outer I loop, so allocation happens only once, outside the kernel.
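
The array promotion described above can be illustrated with a small Python sketch. It is only an analogy for the Fortran change, not code from module_bl_mynn.F90: the routine names, the smoothing formula, and the `work_all` workspace are hypothetical. The first routine allocates its scratch storage on every call, the way an automatic array does; the second receives a workspace that was allocated once for all columns and uses only the slice belonging to column `i`.

```python
# Hypothetical illustration; names and the smoothing formula are assumptions,
# not taken from module_bl_mynn.F90.

def smooth_column_auto(profile):
    """Per-call scratch: a fresh work list is allocated on every call,
    like a Fortran automatic array allocated at subroutine entry."""
    n = len(profile)
    work = [0.0] * n                      # allocated on each call
    for k in range(n):
        lo = profile[max(k - 1, 0)]
        hi = profile[min(k + 1, n - 1)]
        work[k] = 0.25 * lo + 0.5 * profile[k] + 0.25 * hi
    return list(work)


def smooth_column_promoted(profile, work_all, i):
    """Promoted scratch: the caller owns work_all, allocated once for all
    columns, and this routine only writes into column i's slice."""
    n = len(profile)
    work = work_all[i]                    # no allocation here
    for k in range(n):
        lo = profile[max(k - 1, 0)]
        hi = profile[min(k + 1, n - 1)]
        work[k] = 0.25 * lo + 0.5 * profile[k] + 0.25 * hi
    return list(work)


def run_all(columns):
    """Driver: allocate the workspace once, then process every column."""
    n_cols, n_lev = len(columns), len(columns[0])
    work_all = [[0.0] * n_lev for _ in range(n_cols)]   # single allocation
    return [smooth_column_promoted(columns[i], work_all, i)
            for i in range(n_cols)]
```

Because each column touches only its own slice of `work_all`, columns can be processed in any order without conflicts, which is the property the promoted arrays provide to the outer I loop.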
    Speedup is limited by DO loops containing dependencies, which cannot be vectorized and instead run on a single GPU thread. The predominant type is the loop-carried dependency, which occurs when an iteration of a loop depends on values calculated in an earlier iteration. Many of these loops search for a value and then exit. There are also many calls to tridiagonal solvers, which have loop-carried dependencies. After the other optimizations, the tridiagonal solvers use 29% of the total GPU runtime.
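
To make the loop-carried dependency concrete, here is a generic textbook Thomas algorithm in Python. It is a sketch for illustration only and is not the solver used in module_bl_mynn.F90; the function name and argument layout are assumptions. Both sweeps read a value produced by the previous iteration, so the levels of one column must be processed in order.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system (generic textbook form, assumed here).

    a: sub-diagonal, b: main diagonal, c: super-diagonal, d: right-hand side.
    All four lists have length n; a[0] and c[n-1] do not affect the result.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: iteration k needs cp[k-1] and dp[k-1] from iteration k-1.
    for k in range(1, n):
        denom = b[k] - a[k] * cp[k - 1]
        cp[k] = c[k] / denom
        dp[k] = (d[k] - a[k] * dp[k - 1]) / denom
    # Back substitution: iteration k needs x[k+1] from the previous iteration.
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k] * x[k + 1]
    return x
```

Different columns are still independent of one another, so this kind of solver can run in parallel across the outer I loop even though each individual solve is sequential in the vertical.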
    Some value-searching loops were rearranged to allow vectorization. Further speedup could be achieved by restructuring more of the value-searching loops so that they vectorize. Parallel tridiagonal solvers exist, but they would not be bit-for-bit with the current solvers and so should be implemented in cooperation with a physics expert.
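
The commit message does not say how the value-searching loops were rearranged. One common pattern, shown here in Python as an assumption rather than as the method used in this port, is to replace the early exit with a minimum reduction over independent per-level candidates. The function names and the "first level above a threshold" search are hypothetical.

```python
def first_level_above_exit(values, threshold):
    """Search-and-exit form: stops at the first match, so the iterations
    must run in order."""
    for k, v in enumerate(values):
        if v > threshold:
            return k
    return len(values)                    # sentinel: no level found


def first_level_above_reduction(values, threshold):
    """Reduction form: every level yields an independent candidate index
    (k on a match, n otherwise), and a min-reduction selects the first match."""
    n = len(values)
    candidates = [k if values[k] > threshold else n for k in range(n)]
    return min(candidates, default=n)
```

The reduction form visits every level instead of stopping early, so it does more arithmetic, but no iteration depends on another, which is what lets a compiler map the loop onto vector lanes.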
    As currently implemented, the routine module_bl_mynn.F90 does not appear to be a good candidate for one version running efficiently on both the GPU and CPU.
    The routines changed are module_bl_mynn.F90 and bl_mynn_common.f90. The standalone driver is not included.
    middlecoff committed Mar 23, 2023
    Commit 67e12cf