… the port. The code was run with IM, the number of columns, equal to 10240. For 128 levels the GPU is 19X faster than one CPU core. For 256 levels the GPU is 26X faster than one CPU core.
An OpenACC directive was added to bl_mynn_common.f90
While OpenACC directives are ignored by CPU compilations, extensive changes to module_bl_mynn.F90 were required to optimize for the GPU. Consequently, the GPU port of module_bl_mynn.F90, while producing bit-for-bit CPU results, runs 20% slower on the CPU. The GPU run produces results that are within roundoff of the original CPU result.
The porting method was to create a stand alone driver for testing on the GPU. A kernels directive was applied to the outer I loop over columns so iterations of the outer loop are processed simultaneously. Inner loops are vectorized where possible.
Some of the GPU optimizations were:
Allocation is slow on the GPU. Automatic arrays are allocated upon subroutine entry so they are costly on the GPU. Consequently, automatic arrays were changed to arrays passed in as arguments and promoted to arrays indexed to the outer I loop so allocation happens only once.
Variables in vector loops must be private to prevent conflicts which means allocation at the beginning of the kernel. To prevent allocation each time the I loop runs, large private arrays were promoted to arrays indexed to the outer I loop so allocation happens only once outside the kernel.
Speedup is limited by DO LOOPS containing dependencies which cannot be vectorized but run on one GPU thread. The predominant dependency type is loop carried dependencies. A loop carried dependency occurs when a loop depends on values calculated in an earlier iteration. Many of these loops search for a value and then exit. There are many calls to tridiagonal solvers which have loop carried dependencies. After other optimizations, Tridiagonal solvers use 29% of the total GPU runtime.
Some value searching loops were rearranged to allow vectorization. Further speedup could be achieved by restructuring more of the value searching loops so they would vectorize. Parallel tridiagonal solvers exist but would not be bit-for-bit with the current solvers and so should be implemented in cooperation with a physics expert.
As currently implemented, the routine module_bl_mynn.F90 does not appear to be a good candidate for one version running efficiently on both the GPU and CPU.
Routines changed are module_bl_mynn.F90 and bl_mynn_common.f90. The stand alone driver is not included.