Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable GPU execution of atm_recover_large_step_variables via OpenACC #1220

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

gdicker1
Copy link
Collaborator

This PR enables GPU calculation of post-acoustic step variable reconstitution by adding OpenACC directives to the atm_recover_large_step_variables_work routine.

Timing information for the OpenACC data transfers in this routine is captured in the log file by a new timer: atm_recover_large_step_variables [ACC_data_xfer]

Invariant fields used in the work routine are copied in during mpas_atm_dynamics_init and deleted in mpas_atm_dynamics_finalize by building off the fields already handled in previous PRs.

@gdicker1
Copy link
Collaborator Author

NOTE: the changes in this PR slightly change answers. The changes seem to come from a couple of loops. In my regional test case, if the calculations of rho_p, rho_zz, rtheta_p, exner, and pressure_p are done on the CPU instead, no differences occur.

First this loop

         do k = 1, nVertLevels
            rho_p(k,iCell) = rho_p_save(k,iCell) + rho_pp(k,iCell)
            rho_zz(k,iCell) = rho_p(k,iCell) + rho_base(k,iCell)
         end do

And in this loop

            do k = 1, nVertLevels
               rtheta_p(k,iCell) = rtheta_p_save(k,iCell) + rtheta_pp(k,iCell) &
                                 - dt * rho_zz(k,iCell) * rt_diabatic_tend(k,iCell)
               theta_m(k,iCell) = (rtheta_p(k,iCell) + rtheta_base(k,iCell))/rho_zz(k,iCell)
               exner(k,iCell) = (zz(k,iCell)*(rgas/p0)*(rtheta_p(k,iCell)+rtheta_base(k,iCell)))**rcv
               ! pressure_p is perturbation pressure
               pressure_p(k,iCell) = zz(k,iCell) * rgas * (exner(k,iCell)*rtheta_p(k,iCell)+rtheta_base(k,iCell)  &
                                                          * (exner(k,iCell)-exner_base(k,iCell)))
            end do

@mgduda mgduda self-requested a review July 31, 2024 21:29
@mgduda mgduda added Atmosphere OpenACC Work related to OpenACC acceleration of code labels Jul 31, 2024
@mgduda
Copy link
Contributor

mgduda commented Aug 2, 2024

Although there isn't anything in particular that looks suspect to me, it does seem like the differences in results after the first timestep of a 12-km regional simulation comparing with the current develop branch are a bit larger than I would have expected (based on the differences we see before and after the merge of #1216).

I also get a core dump at the end of the simulation on Derecho when using the following modules:

module --force purge
ml ncarenv/23.09    
ml craype/2.7.23    
ml nvhpc/24.3
ml ncarcompilers/1.0.0
ml cray-mpich/8.1.27
ml parallel-netcdf/1.12.3
ml cuda/12.2.1

It may be worth taking a second look at the changes in this PR to see if we can find anything that's subtly off.

This includes splitting a long do loop on iCell so an if condition on
rk_step is outside the do loop. Though this condition shouldn't lead to
warp divergence, it can still be helpful to do branch evaluation on the
device best suited for it (the CPU). Since there was also limited data
re-use in the long do loop, it may help to reduce kernel launch
overhead.

Also line up whitespace so the bdyMaskCell if condition is apparent and
clean up some lines that end with whitespace.
…work

This commit adds an initial port of this routine using OpenACC. More
changes are needed to improve performance.
This commit ensures the invariant fields used during this work routine
are present on the device from model startup to model shutdown.

It builds on the changes in PR MPAS-Dev#1176 to copyin invariant fields during
mpas_atm_dynamics_init and delete them from the device during
mpas_atm_dynamics_finalize.
These changes ensure that the other, non-invariant, fields are available
on the device during this routine. Some fields that are overwritten are
only created at the beginning, while others are copied in. Any fields
that were unmodified right-hand side fields are deleted at the end and
modified left-hand side fields are copied out.

Timing for these transfers are reported in the output log file in the
new timer: 'atm_recover_large_step_variables [ACC_data_xfer]'.
@gdicker1 gdicker1 force-pushed the atmosphere/acc_recover_large_step branch from 168d6e3 to 4a923ad Compare August 29, 2024 21:42
@gdicker1
Copy link
Collaborator Author

@mgduda, this is ready for another round of review!

There are slight answer differences, but isolated just to exner and pressure_p from what I can tell. No answer differences occur if the code below is added after the loop to calculate rtheta_p, theta_m, exner, and pressure_p (near line 2892 in this PR).

         ! Calculate exner and pressure_p on CPU
         !$acc update host(rtheta_p)
         do iCell=cellStart,cellEnd
            do k = 1, nVertLevels
               exner(k,iCell) = (zz(k,iCell)*(rgas/p0)*(rtheta_p(k,iCell)+rtheta_base(k,iCell)))**rcv
               ! pressure_p is perturbation pressure
               pressure_p(k,iCell) = zz(k,iCell) * rgas * (exner(k,iCell)*rtheta_p(k,iCell)+rtheta_base(k,iCell)  &
                                                          * (exner(k,iCell)-exner_base(k,iCell)))
            end do
         end do
         !$acc update device(exner,pressure_p)

@mgduda
Copy link
Contributor

mgduda commented Dec 11, 2024

@gdicker1 If I use the current HEAD of the develop branch (ce4d670) as a baseline, I find that in addition to the code you mention above to compute exner and pressure_p on the host, I also need to add

         rw(1,iCell) = 0.0

and

         rw(nVertLevels+1,iCell) = 0.0

around lines 2860 and 2873 (in this PR branch) in order to get identical results with the baseline.
It makes sense to me that this might be needed, since we create rather than copyin the rw field; but what surprises me is that this isn't needed in the PR branch (if we use 63fdcf7 as the baseline, and don't merge onto the HEAD of the develop branch). Perhaps we are just lucky in getting some zeros in the memory that's given to us on the device?

If you agree that we should be initializing the rw field at the first and last level, can you add those changes to the PR branch?

@gdicker1
Copy link
Collaborator Author

@gdicker1 If I use the current HEAD of the develop branch (ce4d670) as a baseline, I find that in addition to the code you mention above to compute exner and pressure_p on the host, I also need to add ...

I agree with those changes, and will get them pushed. I would assume we were just getting lucky before this point!

…bles_work

Since rw is created on the device in this PR, all values need to be
updated to get correct results.
@gdicker1
Copy link
Collaborator Author

@mgduda this is ready for re-review. Apologies for not pinging after I pushed cde6447.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Atmosphere OpenACC Work related to OpenACC acceleration of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants