
Enable GPU execution of atm_rk_integration_setup via OpenACC #1223

Open
wants to merge 2 commits into develop from atmosphere/port_atm_rk_integration_setup

Conversation

@abishekg7 (Collaborator) commented Aug 6, 2024

This PR enables GPU execution of the atm_rk_integration_setup subroutine.

It is an initial OpenACC port of the array assignments in atm_rk_integration_setup. The ACC loops with gang vector collapse(2) had to be split into separate ACC loop statements to achieve correct results.
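To make the description concrete, here is a rough sketch of the kind of restructuring described; the loop bounds and array names are borrowed from the example quoted later in this thread, not from the actual diff:

      ! Collapsed form that produced incorrect results in some loops:
      !
      !    !$acc parallel
      !    !$acc loop gang vector collapse(2)
      !    do iEdge = edgeStart,edgeEnd
      !       do k = 1,nVertLevels
      !          ru_save(k,iEdge) = ru(k,iEdge)
      !       end do
      !    end do
      !    !$acc end parallel
      !
      ! Split form, with separate gang and vector loop directives:

      !$acc parallel
      !$acc loop gang
      do iEdge = edgeStart,edgeEnd
         !$acc loop vector
         do k = 1,nVertLevels
            ru_save(k,iEdge) = ru(k,iEdge)
         end do
      end do
      !$acc end parallel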

@abishekg7 abishekg7 changed the base branch from master to develop August 6, 2024 16:52
@abishekg7 abishekg7 force-pushed the atmosphere/port_atm_rk_integration_setup branch from 249a183 to 7badda5 Compare August 6, 2024 17:03
@abishekg7 abishekg7 marked this pull request as ready for review August 6, 2024 17:28
@mgduda mgduda self-requested a review August 6, 2024 17:45
@mgduda mgduda added the Atmosphere and OpenACC labels Aug 6, 2024
@abishekg7 abishekg7 force-pushed the atmosphere/port_atm_rk_integration_setup branch from 7badda5 to 40b7b75 Compare August 6, 2024 17:51
@gdicker1 (Collaborator) commented Sep 9, 2024

@abishekg7 I tried the changes in this PR, and everything works as expected. Still, I wonder if we could keep using collapse clauses?

I think mixing the levels of parallelism with the collapse clauses is what causes the issues. I was able to get results to reproduce by using just acc loop collapse(2) (collapse(3) in one place). From the best practices guide, I also think it might be fine to do something like acc loop gang collapse(2) or acc loop vector collapse(2). My preference is not to state the level of parallelism, so the compiler gets it right for me.
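A small sketch of that second alternative, using the same illustrative loop as the example further down this thread (not code from the PR):

      ! A single, explicit level of parallelism combined with collapse:
      !$acc loop gang collapse(2)        ! or: !$acc loop vector collapse(2)
      do iEdge = edgeStart,edgeEnd
         do k = 1,nVertLevels
            ru_save(k,iEdge) = ru(k,iEdge)
         end do
      end do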

@mgduda (Contributor) commented Sep 12, 2024

@gdicker1 I don't think I quite follow your comment. If my understanding is correct, we found that collapsing loops with different levels of parallelism leads to incorrect results, and the only place where we could collapse vector loops already has a collapse(2) clause. In which instances, specifically, would you suggest not explicitly stating the level of parallelism? Perhaps I'm distrustful of compilers, but it seems to me that being explicit about our intent as developers should lead to better outcomes than hoping the compiler makes good decisions.

@mgduda (Contributor) commented Sep 12, 2024

I think it might be worth taking a fresh look at the commit message

    initial OpenACC port of atm_rk_integration_setup
    
    - splitting up the loop gang vector collapse(2) into two separate loops, as it leads
      to erroneous results otherwise

and the PR description. The text about splitting up the loop may not make sense to anyone who doesn't know the history of the porting of the atm_rk_integration_setup routine, so that remark could either be omitted or rewritten. Perhaps we could take "no change in results" as true unless stated otherwise, but even so it would be good to state in the commit message that there are no changes to results.

@gdicker1 (Collaborator)

@gdicker1 I don't quite think I follow your comment...

@mgduda, I agree that being prescriptive about the parallelism should be the best practice.

With what I'd call a fully collapsible loop (e.g. a 2-level loop nest, with code only in the innermost loop body and not nested inside another loop), I am simply unsure what level of parallelism the compiler assigns when the loops are collapsed. That's where my preference to let "the compiler get it right for me" comes from, and it applies only to the fully collapsible loops.

I think the following example would be fine:

      !$acc loop collapse(2)
      do iEdge = edgeStart,edgeEnd
         do k = 1,nVertLevels
            ru_save(k,iEdge) = ru(k,iEdge)
            u_2(k,iEdge) = u_1(k,iEdge)
         end do
      end do

(I also would have expected !$acc loop gang vector collapse(2) to be correct before this PR, since I think it matches the build output I got for the code example above: "1663, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x".)

@abishekg7 abishekg7 force-pushed the atmosphere/port_atm_rk_integration_setup branch from 40b7b75 to 88859a6 Compare October 17, 2024 20:21
@mgduda (Contributor) commented Dec 11, 2024

In my testing, the results differ between the develop branch before and after a local merge of this PR. I'm not sure why, but it would be helpful if someone else could check one way or the other whether results are bit-wise identical under this PR.
@abishekg7 and @gdicker1 Would you both be able to try a quick test?

Here is the module setup I'm using on Derecho:

module --force purge
module load ncarenv/23.09
module load craype/2.7.23
module load nvhpc/24.9
module load ncarcompilers/1.0.0
module load gcc-toolchain/13.2.0
module load cray-mpich/8.1.29
module load cuda/12.2.1
module load parallel-netcdf/1.12.3

I'm building with make nvhpc CORE=atmosphere OPENACC=true.
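For anyone repeating the test, one way to check bit-for-bit identity between two history files is a netCDF comparison tool such as nccmp, if it's available in the environment; the run directory names below are placeholders:

nccmp -d -f -s baseline_run/history.2019-09-01_00.06.00.nc test_run/history.2019-09-01_00.06.00.nc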

@abishekg7 (Collaborator, Author) commented

Yes, I'm seeing some differences too. Let me investigate what I've changed.

@mgduda (Contributor) commented Dec 12, 2024

I've been experimenting since my previous comment, and I'm still not convinced that the code is incorrect. Simply changing which variables we copy in a given cell/level nested loop leads to different results. So it would seem that either we're doing something very subtle wrong, or the compiler is generating the wrong GPU code or data movements.

@abishekg7 (Collaborator, Author) commented

Yeah, this is quite weird. I don't remember changing much after the last time I verified the results. Probably something subtle.

- Removing the condition for obtaining num_scalars in subroutine atm_srk3. This
  condition introduced issues when running the Jablonowski-Williamson dycore
  case
This commit provides an interim fix for a potential issue in limited area runs
relating to the uninitialized garbage cells in the 2nd time level of theta_m. During
the OpenACC port of atm_rk_integration_setup, we noticed discrepancies with reference
limited area runs when we used an ACC DATA CREATE statement for theta_m_2, instead of an
ACC DATA COPYIN.

Upon further investigation, these discrepancies were traced to tend_theta accessing
uninitialized values of theta_m via flux_arr in the atm_compute_dyn_tend_work subroutine.
These specific cells are later set to the correct values from the LBCs in the
atm_bdy_adjust_dynamics_speczone_tend subroutine; however, rthdynten is computed between
these two blocks of code and does not benefit from the correct LBCs. rthdynten then feeds
back into certain convective schemes, thus altering the final results.

The longer-term fix may involve moving the computation of rthdynten to follow
the call to atm_bdy_adjust_dynamics_speczone_tend. This commit provides an interim fix
by explicitly initializing the garbage cells of theta_m_2.
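A hypothetical sketch of what such an interim fix could look like; the index range for the garbage cells and the kind parameter are assumptions, not taken from the actual commit:

      ! Hypothetical illustration only: give the garbage cells in the 2nd time level
      ! of theta_m a defined value so that downstream computations (e.g. tend_theta
      ! via flux_arr in atm_compute_dyn_tend_work) never read uninitialized memory.
      theta_m_2(:, nCellsSolve+1:nCells+1) = 0.0_RKIND   ! index range is an assumption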
@abishekg7 abishekg7 force-pushed the atmosphere/port_atm_rk_integration_setup branch from 31d0d97 to 3d01f98 Compare December 28, 2024 00:24
@mgduda mgduda requested a review from gdicker1 January 2, 2025 20:59
@gdicker1 (Collaborator) left a comment


This PR seems good to me, except that I'm seeing mismatches for depv_dt_fric and depv_dt_fric_pv in the history files when comparing runs of the regional test case.

@mgduda, could you give your thoughts on whether the differences with these variables matter? I'm seeing similar differences here, in PR #1238, and in PR #1241.

Comparing files: f1 - f2
        f1=baseline_acc_1735844488/history.2019-09-01_00.06.00.nc
        f2=test_att1735694982/history.2019-09-01_00.06.00.nc

            Variable  Min       Max
=========================================
initial_time is not a numeric field and will not be compared
xtime is not a numeric field and will not be compared
        depv_dt_fric -32.448421  33.315609
     depv_dt_fric_pv -32.448421  30.517973
mminlu is not a numeric field and will not be compared

@mgduda (Contributor) commented Jan 2, 2025

@gdicker1 The depv_dt_fric and depv_dt_fric_pv fields are purely diagnostic, and if they're the only fields that are differing, I think we should be fine. Have you compared restart files rather than history files? Are all fields in the restart files bitwise identical?

@gdicker1 (Collaborator) commented Jan 3, 2025

Have you compared restart files rather than history files? Are all fields in the restart files bitwise identical?

@mgduda, yes and yes. The restart files seemed to match, and a closer look (without tolerances) shows they are bitwise identical.

Labels: Atmosphere, OpenACC