Enable GPU execution of atm_rk_integration_setup via OpenACC #1223
base: develop
Conversation
Force-pushed from 249a183 to 7badda5, then from 7badda5 to 40b7b75.
@abishekg7 I tried the changes in this PR; everything works as expected. Though I wonder if we could still use [...] I think mixing the levels of parallelism and the [...]
@gdicker1 I don't think I quite follow your comment. If my understanding is correct, we found that collapsing loops with different levels of parallelism leads to incorrect results, and the only place where we could collapse vector loops already has a [...]
I think it might be worth taking a fresh look at the commit message and PR description. The text about splitting up the loop might not make sense to anyone who doesn't know the history of the porting of the [...]
@mgduda, I agree that being prescriptive with the parallelism should be best practice. With what I'd call a fully collapse-able loop (e.g. a 2-level loop with code only in the innermost loop body, and not nested inside another loop), I am simply unsure what level of parallelism is assigned by the compiler when collapsed. So that's where my preference to let "the compiler get it right for me" comes from, and it applies just to the fully collapse-able loops. I think the following example would be fine:
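A minimal sketch of the kind of fully collapse-able loop being described, with the parallelism stated explicitly rather than left to the compiler (loop bounds and the array name ru_save are illustrative, not taken from the PR):

```fortran
! Two-level loop with work only in the innermost body and no
! enclosing loop: stating gang vector explicitly removes any
! ambiguity about what parallelism the compiler assigns to the
! collapsed iteration space.
!$acc parallel loop gang vector collapse(2)
do iCell = 1, nCells
   do k = 1, nVertLevels
      ru_save(k,iCell) = ru(k,iCell)
   end do
end do
```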
(I also would have expected [...]
Force-pushed from 40b7b75 to 88859a6.
In my testing, the results differ between the [...] Here is the module setup I'm using on Derecho: [...]
I'm building with [...]
Yes, I'm seeing some differences too. Let me investigate what I've changed.
I've been experimenting since my previous comment, and I'm still not convinced that the code is incorrect. Simply changing which variables we're copying in a given cell/level nested loop leads to different results. So it would either seem that there's something very subtle that we're doing wrong, or the compiler is generating the wrong GPU code or data movements.
Yeah, this is quite weird. I don't remember changing much after the last time I had verified the results. Probably something subtle.
- Removing the condition for obtaining num_scalars in subroutine atm_srk3. This condition introduced issues when running the Jablonowski-Williamson dycore case
This commit provides an interim fix for a potential issue in limited-area runs relating to uninitialized garbage cells in the 2nd time level of theta_m. During the OpenACC port of atm_rk_integration_setup, we noticed discrepancies with reference limited-area runs when we used an ACC DATA CREATE statement for theta_m_2 instead of an ACC DATA COPYIN. Upon further investigation, these discrepancies were related to tend_theta accessing uninitialized values of theta_m, via flux_arr, in the atm_compute_dyn_tend_work subroutine. These specific cells are later set to the correct values from the LBCs in the atm_bdy_adjust_dynamics_speczone_tend subroutine; however, rthdynten is computed between these two blocks of code and does not benefit from the correct LBCs. rthdynten then feeds back to certain convective schemes, thus altering the final results.

The longer-term fix may involve moving the computation of rthdynten to follow the call to atm_bdy_adjust_dynamics_speczone_tend. In the interim, this commit explicitly initializes the garbage cells of theta_m_2.
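A sketch of the interim fix described above; the exact index range spanning the garbage cells, and the names nCellsSolve, nCellsTotal, and RKIND as used here, are assumptions for illustration rather than the actual MPAS code:

```fortran
! Interim-fix sketch: give the garbage cells of the second time
! level of theta_m defined values, so that later reads through
! flux_arr in atm_compute_dyn_tend_work do not pick up
! uninitialized memory before the LBC adjustment runs.
do iCell = nCellsSolve+1, nCellsTotal
   do k = 1, nVertLevels
      theta_m_2(k,iCell) = 0.0_RKIND
   end do
end do
```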
Force-pushed from 31d0d97 to 3d01f98.
This PR seems good to me, except I'm seeing mismatches for depv_dt_fric and depv_dt_fric_pv in the history files when comparing runs of the regional test case.
@mgduda, could you give your thoughts on whether the differences with these variables matter? I'm seeing similar differences here, in PR #1238, and in PR #1241.
```
Comparing files: f1 - f2
 f1 = baseline_acc_1735844488/history.2019-09-01_00.06.00.nc
 f2 = test_att1735694982/history.2019-09-01_00.06.00.nc

Variable                    Min           Max
=============================================
initial_time is not a numeric field and will not be compared
xtime is not a numeric field and will not be compared
depv_dt_fric         -32.448421     33.315609
depv_dt_fric_pv      -32.448421     30.517973
mminlu is not a numeric field and will not be compared
```
@gdicker1 The [...]
@mgduda, Yes and yes. Restarts seemed to match, and a closer look (without tolerances) shows they are bitwise identical.
This PR enables GPU execution of the atm_rk_integration_setup subroutine: an initial OpenACC port of the array assignments in atm_rk_integration_setup. We needed to split up ACC loops with gang vector collapse(2) into separate ACC loop statements to achieve correct results.
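A sketch of the loop restructuring described above, assuming MPAS-style cell/level nested loops (the array names and bounds are illustrative, not the exact ported code):

```fortran
! Original form: one combined directive that collapses both loops
! into a single gang-vector iteration space; in this port it
! produced incorrect results.
!$acc parallel loop gang vector collapse(2)
do iCell = 1, nCells
   do k = 1, nVertLevels
      theta_m_2(k,iCell) = theta_m_1(k,iCell)
   end do
end do

! Restructured form: separate loop directives, so the outer cell
! loop runs at gang level and the inner level loop at vector level.
!$acc parallel
!$acc loop gang
do iCell = 1, nCells
   !$acc loop vector
   do k = 1, nVertLevels
      theta_m_2(k,iCell) = theta_m_1(k,iCell)
   end do
end do
!$acc end parallel
```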