UFS-dev PR#109 #1039

Closed
wants to merge 68 commits into from

Conversation

grantfirl
Collaborator

No description provided.

gthompsnWRF and others added 30 commits April 27, 2023 07:37
… was increased prev); make fewer explicit rain drop breakup from collisions with graupel when T above 0C; fix so snow/graupel only sublimate when not melting
Rollback changes to rain evaporation
fixed stratosphere warm bias and code optimization for MERRA2
1. Use ice thickness hice(i) to find the level in the lake where ice is
   zero.
2. Do not allow lake temperature to be below freezing point if there is
   no ice.
3. If there is no snow or ice, do not allow surface lake temperature to
   be below freezing point.
   These changes fixed the problem with large errors in the energy budget
   at the beginning of the cold-start run with lakes.
4. Added flag to turn on debug print statements in the CLM lake model.
…te "v3" file), correct misspelled pages "shemes" --> "schemes"
JiliDong-NOAA and others added 28 commits August 23, 2023 15:36
…essor variable

substitutions through the use of new local variables.

The changes in this commit affect 3 main areas of module_sf_mynn.F90:
1.) Subroutine SFCLAY_mynn
2.) Subroutine SFCLAY1D_mynn
3.) Subroutine GFS_zt_wat
Each of these areas is described in more detail below.

1.) SFCLAY_mynn

In the SFCLAY_mynn subroutine, it was possible to remove all #ifdef
substitutions of errmsg(len=*) for errmsg(len=200) because errmsg is not used in
any code region of this subroutine that may be run on an accelerator device.
errmsg(len=*) is therefore perfectly acceptable and can be left alone. The
OpenACC data statements within the subroutine were also updated to remove
references to errmsg since, again, it is not necessary to have errmsg on the
device for this subroutine.

2.) SFCLAY1D_mynn

- Creation of device_errmsg and device_errflg and proper syncing with errmsg
  and errflg

In the SFCLAY1D_mynn subroutine, it was also possible to remove all #ifdef
substitutions by instead creating a new local variable called device_errmsg
that is a copy of errmsg but with a fixed size of 512 such that it is acceptable
for use on the device. This is necessary because, at certain points in the
subroutine, loops that are good candidates for offloading to the device set
errmsg under certain conditions. Since these areas cannot be isolated from the
parent loop without a major rework of the loop, we must preserve a way for
errmsg to be set on the device. Since device_errmsg is a fixed size, we can do
that. However, this complicates error handling a bit, as we now have errmsg and
device_errmsg, which must be synced properly to ensure error messages are
returned to CCPP as expected. Therefore, we must keep track of when
device_errmsg is set so we know when to sync device_errmsg with errmsg. This is
done by making a new local variable called device_errflg to be device_errmsg's
complement on the device, as errflg is errmsg's complement on the host. When
device_errflg is set to a nonzero integer, we know that device_errmsg must
be synced with errmsg. This is simple to do at the end of the subroutine after
the device_errmsg on the device is copied out by OpenACC, and a new IF-block
has been added for this general case.
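
The general case described above can be sketched as follows. This is a minimal
illustration in C with OpenACC pragmas, not the Fortran code from
module_sf_mynn.F90; the function names (device_report, check_columns), the
error condition, and the message text are all hypothetical. Only the names
device_errmsg, device_errflg, errmsg, errflg, and the buffer size of 512 come
from the description above. Without an OpenACC compiler the pragmas are ignored
and the code runs serially on the host.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEVICE_ERRMSG_LEN 512  /* fixed size so the buffer can live on the device */

/* Record an error from device code. Only the first caller to bump the flag
   writes the message, so concurrent iterations do not interleave their text. */
#pragma acc routine seq
static void device_report(int *device_errflg, char *device_errmsg, const char *text)
{
    int previous;
    #pragma acc atomic capture
    previous = (*device_errflg)++;
    if (previous == 0) {
        int k = 0;
        while (k < DEVICE_ERRMSG_LEN - 1 && text[k] != '\0') {
            device_errmsg[k] = text[k];
            k++;
        }
        device_errmsg[k] = '\0';
    }
}

/* Hypothetical stand-in for a scheme whose offloaded loop may raise an error. */
int check_columns(const double *t, int n, char *errmsg, size_t errmsg_len)
{
    char device_errmsg[DEVICE_ERRMSG_LEN] = "";
    int device_errflg = 0;
    int errflg = 0;

    assert(errmsg != NULL && errmsg_len > 0);

    /* The error state travels to the device and back with the loop. */
    #pragma acc parallel loop copyin(t[0:n]) copy(device_errflg, device_errmsg)
    for (int i = 0; i < n; i++) {
        if (t[i] < 0.0) {  /* illustrative error condition only */
            device_report(&device_errflg, device_errmsg, "negative temperature");
        }
    }

    /* General-case sync on the host: a nonzero device_errflg means
       device_errmsg holds a message that must be handed to errmsg. */
    if (device_errflg != 0) {
        errflg = 1;
        snprintf(errmsg, errmsg_len, "%s", device_errmsg);
    }
    return errflg;
}
```

The caller's errmsg is only written when device_errflg is nonzero, so a
successful call leaves any existing message untouched.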

- Special case of mid-loop return (line 1417), and the creation of
  device_special_errflg and device_special_errmsg

However, there is a special case we must handle a bit differently. In the
mid-loop return statement near line 1417, we must also perform this sync to
ensure the proper errmsg is returned in the event this return is needed.
Therefore, a similar IF-block has been created within the corresponding #ifdef
near line 2027 to ensure errmsg has the proper value before the subroutine
returns. However, since this block is in the middle of the code and is
only executed on the host, we must first perform an OpenACC sync operation
to make sure the device_errmsg and device_errflg on the host match the
device_errmsg and device_errflg on the device; otherwise, incorrect values
could lead to the return statement not executing as expected.

This special case seems simple, but it hides an extra trap. If
device_errmsg and device_errflg are set on the device at any point before
this IF-block, then the return statement we moved out of the loop will now
be executed for *ANY* error message, whether that was the intended course or
not. Therefore, we need to ensure this special case is only triggered for
this specific case. Unfortunately, there appears to be no other way than to
create two additional variables (device_special_errmsg and
device_special_errflg) to isolate this case from all other error cases. With
these installed in place of just device_errmsg and device_errflg, this special
return case is now properly handled.
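
A minimal sketch of the special case, again in C with OpenACC pragmas rather
than the original Fortran. The function names (report, two_stage), the two
error conditions, and the messages are hypothetical. The sketch assumes the
"OpenACC sync operation" can be expressed with an update host directive, and
it skips the second loop instead of returning from inside the data region, so
that the region is always closed cleanly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEVICE_ERRMSG_LEN 512

/* The first caller to bump the flag also writes the message. */
#pragma acc routine seq
static void report(int *flag, char *buf, const char *text)
{
    int previous;
    #pragma acc atomic capture
    previous = (*flag)++;
    if (previous == 0) {
        int k = 0;
        while (k < DEVICE_ERRMSG_LEN - 1 && text[k] != '\0') {
            buf[k] = text[k];
            k++;
        }
        buf[k] = '\0';
    }
}

/* Hypothetical two-stage scheme: stage 2 must not run after the special error. */
int two_stage(const double *z, double *out, int n, char *errmsg, size_t errmsg_len)
{
    char device_errmsg[DEVICE_ERRMSG_LEN] = "";
    char device_special_errmsg[DEVICE_ERRMSG_LEN] = "";
    int device_errflg = 0;
    int device_special_errflg = 0;
    int early_return = 0;

    assert(errmsg != NULL && errmsg_len > 0);

    #pragma acc data copyin(z[0:n]) copy(out[0:n]) \
        copy(device_errflg, device_special_errflg, device_errmsg, device_special_errmsg)
    {
        #pragma acc parallel loop present(z[0:n], device_errflg, device_errmsg, \
            device_special_errflg, device_special_errmsg)
        for (int i = 0; i < n; i++) {
            if (z[i] > 1000.0)   /* ordinary error: reported at the end */
                report(&device_errflg, device_errmsg, "height out of range");
            if (z[i] <= 0.0)     /* special error: originally a mid-loop return */
                report(&device_special_errflg, device_special_errmsg,
                       "non-positive height");
        }

        /* The host copy is stale until refreshed, so sync before testing it.
           Testing device_errflg here instead would trigger on ANY error. */
        #pragma acc update host(device_special_errflg)
        early_return = (device_special_errflg != 0);

        if (!early_return) {
            #pragma acc parallel loop present(z[0:n], out[0:n])
            for (int i = 0; i < n; i++)
                out[i] = 2.0 * z[i];
        }
    }

    if (early_return) {
        snprintf(errmsg, errmsg_len, "%s", device_special_errmsg);
        return 1;
    }
    if (device_errflg != 0) {
        snprintf(errmsg, errmsg_len, "%s", device_errmsg);
        return 1;
    }
    return 0;
}
```

With separate flags, an out-of-range height still lets the second loop run and
is reported at the end, while a non-positive height skips the second loop.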

- Complete Ifdef/Ifndef removal not possible

Overall, due to the nature of this special case, we have no choice but to
leave the #ifdef and #ifndef preprocessor statements in place as they are
the only things capable of moving this return statement out of the loop
without additional invasive changes to how the code operates.

3.) GFS_zt_wat

In the GFS_zt_wat subroutine, since this subroutine is called on the
device from within the main I-loop of SFCLAY1D_mynn, we have no choice but
to change all errmsg and errflg usage to device_errmsg or device_errflg,
otherwise this subroutine and the entire parent loop could not be run on
the device. Therefore, all errmsg and errflg lines have been commented out
and new, comparable lines using device_errmsg and device_errflg added in
their place. Additionally, the subroutine call variable list was updated.
… for debug and other conditions.

Original problem:
-----------------

Following feedback that debug information was still desirable for OpenACC device-
executed code where possible, this change removes all preprocessor directives which
were guarding against the compilation of statements which wrote to standard output.
These directives were originally used because debug statements and other standard
output had the potential to greatly reduce performance because of the need to copy over
certain variables from the host to the device just for debug output purposes. Additionally,
when statements were located within parallel-execution regions, the output was not
guaranteed to be presented in any specific order, and the additional IF-branches in the
code would also have reduced performance, as branching is not efficient on SIMD
architectures.

Resolutions:
------------

However, with a bit of extra work, a few of these issues are alleviated to allow output to
work again as requested. First, on the data optimization side of the problem, the impact
of pulling in variables just for debugging was minimized by ensuring the data was pulled
in and resident on the GPU for the entire subroutine execution. While this increases the
memory footprint on the device, which may have very limited memory, it reduces the data
transfer related performance hit. Next, in the cases where debug output was not within
parallel regions but still needed to be executed on the GPU to show the proper values
at that state of the overall program execution, OpenACC serial regions were used.
These avoid transferring the data off the GPU mid-execution of the program just to be
shown as debug output, and they also partially solve the problem of out-of-order output.
Since debug regions are guarded by IF blocks, these serial regions do not significantly
impact performance when debug output is turned off (debug_code=0). However, the
slowdown is significant at any other debug level, which should be acceptable for
debugging situations.
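
The serial-region approach can be sketched as follows, in C with OpenACC
pragmas rather than the original Fortran. The function name scale_columns, the
arithmetic, and the debug text are hypothetical. Only debug_code and the idea
of a guarded serial region on device-resident data come from the description
above. The sketch assumes a compiler that supports the OpenACC serial
construct and printf in device code.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the number of debug lines printed; debug_code == 0 disables output. */
int scale_columns(double *u, int n, int debug_code)
{
    int lines = 0;

    assert(n > 0);

    /* Keep u resident on the device for the whole routine. */
    #pragma acc data copy(u[0:n]) copy(lines)
    {
        #pragma acc parallel loop present(u[0:n])
        for (int i = 0; i < n; i++)
            u[i] *= 2.0;

        /* Debug output between the two loops: it runs on the device in a
           serial region, so u is not copied back to the host mid-routine. */
        if (debug_code > 0) {
            #pragma acc serial present(u[0:n], lines)
            {
                printf("debug: u[0]=%f u[n-1]=%f\n", u[0], u[n - 1]);
                lines = lines + 1;
            }
        }

        #pragma acc parallel loop present(u[0:n])
        for (int i = 0; i < n; i++)
            u[i] += 1.0;
    }
    return lines;
}
```

Because the IF test sits on the host, outside the serial region, a run with
debug_code=0 never launches the serial kernel at all.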

Performance Changes:
--------------------

Overall, these changes accomplish the goal of re-enabling debugging output, but not
without a cost. GPU runtime increased when tested with 150k and 750k vertical columns
(the value of ite used in the i-loops) and debugging turned off (debug_code=0). For
150k columns, GPU runtime increased from the original baseline of 22 ms to 30 ms. For
750k columns, GPU runtime increased from the original baseline of 31 ms to 70 ms. The
impact is greater for the larger number of columns because the mid-loop IF branches
are evaluated more times on the GPU. Despite these declines in performance, the GPU
runs are still significantly faster than the CPU-only tests (8.7x and 18.7x speedups for
150k and 750k columns, respectively).
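
To put the GPU timings above in relative terms, the ratio of new to baseline
runtime can be computed directly. The helper below is a hypothetical
illustration in C; only the four timings come from the text.

```c
#include <assert.h>

/* Ratio of the new runtime to the baseline; values above 1 mean slower. */
double slowdown(double baseline_ms, double new_ms)
{
    assert(baseline_ms > 0.0);
    return new_ms / baseline_ms;
}
```

With the figures quoted above, slowdown(22.0, 30.0) is about 1.36 for 150k
columns and slowdown(31.0, 70.0) is about 2.26 for 750k columns.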

Compilation Time Changes:
-------------------------

One additional observation regarding performance is compilation time. When all
debug output is disabled (debug_code=0), compilation time is approximately 90 seconds
with the additional serial blocks, IF-branches, and so forth, as each of these requires more
work from the OpenACC compiler to generate code for the GPU. This problem is
compounded when the debug_code option is increased to either 1 (some debug output)
or 2 (full debug output). At a value of 1, compilation time jumps to approximately
12.5 minutes on the Hera GPU nodes. At a value of 2, compilation time increases further
to approximately 18.5 minutes on the same GPU nodes. The explanation for this is the
need for the OpenACC compiler to enable greater amounts of serial and branching code
that (again) are less optimal on the GPU, so the compiler must do more work to try
to optimize them as best it can.
add SPP support for G-F deep convection
Adding OpenACC statements to accelerate MYNN surface scheme performance through GPU offloading
Fixes to allow FV3_HRRR_c3 to run with gnu debug plus PR NCAR#113, NCAR#106, and NCAR#103
@grantfirl grantfirl closed this Feb 2, 2024
@grantfirl
Collaborator Author

Merged with #1040

10 participants