MPAS-A restart issue with NVHPC compiler both CPU and GPU #21
Steps to reproduce:
The error occurs during the runs started with Error at end of
Note: this was from a test of FHS94 on Derecho with the NVHPC and Intel-OneAPI compilers. The Intel-OneAPI build finished both runs.
@supreethms1809 Given my CPU-only tests in the comment above, I think we should change this title to be NVHPC-specific. I don't think GPU usage is involved here.
This is confirmed to still be an issue today. I just ran an F2000climoEW test with the NVHPC v24.3 compilers and the restart run failed. The last output in the
The next lines I would have expected are:
The last output in the
This incorporates the already merged tag for EarthWorksOrg/CAM EarthWorksOrg#21
I was able to successfully complete a restart run using the NVHPC compiler and the MPAS dynamical core. It looks like the problem was here
When I remove the "endrun" argument, the code gets past this point and completes the restart run. The problem is occurring because endrun is initially passed in as
This pattern occurs all over this file, but as far as I can see, endrun is only executed if an error is encountered, except in this function, where it is passed on as an argument. That is where it appears to fail, with a memory overwrite of 'latCell'. I'm not sure why other compilers are fine with this but the NVIDIA compiler is not. I don't know whether removing endrun as an argument is the correct fix, but it gives us a place to start discussing how we want to fix it.
Reproducer found here
Based on info from @cponder, a fix for this issue should come with NVHPC 24.9 next month.
Issue Description: EarthWorks code stops abruptly (without any error message) when we do a restart with MPAS-A as the dynamical core. We were able to narrow the issue down to the subroutine cam_mpas_update_halo in cam_mpas_subdriver.F90 and, inside cam_mpas_update_halo, to the mpas_pool_get_field_info call. More details to come.
We are facing this issue with all EarthWorks compsets (FHS94, F2000, QPC6, and fully coupled).
Compiler: nvhpc/23.5