MPAS-A restart issue with NVHPC compiler both CPU and GPU #21

Open
supreethms1809 opened this issue Jan 19, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@supreethms1809
Contributor

Issue Description: EarthWorks code abruptly stops (without any error message) when we do a restart with MPAS-A as the dynamical core. We were able to narrow the issue down to the subroutine call cam_mpas_update_halo in cam_mpas_subdriver.F90, and further to the mpas_pool_get_field_info call inside cam_mpas_update_halo. More details to come.
We are facing this issue with all EarthWorks compsets (FHS94, F2000, QPC6, and fully coupled).
Compiler: nvhpc/23.5

@gdicker1
Contributor

gdicker1 commented Mar 1, 2024

Steps to reproduce:

  1. Run create_newcase and request the NVHPC compiler when creating the case
  2. Go into the case directory and edit some configuration (e.g. STOP_N, DOUT_S, etc.)
  3. Edit options to enable a "restart run": ./xmlchange REST_OPTION=$STOP_OPT,REST_N=$STOP_N,RESUBMIT=1
  4. Run ./case.setup
  5. Run ./case.build
  6. Run ./case.submit

The error occurs during the runs started with ./case.submit: the first run succeeds as expected (e.g. it renames the run/*.log* files so they end with .gz), but the restart run fails after writing some output to the atm.log.* and cesm.log.* files (I didn't find any content in the other log files). The end of cesm.log.* contains a message about a rank dying from a signal. I suspect this is failing during part of the initialization of the atmosphere; the next lines I would expect in atm.log.* are about the variables U, V, Q, and T being set.

Error at end of cesm.log.*:

dec1014.hsn.de.hpc.ucar.edu: rank 4 died from signal 11 

Note: this was from a test of FHS94 on Derecho with the NVHPC and Intel-OneAPI compilers. The Intel-OneAPI build finished both runs.
Note: these runs were without GPU flags (i.e. they were CPU-only runs).

@gdicker1
Contributor

@supreethms1809 Given my CPU-only tests in the comment above, I think we should change this title to be NVHPC-specific. I don't think GPU usage is involved here.

@supreethms1809 supreethms1809 changed the title MPAS-A GPU restart issue MPAS-A restart issue with NVHPC compiler both CPU and GPU Mar 18, 2024
@gdicker1 gdicker1 added the bug Something isn't working label May 13, 2024
@gdicker1
Contributor

This is confirmed to still be an issue today. I just ran an F2000climoEW test with the NVHPC v24.3 compiler and the restart run failed.

The last output in the atm.log.* file is:

   i MPAS constituent mpas_from_cam_cnst(i)       i CAM constituent  cam_from_mpas_cnst(i)
 ------------------------------------------     ------------------------------------------
   1              qv*                  1          1                Q                  1
# Skipping the other lines from the table
  33            SOAE                  33         33             SOAE                 33
  34            SOAG                  34         34             SOAG                 34
 ------------------------------------------     ------------------------------------------
 * = constituent used as a moisture species in MPAS-A dycore

The next lines I would have expected are:


 vertical coordinate dycore   : Height (z) vertical coordinate
 min/max of meshScalingDel2 = 1.00000000000000 1.00000000000000
 min/max of meshScalingDel4 = 1.00000000000000 1.00000000000000

The last output in the cesm.log* file is:

dec2284.hsn.de.hpc.ucar.edu 11: /var/run/palsd/bc550ddd-1ddc-4351-bdf0-6c58e7d59bb0/files/cpu_bind: line 77: 36589 Segmentation fault      numactl -C         "${ranges[lrank]}" $*
dec2284.hsn.de.hpc.ucar.edu: rank 11 exited with code 139
dec2284.hsn.de.hpc.ucar.edu: rank 0 died from signal 15

gdicker1 added a commit to gdicker1/EarthWorks that referenced this issue Jul 18, 2024
This incorporates the already merged tag for EarthWorksOrg/CAM EarthWorksOrg#21
@sherimickelson

I was able to successfully run a restart run using the NVHPC compiler and the MPAS dynamical core.

It looks like the problem was here:

call cam_mpas_update_halo('latCell', endrun)

in subroutine cam_mpas_read_restart(restart_stream, endrun), in cam/src/dynamics/mpas/driver/cam_mpas_subdriver.F90.

When I remove the "endrun" argument, the code is able to get past this point and complete the restart run.

The problem is occurring because endrun is initially brought in as

use cam_abortutils, only: endrun

but is then declared as a procedure dummy argument,

procedure(halt_model) :: endrun

in subroutine cam_mpas_read_restart(restart_stream, endrun), which calls the subroutine where it fails, subroutine cam_mpas_update_halo(fieldName, endrun).

This pattern occurs all over this file, but as far as I can see, endrun is only executed if an error is encountered; everywhere else it is just passed along. This is where it looks to be failing with a memory overwrite of 'latCell'. I'm not sure why this is fine with other compilers but not with the NVIDIA one.

I don't know if removing endrun as an argument is the correct fix, but it gives us a place to start talking about how we want to fix it.
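For reference, the pattern described above can be reduced to a small self-contained sketch. This is not the actual CAM code; names like abort_utils, read_restart, and update_halo are placeholders, and halt_model stands in for the real abstract interface. The point is the construct itself: a module procedure (endrun) is imported and then forwarded through two levels of subroutines as a procedure(halt_model) dummy argument, which is the pattern that appears to trip NVHPC on the restart path.

module abort_utils
  implicit none
  abstract interface
    subroutine halt_model(msg)
      character(len=*), intent(in) :: msg
    end subroutine halt_model
  end interface
contains
  subroutine endrun(msg)
    character(len=*), intent(in) :: msg
    print *, 'ABORT: ', msg
    stop 1
  end subroutine endrun
end module abort_utils

module driver
  use abort_utils, only: halt_model
  implicit none
contains
  ! Mirrors cam_mpas_read_restart: receives the abort routine as a
  ! procedure dummy argument and forwards it another level down.
  subroutine read_restart(endrun_arg)
    procedure(halt_model) :: endrun_arg
    call update_halo('latCell', endrun_arg)
  end subroutine read_restart

  ! Mirrors cam_mpas_update_halo: the procedure argument is only
  ! invoked on an error path, never on a successful pass.
  subroutine update_halo(fieldName, endrun_arg)
    character(len=*), intent(in) :: fieldName
    procedure(halt_model) :: endrun_arg
    logical :: failed = .false.
    if (failed) call endrun_arg('update_halo failed for '//fieldName)
  end subroutine update_halo
end module driver

program main
  use abort_utils, only: endrun
  use driver, only: read_restart
  implicit none
  call read_restart(endrun)
end program main

A standalone reproducer along these lines (if it triggers the same crash) would also make it easier to report the problem to NVIDIA, since it removes all CESM/MPAS dependencies.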

@gdicker1
Contributor

Based on info from @cponder, a fix for this issue should come with NVHPC 24.9 next month.
