MPAS-A restart issue with NVHPC compiler both CPU and GPU #21

Open
supreethms1809 opened this issue Jan 19, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@supreethms1809
Contributor

Issue Description: EarthWorks code abruptly stops (without any error message) when we do a restart with MPAS-A as the dynamical core. We were able to narrow the issue down to the subroutine call cam_mpas_update_halo in cam_mpas_subdriver.F90, and further to the mpas_pool_get_field_info call inside cam_mpas_update_halo. More details to come.
We are facing this issue with all EarthWorks compsets (FHS94, F2000, QPC6, and fully coupled).
Compiler: nvhpc/23.5

@gdicker1
Contributor

gdicker1 commented Mar 1, 2024

Steps to reproduce:

  1. Run create_newcase and request the NVHPC compiler when creating the case
  2. Go into the case directory and edit some configuration (e.g. STOP_N, DOUT_S, etc.)
  3. Edit options to enable a "restart run": ./xmlchange REST_OPTION=$STOP_OPT,REST_N=$STOP_N,RESUBMIT=1
  4. Run ./case.setup
  5. Run ./case.build
  6. Run ./case.submit

The error occurs during the runs started with ./case.submit: the first run succeeds as expected (e.g. it renames the run/*.log* files so they end with .gz), but the restart run fails after writing some output to the atm.log.* and cesm.log.* files (I didn't find any content in the other log files). The end of cesm.log.* contains a message about a rank dying from a signal. I suspect this is failing during part of the initialization of the atmosphere; the next lines I would expect in atm.log.* are about the variables U, V, Q, and T being set.

Error at end of cesm.log.*:

dec1014.hsn.de.hpc.ucar.edu: rank 4 died from signal 11 

Note: this was from a test of FHS94 on Derecho with the NVHPC and Intel-OneAPI compilers. The Intel-OneAPI build finished both runs.
Note: these runs were without GPU flags (i.e. they were CPU-only runs).

@gdicker1
Contributor

@supreethms1809 Given my CPU-only tests in the comment above, I think we should change this title to be NVHPC-specific. I don't think GPU usage is involved here.

@supreethms1809 supreethms1809 changed the title MPAS-A GPU restart issue MPAS-A restart issue with NVHPC compiler both CPU and GPU Mar 18, 2024
@gdicker1 gdicker1 added the bug Something isn't working label May 13, 2024
@gdicker1
Contributor

This is confirmed to still be an issue today. I just ran an F2000climoEW test with the NVHPC v24.3 compiler and the restart run failed.

The last output in the atm.log.* file is:

   i MPAS constituent mpas_from_cam_cnst(i)       i CAM constituent  cam_from_mpas_cnst(i)
 ------------------------------------------     ------------------------------------------
   1              qv*                  1          1                Q                  1
# Skipping the other lines from the table
  33            SOAE                  33         33             SOAE                 33
  34            SOAG                  34         34             SOAG                 34
 ------------------------------------------     ------------------------------------------
 * = constituent used as a moisture species in MPAS-A dycore

The next lines I would have expected are:


 vertical coordinate dycore   : Height (z) vertical coordinate
 min/max of meshScalingDel2 = 1.00000000000000 1.00000000000000
 min/max of meshScalingDel4 = 1.00000000000000 1.00000000000000

The last output in the cesm.log* file is:

dec2284.hsn.de.hpc.ucar.edu 11: /var/run/palsd/bc550ddd-1ddc-4351-bdf0-6c58e7d59bb0/files/cpu_bind: line 77: 36589 Segmentation fault      numactl -C         "${ranges[lrank]}" $*
dec2284.hsn.de.hpc.ucar.edu: rank 11 exited with code 139
dec2284.hsn.de.hpc.ucar.edu: rank 0 died from signal 15

gdicker1 added a commit to gdicker1/EarthWorks that referenced this issue Jul 18, 2024
This incorporates the already merged tag for EarthWorksOrg/CAM EarthWorksOrg#21
@sherimickelson

I was able to successfully run a restart run using the NVHPC compiler and the MPAS dynamical core.

It looks like the problem was here:

call cam_mpas_update_halo('latCell', endrun)

in subroutine cam_mpas_read_restart(restart_stream, endrun), in cam/src/dynamics/mpas/driver/cam_mpas_subdriver.F90.

When I remove the "endrun" argument, the code is able to get past this point and complete the restart run.

The problem is occurring because endrun is initially brought in as

use cam_abortutils, only: endrun

but is then declared as a procedure dummy argument,

procedure(halt_model) :: endrun

in subroutine cam_mpas_read_restart(restart_stream, endrun), which calls the subroutine where it fails, subroutine cam_mpas_update_halo(fieldName, endrun).

This pattern occurs all over this file, but as far as I can see, endrun is only executed if an error is encountered; everywhere else it is just passed along. This is where it looks to be failing with a memory overwrite of 'latCell'. I'm not sure why this is fine with other compilers but not with the NVIDIA one.

I don't know if removing endrun as an argument is the correct fix, but it gives us a place to start talking about how we want to fix it.
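For reference, the pattern described above can be reduced to a small self-contained sketch. This is not the actual CAM code; names like abort_utils, read_restart, and update_halo are placeholders, and halt_model stands in for the real abstract interface. The point is the construct itself: a module procedure (endrun) is imported and then forwarded through two levels of subroutines as a procedure(halt_model) dummy argument, which is the pattern that appears to trip NVHPC on the restart path.

module abort_utils
  implicit none
  abstract interface
    subroutine halt_model(msg)
      character(len=*), intent(in) :: msg
    end subroutine halt_model
  end interface
contains
  subroutine endrun(msg)
    character(len=*), intent(in) :: msg
    print *, 'ABORT: ', msg
    stop 1
  end subroutine endrun
end module abort_utils

module driver
  use abort_utils, only: halt_model
  implicit none
contains
  ! Mirrors cam_mpas_read_restart: receives the abort routine as a
  ! procedure dummy argument and forwards it another level down.
  subroutine read_restart(endrun_arg)
    procedure(halt_model) :: endrun_arg
    call update_halo('latCell', endrun_arg)
  end subroutine read_restart

  ! Mirrors cam_mpas_update_halo: the procedure argument is only
  ! invoked on an error path, never on a successful pass.
  subroutine update_halo(fieldName, endrun_arg)
    character(len=*), intent(in) :: fieldName
    procedure(halt_model) :: endrun_arg
    logical :: failed = .false.
    if (failed) call endrun_arg('update_halo failed for '//fieldName)
  end subroutine update_halo
end module driver

program main
  use abort_utils, only: endrun
  use driver, only: read_restart
  implicit none
  call read_restart(endrun)
end program main

A standalone reproducer along these lines (if it triggers the same crash) would also make it easier to report the problem to NVIDIA, since it removes all CESM/MPAS dependencies.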

@gdicker1
Contributor

Based on info from @cponder, a fix for this issue should come with NVHPC 24.9 next month.
