model crashed at day 91.75 when compiled with S2S #2320

Open
ShanSunNOAA opened this issue Jun 11, 2024 · 40 comments

@ShanSunNOAA (Collaborator)

Description

The UFS model crashed at time step 13320 with the following error:

341: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
... ...
345: ufs_model.x 0000000005CA5A13 ice_pio_mp_ice_pi 221 ice_pio.F90
345: ufs_model.x 0000000005C85EFE ice_history_write 173 ice_history_write.F90
345: ufs_model.x 0000000005A1225A ice_history_mp_ac 4130 ice_history.F90
345: ufs_model.x 0000000005C7B813 cice_runmod_mp_ci 369 CICE_RunMod.F90
345: ufs_model.x 00000000059BED10 ice_comp_nuopc_mp 1179 ice_comp_nuopc.F90
345: ufs_model.x 00000000006A1B08 Unknown Unknown Unknown

see details in /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log

To Reproduce:

The experiment used the following hash of https://github.com/ufs-community/ufs-weather-model
commit 5bec704
Author: Brian Curtis [email protected]
Date: Fri May 31 14:52:06 2024 -0400

Additional context

The model was able to run 9 months successfully with an earlier hash in April:
commit 45c8b2a
Author: JONG KIM [email protected]
Date: Thu Apr 4 19:49:13 2024 -0400

Output

The log containing the error message is available at /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log. I'd be happy to provide more details if needed. Thanks.

ShanSunNOAA added the bug label on Jun 11, 2024
@junwang-noaa (Collaborator)

@DeniseWorthen may I ask if there is any restriction on CICE file names for long forecast times?

@DeniseWorthen (Collaborator)

@junwang-noaa This doesn't seem to be an issue related to filename length. Files are always named as yyyy-mm-dd-secsofday in the model. The line it is pointing to in ice_pio makes me believe this is more likely related to the previous sym-link issue.

Do you have the associated ice_diag.d file? That is the log that ice writes out.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen @junwang-noaa Thanks for the quick responses from both of you. I think I found out why. It is this line in cice_wrapper_mod.F90:
write(filename,'(a,i3.3)')'log.ice.f',int(hour)
which needs to be (i4.4) to accommodate 4-digit hours. I am testing it now to make sure this is the only change needed.

Thanks!
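
For illustration, a minimal standalone sketch of that overflow behavior (the hour value 2202 here is just 91.75 days expressed in hours, not taken from the log):

    program format_overflow
      implicit none
      character(len=32) :: filename
      integer :: hour

      hour = 2202   ! 91.75 days * 24

      ! The original i3.3 edit descriptor overflows for hour >= 1000;
      ! Fortran fills the field with asterisks, giving 'log.ice.f***'.
      write(filename,'(a,i3.3)') 'log.ice.f', hour
      print *, trim(filename)

      ! Widening the descriptor to i4.4 handles 4-digit forecast hours.
      write(filename,'(a,i4.4)') 'log.ice.f', hour
      print *, trim(filename)
    end program format_overflow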

@junwang-noaa (Collaborator)

It might be the cause. I saw a file 'log.ice.f***' in the run directory. We may still run into problems for climate runs if using "i4.4". In fv3atm, we use:

    ndig = max(log10(nf_hours+0.5)+1., 3.)
    write(cform, '("(I",I1,".",I1,")")') ndig, ndig
    write(cfhour, cform) nf_hours

to get the number of digits for the forecast hour.
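
As a self-contained illustration of that approach (a hypothetical standalone program, not the fv3atm source; it assumes the forecast hour nf_hours is an integer):

    program variable_width_fhour
      implicit none
      integer :: nf_hours, ndig
      character(len=16) :: cform, cfhour

      nf_hours = 2202
      ! At least 3 digits, more when the forecast hour needs them.
      ndig = int(max(log10(real(nf_hours) + 0.5) + 1., 3.))
      ! Build a run-time format such as (I4.4) and apply it to the hour.
      write(cform, '("(I",I1,".",I1,")")') ndig, ndig
      write(cfhour, cform) nf_hours
      print *, 'log.ice.f'//trim(cfhour)   ! prints log.ice.f2202
    end program variable_width_fhour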

@ShanSunNOAA (Collaborator, Author)

@junwang-noaa Thanks for the info. FV3 uses "max" for the forecast hours so the model won't crash. Can we borrow the same formula to get the forecast hour in CICE?

@DeniseWorthen (Collaborator)

Both ATM and CICE write to a specific file name. In MOM, this was just added as a write to stdout. We should probably fix that. For MOM and CICE, we'd need to "find" the number of hours from the driver clock since currently fhmax is only in model_configure.
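
A rough sketch of what pulling the elapsed hours from the driver clock could look like (a hypothetical helper, assuming an ESMF_Clock is available in the cap; not code from any of the components):

    subroutine get_elapsed_hours(clock, hours, rc)
      use ESMF
      implicit none
      type(ESMF_Clock), intent(in)  :: clock
      integer,          intent(out) :: hours
      integer,          intent(out) :: rc

      type(ESMF_Time)         :: currTime, startTime
      type(ESMF_TimeInterval) :: elapsed

      ! Elapsed forecast time = current clock time minus the run start time.
      call ESMF_ClockGet(clock, currTime=currTime, startTime=startTime, rc=rc)
      if (rc /= ESMF_SUCCESS) return
      elapsed = currTime - startTime
      call ESMF_TimeIntervalGet(elapsed, h=hours, rc=rc)
    end subroutine get_elapsed_hours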

@DeniseWorthen (Collaborator)

@ShanSunNOAA Unless you're actually compiling runtime statistics (i.e., the time for each model advance), I would just set RunTimeLog = false in the allcomp part of ufs.configure for now.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen Thanks for your info. I cannot find RunTimeLog in the log file generated by workflow. Where do I set it in the workflow configuration? Thanks!

@junwang-noaa (Collaborator)

@DeniseWorthen In FV3ATM, we compute the number of digits from the forecast hour, not fhmax. Can we use "hour" in ufs_logfhour to compute the format of fhour in the log file name? Also, it looks to me like RunTimeLog is not used to control writing the CICE log file; I saw the code in nuopc/cmeps/CICE_RunMod.F90:

         if (my_task == master_task) then
            do ns = 1,nstreams
               if (write_history(ns) .and. histfreq(ns) .eq. 'h') then
                  call calendar_sec2hms(msec,hh,mm,ss)
                  write(logmsg,'(6(i4,2x))')myear,mmonth,mday,hh,mm,ss
                  call ufs_logfhour(trim(logmsg),timesecs/c3600)
               end if
            end do
         end if
      end subroutine ice_step
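
One possible shape for that change (a sketch only, adapting the fv3atm formula inside ufs_logfhour; the use of ice_kinds_mod for the kind parameters is assumed, and this has not been tested):

    subroutine ufs_logfhour(msg, hour)
      use ice_kinds_mod, only: dbl_kind, int_kind, char_len
      character(len=*),  intent(in)    :: msg
      real(dbl_kind),    intent(in)    :: hour
      character(len=char_len)          :: filename
      character(len=16)                :: cform, cfhour
      integer(int_kind)                :: nunit, ndig

      ! Width follows the forecast hour, with a minimum of 3 digits.
      ndig = int(max(log10(hour + 0.5_dbl_kind) + 1.0_dbl_kind, 3.0_dbl_kind))
      write(cform, '("(I",I1,".",I1,")")') ndig, ndig
      write(cfhour, cform) int(hour)
      filename = 'log.ice.f'//trim(cfhour)
      open(newunit=nunit, file=trim(filename))
      write(nunit,'(a)') 'completed: cice'
      write(nunit,'(a,f10.3)') 'forecast hour:', hour
      ! (handling of msg omitted in this sketch)
      close(nunit)
    end subroutine ufs_logfhour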

@ShanSunNOAA (Collaborator, Author)

Unfortunately, after modifying the log.ice filename, the model still crashed at day 91.75, with the error:
340: ufs_model.x 0000000005B97903 ice_pio_mp_ice_pi 221 ice_pio.F90
340: ufs_model.x 0000000005B77DEE ice_history_write 173 ice_history_write.F90
340: ufs_model.x 000000000590414A ice_history_mp_ac 4130 ice_history.F90
340: ufs_model.x 0000000005B6D703 cice_runmod_mp_ci 369 CICE_RunMod.F90
340: ufs_model.x 00000000058B0C00 ice_comp_nuopc_mp 1179 ice_comp_nuopc.F90

See /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/logs/2015052100/gfsfcst.log.
The ice_diag.d file is available at /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/gfs.20150521/00/model_data/ice/history/. The RUNDIRS is available at /scratch1/BMC/gsd-fv3-dev/NCEPDEV/global/Shan.Sun/stmp2/RUNDIRS/hydro1/gfsfcst.2015052100/fcst.1576533/.

Any suggestions on how to run beyond 3 months would be greatly appreciated. How do I turn off RunTimeLog? Thanks.

@DeniseWorthen (Collaborator)

@ShanSunNOAA Sorry for the confusion. Jun is right. The feature for runtime logging as well as the logging to mark the end of the history write was put in at the same time and I didn't think through my answer.

I don't believe CICE has a concept of hours being anything other than 1-24. Let me take a look. In any case, it doesn't seem that is the issue, since your new run failed also. If it is still pointing to the ice_pio line, then I still suspect it has something to do w/ how pio is trying to write the file.

@DeniseWorthen (Collaborator)

@ShanSunNOAA I've been able to reproduce the error using the low-resolution c48-coupled model in the RTs. In this case there are no sym-linked files, so that takes one issue out of the mix. It failed at day 86.

73: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x150a7a430bcc]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x150a79e0adf1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x150a79ad9eb9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x176584) [0x150a7999e584]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x17a9f9) [0x150a799a29f9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x199a60) [0x150a799c1a60]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x150a799997ec]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b4387) [0x150a79adc387]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x150a799376e1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPI_File_open+0x17d) [0x150a7a4492bd]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpio_create+0x199) [0x150a73e592c9]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpi_create+0x4e7) [0x150a73daf4a7]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile_int+0x2e6) [0x150a7c436696]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile+0x41) [0x150a7c432451]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpiof.so(piolib_mod_mp_createfile_+0x25e) [0x150a7c1caabe]

Jun suggested a couple of further tests. I'll keep you updated.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen Thank you for your prompt response and the clever approach. Please let me know if there is anything I can assist with.

@DeniseWorthen (Collaborator)

I added the following to the cice_wrapper

index d0aafbb..4668f03 100644
--- a/cicecore/drivers/nuopc/cmeps/cice_wrapper_mod.F90
+++ b/cicecore/drivers/nuopc/cmeps/cice_wrapper_mod.F90
@@ -72,7 +72,7 @@ subroutine ufs_logfhour(msg,hour)
     real(dbl_kind),    intent(in)    :: hour
     character(len=char_len)          :: filename
     integer(int_kind)                :: nunit
-    write(filename,'(a,i3.3)')'log.ice.f',int(hour)
+    write(filename,'(a,i4.4)')'log.ice.f',int(hour)
     open(newunit=nunit,file=trim(filename))
     write(nunit,'(a)')'completed: cice'
     write(nunit,'(a,f10.3)')'forecast hour:',hour

And did the following tests:

  1. Restarted from the above run at fh=2040. The model completed 100d. So I don't think there is any model behaviour that is causing a problem somewhere near ~90 days.
  2. I made a run with CICE history turned off, and the model ran out to 100d. So the issue is definitely in the CICE history write.
  3. I made a run adding the following to the job_card (see https://convergecfd.com/getting-started-guide-v3/environment-configuration-erro.htm):
         export I_MPI_SHM_HEAP_VSIZE=16384
     and the model ran out to 100 days.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen Thanks for your investigation. Where should I adjust or add "I_MPI_SHM_HEAP_VSIZE" when running UFS in the workflow environment, i.e., in config.ice or config.fcst? Thanks.

@DeniseWorthen (Collaborator)

@ShanSunNOAA I don't know where to add this in the workflow; it is not an ice configuration specifically, and I don't know where the equivalent of the job card gets generated. For the RTs, I added it just after the OMP_STACKSIZE:

/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_2767165/dt900.shmheap/job_card

@junwang-noaa (Collaborator)

Another solution is to use Intel 2023.2.

https://community.intel.com/t5/Intel-MPI-Library/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/m-p/1370537

The Hera admins confirmed that Intel/2023.2.1 and impi/2023.2.1 are available on Hera; it's worth trying the new version of the Intel compiler to see if this issue is resolved.

@DeniseWorthen (Collaborator)

@junwang-noaa This requires a spack-stack to be built w/ these compilers, correct?

@junwang-noaa (Collaborator)

Yes, a spack-stack issue has been created.

@ShanSunNOAA (Collaborator, Author)

Forgot to mention that Jessica suggested adding "export I_MPI_SHM_HEAP_VSIZE=16384" in the HERA.env.
This has been working well for me, using the workflow to run UFS for up to 9 months. Thank you all.

@DeniseWorthen (Collaborator)

I updated the model by one commit (to the 1.6 SS) and used the updated intel compiler build provided in JCSDA/spack-stack#1147:

diff --git a/modulefiles/ufs_hera.intel.lua b/modulefiles/ufs_hera.intel.lua
index 993e0537..d27136ce 100644
--- a/modulefiles/ufs_hera.intel.lua
+++ b/modulefiles/ufs_hera.intel.lua
@@ -2,12 +2,12 @@ help([[
 loads UFS Model prerequisites for Hera/Intel
 ]])

-prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/Core")
+prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/ue-intel-2023.2.0/install/modulefiles/Core")

-stack_intel_ver=os.getenv("stack_intel_ver") or "2021.5.0"
+stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
 load(pathJoin("stack-intel", stack_intel_ver))

-stack_impi_ver=os.getenv("stack_impi_ver") or "2021.5.1"
+stack_impi_ver=os.getenv("stack_impi_ver") or "2021.10.0"
 load(pathJoin("stack-intel-oneapi-mpi", stack_impi_ver))

 cmake_ver=os.getenv("cmake_ver") or "3.23.1"

I re-ran the cpld c48 case w/o the export I_MPI_SHM_HEAP_VSIZE=16384. The model completed fh 2400, the same as the shmheap test case.

@junwang-noaa (Collaborator)

Great! So this does resolve the issue! Thanks for testing.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen Thanks for testing this solution without increasing "I_MPI_SHM_HEAP_VSIZE". That is great news.

@DeniseWorthen (Collaborator)

DeniseWorthen commented Jun 28, 2024

I've deleted my previous comment about letting the intel 2023.2 test run as long as possible. I was using the old executable. I will try a clean test.

@DeniseWorthen (Collaborator)

I repeated my test using the 2023.2 intel w/o the heap_vsize variable and it completed fh3168 (132 days) before failing w/ the same error:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr

I'll try a second test adding the vsize variable.

@DeniseWorthen (Collaborator)

The 2023.2 intel case with the addition of the heap_vsize variable completed fh6816 (284 days) before timing out.

@junwang-noaa (Collaborator)

@DeniseWorthen Do we know if setting "I_MPI_SHM_HEAP_VSIZE=16384" only will allow the test case to run through 284 days?

@DeniseWorthen (Collaborator)

@junwang-noaa That was my next step. I will re-compile w/ the default SS16 and re-run the test.

@DeniseWorthen (Collaborator)

DeniseWorthen commented Jul 1, 2024

Note, the tests with intel 2023.2 also included the addition of a call to pio_syncfile(File) in CICE, after the history write:

index b8971a8..22ac8a7 100644
--- a/cicecore/cicedyn/infrastructure/io/io_pio2/ice_history_write.F90
+++ b/cicecore/cicedyn/infrastructure/io/io_pio2/ice_history_write.F90
@@ -1234,6 +1234,8 @@ subroutine ice_write_hist (ns)
       !-----------------------------------------------------------------
       call pio_seterrorhandling(File, PIO_INTERNAL_ERROR)

+      call pio_syncfile(File)
+
       call pio_freedecomp(File,iodesc2d)
       call pio_freedecomp(File,iodesc3dv)
       call pio_freedecomp(File,iodesc3dc)

I had tested this previously with SS16 and w/o the heap variable and it did not resolve the error. I will leave it in for consistency.

@DeniseWorthen (Collaborator)

I recompiled w/ the default SS16 modules and re-ran using only the heap_vsize variable, with the queue wall clock set to the maximum of 8 hours. In this case, the model was unstable and failed with:

0: FATAL from PE     0: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
 0:
42: Image              PC                Routine            Line        Source
42: fv3_s2s.dev.exe    0000000004FF02A7  Unknown               Unknown  Unknown
42: fv3_s2s.dev.exe    00000000042C6C09  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
42: fv3_s2s.dev.exe    000000000458EDE2  mpp_efp_mod_mp_mp         195  mpp_efp.F90
42: fv3_s2s.dev.exe    00000000044D56B1  mpp_domains_mod_m         143  mpp_global_sum.fh
42: fv3_s2s.dev.exe    0000000001E319CF  fv_grid_utils_mod        3077  fv_grid_utils.F90
42: fv3_s2s.dev.exe    0000000001E6BC73  fv_mapz_mod_mp_la         794  fv_mapz.F90

The failure occurred at fh2574 (107 days).

@DeniseWorthen (Collaborator)

Per the conversation in the Scalability meeting today, the previous discussion on locating memory leaks in the system is #779.

@DeniseWorthen (Collaborator)

Using the intel2023.2 compile w/o the heap size variable, I ran for 100 days with the ProfileMemory=true setting and plotted the VmPeak value upon exiting ModelAdvance (for the atm, this was after phase2) relative to the first value (i.e., vmpeak(t)/vmpeak(t=0)). I plotted every other PE for ATM (top), OCN (mid), and ICE (bottom):

[Three plots: relative VmPeak growth for ATM (top), OCN (middle), and ICE (bottom) PEs]

@DeniseWorthen (Collaborator)

I did two tests, one with the ATM C48 test and one w/ the C96 P8 test. The C48 atm test doesn't use MERRA and uses low-res fix files (all .grb). The C96 atm test uses MERRA and the tiled fix files. Both show the same jump on the 15th of the month. The C48 (left) and C96 (right):

[Plots: memory jump on the 15th of the month, C48 (left) and C96 (right)]

@XiaqiongZhou-NOAA (Contributor)

XiaqiongZhou-NOAA commented Jul 30, 2024

I have a similar problem on WCOSS2: the coupled forecasts always crash after reaching 135 days.
export I_MPI_SHM_HEAP_VSIZE=16384 is added in wcoss2.env but it does not help.
The following is the error message:
324344 nid003457.cactus.wcoss2.ncep.noaa.gov 0: in atmos_model update, fhzero= 6.00000000000000 fhour= 3240.000
324345 0.0000000E+00
324346 nid003457.cactus.wcoss2.ncep.noaa.gov 0: zeroing coupling accumulated fields at kdt= 9720
324347 zeroing coupling accumulated fields at kdt= 9720
324348 write out restart at n_atmsteps= 9719 seconds= 11664000
324349 integration length= 3239.667
324350 nid003457.cactus.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 2, n_atmsteps = 9719 time is 0.106315
324351 aft fcst run output time= 11664000 FBcount= 8 na= 9720
324352 nid004044.cactus.wcoss2.ncep.noaa.gov 373: MPICH Notice [Rank 373] [job id d90a271d-396d-409c-a11c-1a8f6d64b40b] [Mon Jul 29 20:29:51 2024] [nid004044] - Abort(7034 7535) (rank 373 in comm 0): Fatal error in PMPI_Allreduce: Other MPI error, error stack:
324353 PMPI_Allreduce(472)..................: MPI_Allreduce(sbuf=0x7ffe305bd28c, rbuf=0x7ffe305bd29c, count=1, datatype=MPI_INT, op=op=0x98000000, comm=comm=0xc4004298) failed
324354 PMPI_Allreduce(457)..................:
324355 MPIR_CRAY_Allreduce(478).............:
324356 MPIDI_CRAY_Setup_Shared_Mem_Coll(838):
324357 MPIR_Comm_dup_impl(51)...............:
324358 MPII_Comm_copy(544)..................:
324359 MPIR_Get_contextid_sparse_group(679).: Too many communicators (0/2048 free on this process; ignore_id=0)

@LarissaReames-NOAA

LarissaReames-NOAA commented Sep 12, 2024

As a new member of the SFS team, I was asked by @junwang-noaa to look into this issue. With the help of @DeniseWorthen I was able to get a few test runs in with memory profiling.

I first wanted to make sure this memory issue wasn't present in the hash that @ShanSunNOAA reported as working previously (45c8b2a). Because the module files in that branch are no longer available, I used the modules in the current develop branch, along with a test using the same intel2023.2.0 installation that Denise used above. The same memory behavior is present in the C48 ATM test at this commit with both compilers (2021 first, 2023 second), though they aren't identical.
[Plots: C48 ATM memory behavior, intel 2021 (first) and intel 2023 (second)]

These results made me think perhaps this is a library issue associated (somehow) with NetCDF reads. I decided to try a run of the regional_control ATM test, which isn't run long enough to read new climo files on the 15th of the month but does read lateral boundary condition files (also in NetCDF format) every 180 minutes. We see similar memory jumps for these runs at every LBC read. Again, intel2021 first, 2023 second.
[Plots: regional_control memory jumps at each LBC read, intel 2021 (first) and intel 2023 (second)]

Assuming this memory behavior is the source of the long-run failures, it looks like perhaps this is not a code issue, since @ShanSunNOAA was able to use this hash successfully in the past but with different modules (and perhaps on a non-Rocky8 OS on Hera). I have a C48 run going right now with the intel 2023 executable with a full forecast length of 3200 hours to see how far it makes it. I'll report back once it's complete, and will also open a new issue specific to these memory problems as they seem more pervasive than a single coding bug.

EDIT: Note I am using more than the standard allocation of 8 processes on a single node for the C48 tests to decrease runtime. I'll make sure to switch back to a single node when I check for long run failure.

@LarissaReames-NOAA

Based on the test from my previous comment, I decided to start investigating other sources of the Assertion Failed error that @ShanSunNOAA originally reported, by running S2S regression tests. After lots of unstable forecasts with cpld_control_c48, I've been able to run several tests using cpld_control_nowave_noaero_p8 at 8a5f711, the first hash after Shan's working version.

After confirming that the Assertion Failed error does not occur with Shan's working hash, I can confirm that it does occur at 8a5f711. When I set CICE_IO="NetCDF" instead of the default "PIO" (as used in the failed runs; introduced in WM PR #2145), with restart/history_format='hdf5' or 'default' in ice_in, the failure disappears and the model runs to the 8-hour max wall clock time. As additional tests, I ran again with CICE_IO="PIO" and restart/history_format="pnetcdf1" and "pnetcdf5", and both of these runs also fail. This pretty clearly points to either the PIO library or its implementation as the source of the Assertion Failed errors.
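
For reference, the ice_in entries involved look roughly like this (a sketch only; the namelist group name setup_nml is my assumption based on the standard CICE namelist layout):

    &setup_nml
      history_format = 'hdf5'   ! 'default' also ran; 'pnetcdf1' and 'pnetcdf5' reproduced the failure with PIO
      restart_format = 'hdf5'
    /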

I did one last test this afternoon to see whether these are accumulating errors, or whether something specific in the various ICs might be triggering it. I left the ICE restart frequency the same (every 12 hours) but doubled the ICE history frequency from every 6 hours to every 3 hours. The model failed in 60% of the time (FH 1210 vs FH 2010) after writing 504 files (404 history & 100 restart), whereas the 6-hr run wrote 501 files (335 history, 167 restart). This suggests there is some accumulating memory issue that is compounded every time an ICE history or restart file is written.

@DeniseWorthen (Collaborator)

Following today's infrastructure meeting, I set up the cpld_control_c48 test case on Gaea. I used dt_atmos=900 and set the CICE history write for hourly output. I turned off the restart writing for all components.

The model gave 675 hourly output files (28 days) before failing w/ the following. Rank 17 is one of the CICE ranks (petlist 16 19). This is similar to the message posted above for WCOSS2.

17: MPICH ERROR [Rank 17] [job id 135188771.0] [Mon Oct  7 17:37:30 2024] [c5n1294] - Abort(806965007) (rank 17 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
17: PMPI_Comm_split(513)................: MPI_Comm_split(comm=0xc400314e, color=1, key=0, new_comm=0x7ffe5bbb2d74) failed
17: PMPI_Comm_split(494)................:
17: MPIR_Comm_split_impl(268)...........:
17: MPIR_Get_contextid_sparse_group(610): Too many communicators (0/2048 free on this process; ignore_id=0)
17:
17: aborting job:
17: Fatal error in PMPI_Comm_split: Other MPI error, error stack:
17: PMPI_Comm_split(513)................: MPI_Comm_split(comm=0xc400314e, color=1, key=0, new_comm=0x7ffe5bbb2d74) failed
17: PMPI_Comm_split(494)................:
17: MPIR_Comm_split_impl(268)...........:
17: MPIR_Get_contextid_sparse_group(610): Too many communicators (0/2048 free on this process; ignore_id=0)
srun: error: c5n1294: task 17: Exited with exit code 255

@DeniseWorthen (Collaborator)

I repeated my test on Gaea using CICE hash 6449f40, which is prior to both PIO-related commits (7a4b95e and aca8357). I compiled w/ PIO and used the previous namelist options for pnetcdf. The model stopped with an error identical to the one I posted above. This indicates that whatever the cause of this failure is, it predates the recent updates to PIO in CICE.

@DeniseWorthen (Collaborator)

@ShanSunNOAA Just an update. The folks at the CICE Consortium were able to replicate this in standalone CICE and have committed a fix. We'll be updating CICE soon w/ the fix. See the associated CICE issue 94.

@ShanSunNOAA (Collaborator, Author)

ShanSunNOAA commented Nov 15, 2024 via email
