model crashed at day 91.75 when compiled with S2S #2320

Open
ShanSunNOAA opened this issue Jun 11, 2024 · 40 comments

@ShanSunNOAA (Collaborator)

Description

The UFS model crashed at time step 13320 with the following error:

341: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
... ...
345: ufs_model.x 0000000005CA5A13 ice_pio_mp_ice_pi 221 ice_pio.F90
345: ufs_model.x 0000000005C85EFE ice_history_write 173 ice_history_write.F90
345: ufs_model.x 0000000005A1225A ice_history_mp_ac 4130 ice_history.F90
345: ufs_model.x 0000000005C7B813 cice_runmod_mp_ci 369 CICE_RunMod.F90
345: ufs_model.x 00000000059BED10 ice_comp_nuopc_mp 1179 ice_comp_nuopc.F90
345: ufs_model.x 00000000006A1B08 Unknown Unknown Unknown

see details in /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log

To Reproduce:

The experiment used the following hash of https://github.com/ufs-community/ufs-weather-model
commit 5bec704
Author: Brian Curtis [email protected]
Date: Fri May 31 14:52:06 2024 -0400

Additional context

The model was able to run 9 months successfully with an earlier hash in April:
commit 45c8b2a
Author: JONG KIM [email protected]
Date: Thu Apr 4 19:49:13 2024 -0400

Output

The log containing the error message is available at /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log. I'd be happy to provide more details if needed. Thanks.

ShanSunNOAA added the bug label on Jun 11, 2024
@junwang-noaa (Collaborator)

@DeniseWorthen may I ask if there is any restriction on CICE file names for long forecast times?

@DeniseWorthen (Collaborator)

@junwang-noaa This doesn't seem to be an issue related to filename length. Files are always named as yyyy-mm-dd-secsofday in the model. The line it is pointing to in ice_pio makes me believe this is more likely related to the previous sym-link issue.

Do you have the associated ice_diag.d file? That is the log that ice writes out.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen @junwang-noaa Thanks for the quick responses from both of you. I think I found out why. It is this line in cice_wrapper_mod.F90:
write(filename,'(a,i3.3)')'log.ice.f',int(hour)
which needs to be (i4.4) to accommodate 4-digit hours. I am testing it now to make sure this is the only change needed.

Thanks!
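
For illustration, a minimal standalone sketch of that overflow behavior (the hour value 2202 here is just 91.75 days expressed in hours, not taken from the log):

    program format_overflow
      implicit none
      character(len=32) :: filename
      integer :: hour

      hour = 2202   ! 91.75 days * 24

      ! The original i3.3 edit descriptor overflows for hour >= 1000;
      ! Fortran fills the field with asterisks, giving 'log.ice.f***'.
      write(filename,'(a,i3.3)') 'log.ice.f', hour
      print *, trim(filename)

      ! Widening the descriptor to i4.4 handles 4-digit forecast hours.
      write(filename,'(a,i4.4)') 'log.ice.f', hour
      print *, trim(filename)
    end program format_overflow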

@junwang-noaa (Collaborator)

It might be the cause. I saw a file 'log.ice.f***' in the run directory. We may still run into problems for climate runs if using "i4.4". In fv3atm, we use:

    ndig = max(log10(nf_hours+0.5)+1., 3.)
    write(cform, '("(I",I1,".",I1,")")') ndig, ndig
    write(cfhour, cform) nf_hours

to get the number of digits for the forecast hour.
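
As a self-contained illustration of that approach (a hypothetical standalone program, not the fv3atm source; it assumes the forecast hour nf_hours is an integer):

    program variable_width_fhour
      implicit none
      integer :: nf_hours, ndig
      character(len=16) :: cform, cfhour

      nf_hours = 2202
      ! At least 3 digits, more when the forecast hour needs them.
      ndig = int(max(log10(real(nf_hours) + 0.5) + 1., 3.))
      ! Build a run-time format such as (I4.4) and apply it to the hour.
      write(cform, '("(I",I1,".",I1,")")') ndig, ndig
      write(cfhour, cform) nf_hours
      print *, 'log.ice.f'//trim(cfhour)   ! prints log.ice.f2202
    end program variable_width_fhour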

@ShanSunNOAA (Collaborator, Author)

@junwang-noaa Thanks for the info. FV3 uses "max" for the forecast hours so the model won't crash. Can we borrow the same formula to get the forecast hour in CICE?

@DeniseWorthen (Collaborator)

Both ATM and CICE write to a specific file name. In MOM, this was just added as a write to stdout. We should probably fix that. For MOM and CICE, we'd need to "find" the number of hours from the driver clock since currently fhmax is only in model_configure.
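
A rough sketch of what pulling the elapsed hours from the driver clock could look like (a hypothetical helper, assuming an ESMF_Clock is available in the cap; not code from any of the components):

    subroutine get_elapsed_hours(clock, hours, rc)
      use ESMF
      implicit none
      type(ESMF_Clock), intent(in)  :: clock
      integer,          intent(out) :: hours
      integer,          intent(out) :: rc

      type(ESMF_Time)         :: currTime, startTime
      type(ESMF_TimeInterval) :: elapsed

      ! Elapsed forecast time = current clock time minus the run start time.
      call ESMF_ClockGet(clock, currTime=currTime, startTime=startTime, rc=rc)
      if (rc /= ESMF_SUCCESS) return
      elapsed = currTime - startTime
      call ESMF_TimeIntervalGet(elapsed, h=hours, rc=rc)
    end subroutine get_elapsed_hours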

@DeniseWorthen (Collaborator)

@ShanSunNOAA Unless you're actually compiling runtime statistics (i.e., the time for each model advance), I would just set RunTimeLog = false in the allcomp part of ufs.configure for now.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen Thanks for your info. I cannot find RunTimeLog in the log file generated by workflow. Where do I set it in the workflow configuration? Thanks!

@junwang-noaa (Collaborator)

@DeniseWorthen In FV3ATM, we compute the number of digits from the forecast hour, not fhmax. Can we use "hour" in ufs_logfhour to compute the format of fhour in the log file name? Also, it looks to me like RunTimeLog is not used to control writing the CICE log file; I saw the code in nuopc/cmeps/CICE_RunMod.F90:

         if (my_task == master_task) then
            do ns = 1,nstreams
               if (write_history(ns) .and. histfreq(ns) .eq. 'h') then
                  call calendar_sec2hms(msec,hh,mm,ss)
                  write(logmsg,'(6(i4,2x))')myear,mmonth,mday,hh,mm,ss
                  call ufs_logfhour(trim(logmsg),timesecs/c3600)
               end if
            end do
         end if
      end subroutine ice_step
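
One possible shape for that change (a sketch only, adapting the fv3atm formula inside ufs_logfhour; the use of ice_kinds_mod for the kind parameters is assumed, and this has not been tested):

    subroutine ufs_logfhour(msg, hour)
      use ice_kinds_mod, only: dbl_kind, int_kind, char_len
      character(len=*),  intent(in)    :: msg
      real(dbl_kind),    intent(in)    :: hour
      character(len=char_len)          :: filename
      character(len=16)                :: cform, cfhour
      integer(int_kind)                :: nunit, ndig

      ! Width follows the forecast hour, with a minimum of 3 digits.
      ndig = int(max(log10(hour + 0.5_dbl_kind) + 1.0_dbl_kind, 3.0_dbl_kind))
      write(cform, '("(I",I1,".",I1,")")') ndig, ndig
      write(cfhour, cform) int(hour)
      filename = 'log.ice.f'//trim(cfhour)
      open(newunit=nunit, file=trim(filename))
      write(nunit,'(a)') 'completed: cice'
      write(nunit,'(a,f10.3)') 'forecast hour:', hour
      ! (handling of msg omitted in this sketch)
      close(nunit)
    end subroutine ufs_logfhour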

@ShanSunNOAA (Collaborator, Author)

Unfortunately, after modifying the log.ice filename, the model still crashed at day 91.75, with the error:
340: ufs_model.x 0000000005B97903 ice_pio_mp_ice_pi 221 ice_pio.F90
340: ufs_model.x 0000000005B77DEE ice_history_write 173 ice_history_write.F90
340: ufs_model.x 000000000590414A ice_history_mp_ac 4130 ice_history.F90
340: ufs_model.x 0000000005B6D703 cice_runmod_mp_ci 369 CICE_RunMod.F90
340: ufs_model.x 00000000058B0C00 ice_comp_nuopc_mp 1179 ice_comp_nuopc.F90

See /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/logs/2015052100/gfsfcst.log.
The ice_diag.d file is available at /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/gfs.20150521/00/model_data/ice/history/. The RUNDIRS is available at /scratch1/BMC/gsd-fv3-dev/NCEPDEV/global/Shan.Sun/stmp2/RUNDIRS/hydro1/gfsfcst.2015052100/fcst.1576533/.

Any suggestions on how to run beyond 3 months would be greatly appreciated. How do I turn off RunTimeLog? Thanks.

@DeniseWorthen (Collaborator)

@ShanSunNOAA Sorry for the confusion. Jun is right. The feature for runtime logging as well as the logging to mark the end of the history write was put in at the same time and I didn't think through my answer.

I don't believe CICE has a concept of hours being anything other than 1-24. Let me take a look. In any case, it doesn't seem that is the issue, since your new run failed also. If it is still pointing to the ice_pio line, then I still suspect it has something to do w/ how pio is trying to write the file.

@DeniseWorthen (Collaborator)

@ShanSunNOAA I've been able to reproduce the error using the low-resolution c48-coupled model in the RTs. In this case there are no sym-linked files, so that takes one issue out of the mix. It failed at day 86.

73: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x150a7a430bcc]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x150a79e0adf1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x150a79ad9eb9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x176584) [0x150a7999e584]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x17a9f9) [0x150a799a29f9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x199a60) [0x150a799c1a60]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x150a799997ec]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b4387) [0x150a79adc387]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x150a799376e1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPI_File_open+0x17d) [0x150a7a4492bd]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpio_create+0x199) [0x150a73e592c9]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpi_create+0x4e7) [0x150a73daf4a7]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile_int+0x2e6) [0x150a7c436696]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile+0x41) [0x150a7c432451]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpiof.so(piolib_mod_mp_createfile_+0x25e) [0x150a7c1caabe]

Jun suggested a couple of further tests. I'll keep you updated.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen Thank you for your prompt response and the clever approach. Please let me know if there is anything I can assist with.

@DeniseWorthen (Collaborator)

I added the following to the cice_wrapper

index d0aafbb..4668f03 100644
--- a/cicecore/drivers/nuopc/cmeps/cice_wrapper_mod.F90
+++ b/cicecore/drivers/nuopc/cmeps/cice_wrapper_mod.F90
@@ -72,7 +72,7 @@ subroutine ufs_logfhour(msg,hour)
     real(dbl_kind),    intent(in)    :: hour
     character(len=char_len)          :: filename
     integer(int_kind)                :: nunit
-    write(filename,'(a,i3.3)')'log.ice.f',int(hour)
+    write(filename,'(a,i4.4)')'log.ice.f',int(hour)
     open(newunit=nunit,file=trim(filename))
     write(nunit,'(a)')'completed: cice'
     write(nunit,'(a,f10.3)')'forecast hour:',hour

And did the following tests:

  1. Restarted from the above run at fh=2040. The model completed 100d. So I don't think there is any model behaviour that is causing a problem somewhere near ~90 days.
  2. I made a run with CICE history turned off, and the model ran out to 100d. So the issue is definitely in the CICE history write.
  3. I made a run adding the following to the job_card (see https://convergecfd.com/getting-started-guide-v3/environment-configuration-erro.htm):
         export I_MPI_SHM_HEAP_VSIZE=16384
     and the model ran out to 100 days.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen Thanks for your investigation. Where should I adjust or add "I_MPI_SHM_HEAP_VSIZE" when running UFS in the workflow environment, i.e., in config.ice or config.fcst? Thanks.

@DeniseWorthen (Collaborator)

@ShanSunNOAA I don't know where to add this in the workflow; it is not an ice configuration specifically, and I don't know where the equivalent of the job card gets generated. For the RTs, I added it just after the OMP_STACKSIZE:

/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_2767165/dt900.shmheap/job_card

@junwang-noaa (Collaborator)

Another solution is to use Intel 2023.2.

https://community.intel.com/t5/Intel-MPI-Library/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/m-p/1370537

The Hera admins confirmed that Intel/2023.2.1 and impi/2023.2.1 are available on Hera; it's worth trying the new version of the Intel compiler to see if this issue is resolved.

@DeniseWorthen (Collaborator)

@junwang-noaa This requires a spack-stack to be built w/ these compilers, correct?

@junwang-noaa (Collaborator)

Yes, a spack-stack issue has been created.

@ShanSunNOAA (Collaborator, Author)

Forgot to mention that Jessica suggested adding "export I_MPI_SHM_HEAP_VSIZE=16384" in the HERA.env.
This has been working well for me, using the workflow to run UFS for up to 9 months. Thank you all.

@DeniseWorthen (Collaborator)

I updated the model by one commit (to the 1.6 SS) and used the updated intel compiler build provided in JCSDA/spack-stack#1147:

diff --git a/modulefiles/ufs_hera.intel.lua b/modulefiles/ufs_hera.intel.lua
index 993e0537..d27136ce 100644
--- a/modulefiles/ufs_hera.intel.lua
+++ b/modulefiles/ufs_hera.intel.lua
@@ -2,12 +2,12 @@ help([[
 loads UFS Model prerequisites for Hera/Intel
 ]])

-prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/Core")
+prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/ue-intel-2023.2.0/install/modulefiles/Core")

-stack_intel_ver=os.getenv("stack_intel_ver") or "2021.5.0"
+stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
 load(pathJoin("stack-intel", stack_intel_ver))

-stack_impi_ver=os.getenv("stack_impi_ver") or "2021.5.1"
+stack_impi_ver=os.getenv("stack_impi_ver") or "2021.10.0"
 load(pathJoin("stack-intel-oneapi-mpi", stack_impi_ver))

 cmake_ver=os.getenv("cmake_ver") or "3.23.1"

I re-ran the cpld c48 case w/o the export I_MPI_SHM_HEAP_VSIZE=16384. The model completed fh 2400, the same as the shmheap test case.

@junwang-noaa (Collaborator)

Great! So this does resolve the issue! Thanks for testing.

@ShanSunNOAA (Collaborator, Author)

@DeniseWorthen Thanks for testing this solution without increasing "I_MPI_SHM_HEAP_VSIZE". That is great news.

@DeniseWorthen (Collaborator)

DeniseWorthen commented Jun 28, 2024

I've deleted my previous comment about letting the intel 2023.2 test run as long as possible. I was using the old executable. I will try a clean test.

@DeniseWorthen (Collaborator)

I repeated my test using the 2023.2 intel w/o the heap_vsize variable and it completed fh3168 (132 days) before failing w/ the same error:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr

I'll try a second test adding the vsize variable.

@DeniseWorthen (Collaborator)

The 2023.2 intel case with the addition of the heap_vsize variable completed fh6816 (284 days) before timing out.

@junwang-noaa (Collaborator)

@DeniseWorthen Do we know if setting "I_MPI_SHM_HEAP_VSIZE=16384" only will allow the test case to run through 284 days?

@DeniseWorthen (Collaborator)

@junwang-noaa That was my next step. I will re-compile w/ the default SS16 and re-run the test.

@DeniseWorthen (Collaborator)

DeniseWorthen commented Jul 1, 2024

Note, the tests with intel 2023.2 also included the addition of a call to pio_syncfile(File) in CICE, after the history write:

index b8971a8..22ac8a7 100644
--- a/cicecore/cicedyn/infrastructure/io/io_pio2/ice_history_write.F90
+++ b/cicecore/cicedyn/infrastructure/io/io_pio2/ice_history_write.F90
@@ -1234,6 +1234,8 @@ subroutine ice_write_hist (ns)
       !-----------------------------------------------------------------
       call pio_seterrorhandling(File, PIO_INTERNAL_ERROR)

+      call pio_syncfile(File)
+
       call pio_freedecomp(File,iodesc2d)
       call pio_freedecomp(File,iodesc3dv)
       call pio_freedecomp(File,iodesc3dc)

I had tested this previously with SS16 and w/o the heap variable and it did not resolve the error. I will leave it in for consistency.

@DeniseWorthen (Collaborator)

I recompiled w/ the default SS16 modules and re-ran using only the heap_vsize variable, with the queue wall clock set to the maximum of 8 hours. In this case, the model was unstable and failed with:

0: FATAL from PE     0: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
 0:
42: Image              PC                Routine            Line        Source
42: fv3_s2s.dev.exe    0000000004FF02A7  Unknown               Unknown  Unknown
42: fv3_s2s.dev.exe    00000000042C6C09  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
42: fv3_s2s.dev.exe    000000000458EDE2  mpp_efp_mod_mp_mp         195  mpp_efp.F90
42: fv3_s2s.dev.exe    00000000044D56B1  mpp_domains_mod_m         143  mpp_global_sum.fh
42: fv3_s2s.dev.exe    0000000001E319CF  fv_grid_utils_mod        3077  fv_grid_utils.F90
42: fv3_s2s.dev.exe    0000000001E6BC73  fv_mapz_mod_mp_la         794  fv_mapz.F90

The failure occurred at fh2574 (107 days).

@DeniseWorthen (Collaborator)

Per the conversation in the Scalability meeting today, the previous discussion on locating memory leaks in the system is #779.

@DeniseWorthen (Collaborator)

Using the intel2023.2 compile w/o the heap size variable, I ran for 100 days with the ProfileMemory=true setting and plotted the VmPeak value upon exiting ModelAdvance (for the atm, this was after phase2) relative to the first value (i.e., vmpeak(t)/vmpeak(t=0)). I plotted every other PE for ATM (top), OCN (mid), and ICE (bottom):

[Three plots: relative VmPeak growth for ATM (top), OCN (middle), and ICE (bottom) PEs]

@DeniseWorthen (Collaborator)

I did two tests, one with the ATM C48 test and one w/ the C96 P8 test. The C48 atm test doesn't use MERRA and uses low-res fix files (all .grb). The C96 atm test uses MERRA and the tiled fix files. Both show the same jump on the 15th of the month. The C48 (left) and C96 (right):

[Plots: memory jump on the 15th of the month, C48 (left) and C96 (right)]

@XiaqiongZhou-NOAA (Contributor)

XiaqiongZhou-NOAA commented Jul 30, 2024

I have a similar problem on WCOSS2: the coupled forecasts always crash after reaching 135 days.
export I_MPI_SHM_HEAP_VSIZE=16384 is added in wcoss2.env but it does not help.
The following is the error message:
324344 nid003457.cactus.wcoss2.ncep.noaa.gov 0: in atmos_model update, fhzero= 6.00000000000000 fhour= 3240.000
324345 0.0000000E+00
324346 nid003457.cactus.wcoss2.ncep.noaa.gov 0: zeroing coupling accumulated fields at kdt= 9720
324347 zeroing coupling accumulated fields at kdt= 9720
324348 write out restart at n_atmsteps= 9719 seconds= 11664000
324349 integration length= 3239.667
324350 nid003457.cactus.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 2, n_atmsteps = 9719 time is 0.106315
324351 aft fcst run output time= 11664000 FBcount= 8 na= 9720
324352 nid004044.cactus.wcoss2.ncep.noaa.gov 373: MPICH Notice [Rank 373] [job id d90a271d-396d-409c-a11c-1a8f6d64b40b] [Mon Jul 29 20:29:51 2024] [nid004044] - Abort(7034 7535) (rank 373 in comm 0): Fatal error in PMPI_Allreduce: Other MPI error, error stack:
324353 PMPI_Allreduce(472)..................: MPI_Allreduce(sbuf=0x7ffe305bd28c, rbuf=0x7ffe305bd29c, count=1, datatype=MPI_INT, op=op=0x98000000, comm=comm=0xc4004298) failed
324354 PMPI_Allreduce(457)..................:
324355 MPIR_CRAY_Allreduce(478).............:
324356 MPIDI_CRAY_Setup_Shared_Mem_Coll(838):
324357 MPIR_Comm_dup_impl(51)...............:
324358 MPII_Comm_copy(544)..................:
324359 MPIR_Get_contextid_sparse_group(679).: Too many communicators (0/2048 free on this process; ignore_id=0)

@LarissaReames-NOAA

LarissaReames-NOAA commented Sep 12, 2024

As a new member of the SFS team, I was asked by @junwang-noaa to look into this issue. With the help of @DeniseWorthen I was able to get a few test runs in with memory profiling.

I first wanted to make sure this memory issue wasn't present in the hash that @ShanSunNOAA reported as working previously (45c8b2a). Because the module files in that branch are no longer available, I used the modules in the current develop branch, along with a test using the same intel2023.2.0 installation that Denise used above. The same memory behavior is present in the C48 ATM test at this commit with both compilers (2021 first, 2023 second), though they aren't identical.
[Plots: C48 ATM memory behavior, intel 2021 (first) and intel 2023 (second)]

These results made me think perhaps this is a library issue associated (somehow) with NetCDF reads. I decided to try a run of the regional_control ATM test, which isn't run long enough to read new climo files on the 15th of the month but does read lateral boundary condition files (also in NetCDF format) every 180 minutes. We see similar memory jumps for these runs at every LBC read. Again, intel2021 first, 2023 second.
[Plots: regional_control memory jumps at each LBC read, intel 2021 (first) and intel 2023 (second)]

Assuming this memory behavior is the source of the long-run failures, it looks like perhaps this is not a code issue, since @ShanSunNOAA was able to use this hash successfully in the past but with different modules (and perhaps on a non-Rocky8 OS on Hera). I have a C48 run going right now with the intel 2023 executable with a full forecast length of 3200 hours to see how far it makes it. I'll report back once it's complete, and will also open a new issue specific to these memory problems as they seem more pervasive than a single coding bug.

EDIT: Note I am using more than the standard allocation of 8 processes on a single node for the C48 tests to decrease runtime. I'll make sure to switch back to a single node when I check for long run failure.

@LarissaReames-NOAA

Based on the test from my previous comment, I decided to start investigating other sources of the Assertion Failed error that @ShanSunNOAA originally reported, by running S2S regression tests. After lots of unstable forecasts with cpld_control_c48, I've been able to run several tests using cpld_control_nowave_noaero_p8 at 8a5f711, the first hash after Shan's working version.

After confirming that the Assertion Failed error does not occur with Shan's working hash, I can confirm that it does occur at 8a5f711. When I set CICE_IO="NetCDF" instead of the default "PIO" (as used in the failed runs; introduced in WM PR #2145), with restart/history_format='hdf5' or 'default' in ice_in, the failure disappears and the model runs to the 8-hour max wall clock time. As additional tests, I ran again with CICE_IO="PIO" and restart/history_format="pnetcdf1" and "pnetcdf5", and both of these runs also fail. This pretty clearly points to either the PIO library or its implementation as the source of the Assertion Failed errors.
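
For reference, the ice_in entries involved look roughly like this (a sketch only; the namelist group name setup_nml is my assumption based on the standard CICE namelist layout):

    &setup_nml
      history_format = 'hdf5'   ! 'default' also ran; 'pnetcdf1' and 'pnetcdf5' reproduced the failure with PIO
      restart_format = 'hdf5'
    /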

I did one last test this afternoon to see whether these are accumulating errors, or whether something specific in the various ICs might be triggering it. I left the ICE restart frequency the same (every 12 hours) but doubled the ICE history frequency from every 6 hours to every 3 hours. The model failed in 60% of the time (FH 1210 vs FH 2010) after writing 504 files (404 history & 100 restart), whereas the 6-hr run wrote 501 files (335 history, 167 restart). This suggests there is some accumulating memory issue that is compounded every time an ICE history or restart file is written.

@DeniseWorthen (Collaborator)

Following today's infrastructure meeting, I set up the cpld_control_c48 test case on Gaea. I used dt_atmos=900 and set the CICE history write for hourly output. I turned off the restart writing for all components.

The model gave 675 hourly output files (28 days) before failing w/ the following. Rank 17 is one of the CICE ranks (petlist 16 19). This is similar to the message posted above for WCOSS2.

17: MPICH ERROR [Rank 17] [job id 135188771.0] [Mon Oct  7 17:37:30 2024] [c5n1294] - Abort(806965007) (rank 17 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
17: PMPI_Comm_split(513)................: MPI_Comm_split(comm=0xc400314e, color=1, key=0, new_comm=0x7ffe5bbb2d74) failed
17: PMPI_Comm_split(494)................:
17: MPIR_Comm_split_impl(268)...........:
17: MPIR_Get_contextid_sparse_group(610): Too many communicators (0/2048 free on this process; ignore_id=0)
17:
17: aborting job:
17: Fatal error in PMPI_Comm_split: Other MPI error, error stack:
17: PMPI_Comm_split(513)................: MPI_Comm_split(comm=0xc400314e, color=1, key=0, new_comm=0x7ffe5bbb2d74) failed
17: PMPI_Comm_split(494)................:
17: MPIR_Comm_split_impl(268)...........:
17: MPIR_Get_contextid_sparse_group(610): Too many communicators (0/2048 free on this process; ignore_id=0)
srun: error: c5n1294: task 17: Exited with exit code 255

@DeniseWorthen (Collaborator)

I repeated my test on Gaea using CICE hash 6449f40, which is prior to both PIO-related commits (7a4b95e and aca8357). I compiled w/ PIO and used the previous namelist options for pnetcdf. The model stopped with an error identical to the one I posted above. This indicates that whatever the cause of this failure is, it predates the recent updates to PIO in CICE.

@DeniseWorthen (Collaborator)

@ShanSunNOAA Just an update. The folks at the CICE Consortium were able to replicate this in standalone CICE and have committed a fix. We'll be updating CICE soon w/ the fix. See the associated CICE issue 94.

@ShanSunNOAA (Collaborator, Author)

ShanSunNOAA commented Nov 15, 2024 via email
