model crashed at day 91.75 when compiled with S2S #2320
Comments
@DeniseWorthen May I ask if there is any restriction on CICE file names for long forecast times? |
@junwang-noaa This doesn't seem to be an issue related to filename length. Files are always named as … Do you have the associated …? |
@DeniseWorthen @junwang-noaa Thanks for the quick responses from both of you. I think I found out why: it is this line in cice_wrapper_mod.F90: write(filename,'(a,i3.3)') 'log.ice.f', int(hour). Thanks! |
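(For illustration, a minimal standalone sketch of the problem; the hour value and the i0 alternative are my assumptions, not the committed fix.)

```fortran
program fmt_overflow
  implicit none
  character(len=64) :: filename

  ! With the fixed 'i3.3' width, any hour above 999 no longer fits in the
  ! field, so the Fortran runtime writes asterisks, e.g. 'log.ice.f***'.
  write(filename,'(a,i3.3)') 'log.ice.f', 2202   ! day 91.75 ~ forecast hour 2202
  print *, trim(filename)                        ! prints log.ice.f***

  ! One possible fix (hypothetical): let the width grow with the value.
  write(filename,'(a,i0)') 'log.ice.f', 2202
  print *, trim(filename)                        ! prints log.ice.f2202
end program fmt_overflow
```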
It might be the cause. I saw a file 'log.ice.f***' in the run directory. We may still run into problems for climate runs if using "i4.4". In fv3atm, we use:
to get the number of digits for forecast hours (the sketch after this comment illustrates the idea). |
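(A hedged sketch of what such a digit-count formula might look like; the actual fv3atm lines are not shown above, so the names here are hypothetical.)

```fortran
program digits_demo
  implicit none
  character(len=16) :: fmt
  character(len=64) :: filename
  integer :: hour, ndig

  hour = 2202
  ! number of digits in the hour, with a floor of 3 so short hours stay zero-padded
  ndig = max(3, floor(log10(real(max(hour, 1)))) + 1)
  write(fmt,'(a,i0,a,i0,a)') '(a,i', ndig, '.', ndig, ')'   ! builds '(a,i4.4)'
  write(filename, fmt) 'log.ice.f', hour
  print *, trim(filename)                                    ! prints log.ice.f2202
end program digits_demo
```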
@junwang-noaa Thanks for the info. FV3 uses "max" for the forecast hours so the model won't crash. Can we borrow the same formula to get the forecast hour in CICE? |
Both ATM and CICE write to a specific file name. In MOM, this was just added as a write to stdout. We should probably fix that. For MOM and CICE, we'd need to "find" the number of hours from the driver clock since currently fhmax is only in model_configure. |
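(For what it's worth, a minimal sketch, an assumption rather than the actual cap code, of how the elapsed forecast hour could be recovered from the ESMF driver clock without reading fhmax from model_configure.)

```fortran
subroutine get_forecast_hour(clock, fh, rc)
  ! Hypothetical helper: derive the elapsed forecast hour from the driver clock.
  use ESMF
  implicit none
  type(ESMF_Clock), intent(in)  :: clock
  integer,          intent(out) :: fh, rc
  type(ESMF_Time)               :: currTime, startTime
  type(ESMF_TimeInterval)       :: elapsed

  call ESMF_ClockGet(clock, currTime=currTime, startTime=startTime, rc=rc)
  elapsed = currTime - startTime
  call ESMF_TimeIntervalGet(elapsed, h=fh, rc=rc)
end subroutine get_forecast_hour
```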
@ShanSunNOAA Unless you're actually compiling runtime statistics (i.e., the time for each model advance), I would just set RunTimeLog = false in the |
@DeniseWorthen Thanks for your info. I cannot find RunTimeLog in the log file generated by workflow. Where do I set it in the workflow configuration? Thanks! |
@DeniseWorthen In FV3ATM, we compute the number of digits from the forecast hour, not fhmax. Can we use "hour" in ufs_logfhour to compute the format of fhour in the log file name? Also, it looks to me that runtimelog is not used to control writing the CICE log file; I saw the code in nuopc/cmeps/CICE_RunMod.F90.
|
Unfortunately, after modifying the log.ice filename, the model still crashed at day 91.75 with an error (see /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/logs/2015052100/gfsfcst.log). Any suggestions on how to run beyond 3 months would be greatly appreciated. How do I turn off RunTimeLog? Thanks. |
@ShanSunNOAA Sorry for the confusion. Jun is right. The feature for runtime logging as well as the logging to mark the end of the history write was put in at the same time and I didn't think through my answer. I don't believe CICE has a concept of hours being anything other than 1-24. Let me take a look. In any case, it doesn't seem that is the issue, since your new run failed also. If it is still pointing to the ice_pio line, then I still suspect it has something to do w/ how pio is trying to write the file. |
@ShanSunNOAA I've been able to reproduce the error using the low-resolution C48 coupled model in the RTs. So in this case, there are no sym-linked files, which takes one issue out of the mix. In this case, it failed at day 86.
Jun suggested a couple of further tests. I'll keep you updated. |
@DeniseWorthen Thank you for your prompt response and the clever approach. Please let me know if there is anything I can assist with. |
I added the following to the cice_wrapper
And did the following tests:
and the model ran out to 100 days. |
@DeniseWorthen Thanks for your investigation. Where do I adjust or add "I_MPI_SHM_HEAP_VSIZE" when running UFS in the workflow environment, i.e., config.ice or config.fcst? Thanks. |
@ShanSunNOAA I don't know where to add this in the workflow. It is not an ice configuration specifically. I don't know where the equivalent of the job card gets generated. For the RTs, I added it just after the OMP_STACKSIZE
|
Another solution is to use Intel 2023.2. The Hera admins confirmed that Intel/2023.2.1 and impi/2023.2.1 are available on Hera; it's worth trying the new version of the Intel compiler to see whether this issue is resolved. |
@junwang-noaa This requires a spack-stack to be built w/ these compilers, correct? |
Yes, a spack-stack issue is created. |
Forgot to mention that Jessica suggested adding "export I_MPI_SHM_HEAP_VSIZE=16384" in the HERA.env. |
I updated the model by one commit (to the 1.6 SS) and used the updated intel compiler build provided in JCSDA/spack-stack#1147
I re-ran the cpld c48 case w/o the |
Great! So this does resolve the issue! Thanks for testing. |
@DeniseWorthen Thanks for testing this solution without increasing "I_MPI_SHM_HEAP_VSIZE". That is great news. |
I've deleted my previous comment about letting the intel 2023.2 test run as long as possible. I was using the old executable. I will try a clean test. |
I repeated my test using the 2023.2 intel w/o the heap_vsize variable and it completed fh3168 (132 days) before failing w/ the same error
I'll try a second test adding the vsize variable. |
The 2023.2 intel case with the addition of the heap_vsize variable completed fh6816 (284 days) before timing out. |
@DeniseWorthen Do we know if setting "I_MPI_SHM_HEAP_VSIZE=16384" only will allow the test case to run through 284 days? |
@junwang-noaa That was my next step. I will re-compile w/ the default SS16 and re-run the test. |
Note, the tests with the intel 2023.2 also included the addition of a
I had tested this previously with SS16 and w/o the heap variable and it did not resolve the error. I will leave it in for consistency. |
I recompiled w/ the default SS16 modules and re-ran using only the heap_vsize variable, with the queue wall clock set to the maximum of 8 hours. In this case, the model was unstable and failed with
The failure occurred at fh2574 (107 days). |
Per the conversation in the Scalability meeting today, the previous discussion on locating memory leaks in the system is #779 |
I have a similar problem on WCOSS2. The coupled forecasts always crash after reaching 135 days. |
As a new member of the SFS team, I've been asked by @junwang-noaa to look into this issue. With the help of @DeniseWorthen I was able to get a few test runs in with memory profiling. I first wanted to make sure this memory issue wasn't present in the hash that @ShanSunNOAA reported as working previously (45c8b2a). Because the module files in that branch are no longer available, I used the modules in the current develop branch along with a test using the same intel2023.2.0 installation that Denise used above. The same memory behavior is present in the C48 ATM test at this commit with both compilers (2021 first, 2023 second), though they aren't identical.

These results made me think perhaps this is a library issue associated (somehow) with NetCDF reads. I decided to try a run of the regional_control ATM test, which isn't run long enough to read new climo files on the 15th of the month, but which does read lateral boundary condition files (also in NetCDF format) every 180 minutes. We see similar memory jumps for these runs at every LBC read. Again, intel2021 first, 2023 second.

Assuming this memory behavior is the source of the long-run failures, it looks like this is perhaps not a code issue, given that @ShanSunNOAA was able to use this hash successfully in the past but with different modules (and perhaps on a non-Rocky8 OS on Hera). I have a C48 run going right now with the intel 2023 executable with a full forecast length of 3200 hours to see how far it makes it. I'll report back on that once it's complete and will also open a new issue specific to these memory problems, as they seem more pervasive than a single coding bug.

EDIT: Note I am using more than the standard allocation of 8 processes on a single node for the C48 tests to decrease runtime. I'll make sure to switch back to a single node when I check for long-run failure. |
Based on the test from my previous comment, I decided to start investigating other sources of the Assertion Failed message that @ShanSunNOAA originally reported, by running S2S regression tests. After lots of unstable forecasts with cpld_control_c48, I've been able to run several tests using cpld_control_nowave_noaero_p8 at #8a5f711, the first hash after Shan's working version. After confirming that the Assertion Failed error does not occur with Shan's working hash (it does not), I can confirm that the Assertion Failed issues do occur at #8a5f711.

When I set CICE_IO = "NetCDF" instead of the default "PIO" (as used in the failed runs, and as introduced in WM PR #2145), and restart/history_format='hdf5' or 'default' in ice_in, the failure disappears and the model runs to the 8-hour max wall clock time. As additional tests, I ran again with CICE_IO="PIO" and restart/history_format="pnetcdf1" and "pnetcdf5", and both of these runs also fail. This pretty clearly points to either the PIO library or its implementation as the source of the Assertion Failed errors.

I did one last test this afternoon to see whether these are accumulating errors, or whether something specific in various ICs might be triggering it somehow. I left the ICE restart frequency the same (every 12 hours) but doubled the ICE history frequency from every 6 to every 3 hours. The model failed in 60% of the time (1210 FHR vs 2010 FHR) after writing 504 files (404 history & 100 restart), whereas the 6-hr frequency wrote 501 files (335 history, 167 restart). This suggests that there is some accumulating memory issue that is compounded every time an ICE history or restart file is written. |
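(For reference, a hedged sketch of the ice_in entries involved in this test matrix, assuming the usual CICE setup_nml names; the passing runs used the NetCDF build of CICE_IO with these values, the failing ones the PIO build with the pnetcdf formats.)

```fortran
&setup_nml
  ! passing: CICE_IO=NetCDF with 'hdf5' or 'default'
  ! failing: CICE_IO=PIO with 'pnetcdf1' or 'pnetcdf5'
  history_format = 'hdf5'
  restart_format = 'hdf5'
/
```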
Following today's infrastructure meeting, I set up the cpld_control_c48 test case on Gaea. I used dt_atmos=900 and set the CICE history write for hourly output. I turned off the restart writing for all components. The model produced 675 hourly output files (28 days) before failing with the following. Rank 17 is one of the CICE ranks (petlist 16-19). This is similar to the message posted above for WCOSS2.
|
I repeated my test on Gaea using CICE hash 6449f40, which is prior to both PIO-related commits (7a4b95e and aca8357). I compiled w/ PIO and used the previous namelist options for using pnetcdf. The model stopped with an error identical to the one I posted above. This indicates that whatever the cause of this failure is, it predates the recent updates to PIO in CICE. |
@ShanSunNOAA Just an update. The folks at CICE Consortium were able to replicate this in the standalone CICE and were able to commit a fix. We'll be updating CICE soon w/ the fix. See the associated CICE issue 94. |
Good to know that. Thanks for the update!
Shan
|
Description
The UFS model crashed at time step 13320 with the following error:
341: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
... ...
345: ufs_model.x 0000000005CA5A13 ice_pio_mp_ice_pi 221 ice_pio.F90
345: ufs_model.x 0000000005C85EFE ice_history_write 173 ice_history_write.F90
345: ufs_model.x 0000000005A1225A ice_history_mp_ac 4130 ice_history.F90
345: ufs_model.x 0000000005C7B813 cice_runmod_mp_ci 369 CICE_RunMod.F90
345: ufs_model.x 00000000059BED10 ice_comp_nuopc_mp 1179 ice_comp_nuopc.F90
345: ufs_model.x 00000000006A1B08 Unknown Unknown Unknown
see details in /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log
To Reproduce:
The experiment used the following hash of https://github.com/ufs-community/ufs-weather-model
commit 5bec704
Author: Brian Curtis [email protected]
Date: Fri May 31 14:52:06 2024 -0400
Additional context
The model was able to run 9 months successfully with an earlier hash in April:
commit 45c8b2a
Author: JONG KIM [email protected]
Date: Thu Apr 4 19:49:13 2024 -0400
Output
The log containing the error message is available at /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log. I'd be happy to provide more details if needed. Thanks.