Sudden crashes with excessive SSH #55

Open
aekiss opened this issue Aug 14, 2023 · 12 comments

@aekiss
Contributor

aekiss commented Aug 14, 2023

MOM6-CICE6 1° configs are crashing after running for several weeks/months. Excessively large SSH appears in less than 1 day, without unusual wind stress - see
ACCESS-NRI/access-om3-configs#5 (comment)
ACCESS-NRI/access-om3-configs#5 (comment)

@aekiss
Contributor Author

aekiss commented Sep 22, 2023

The latest commit on the 1deg_jra55do_ryf branch of MOM6-CICE6 crashes after model date = 0001-10-12T00:00:00 with

WARNING from PE    31: Extreme surface sfc_state detected: i= 329 j= 194 lon=  48.500 lat=  26.524 x=  48.500 y=  26.524 D= 1.1806E+01 SSH= 1.0551E+01 SST= 2.5600E+01 SSS= 4.5001E+01 U-= 0.0000E+00 U+=-1.0853E-02 V-= 0.0000E+00 V+= 7.3564E-03

This is at the head of the Persian Gulf. This crash seems nearly identical to the previous test (same location and date, nearly the same SSH): ACCESS-NRI/access-om3-configs#5 (comment).
Run dir:
/home/156/aek156/payu/MOM6-CICE6-1deg_jra55do_ryf
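
For comparing crash reports across runs, it can help to turn the warning line into named fields. The following is a minimal sketch, not part of MOM6: the function name and the regular expression are my own, and it assumes the warning is available as a single unwrapped line.

```python
import re

# Sample line copied from the report above, joined onto a single line.
WARNING = (
    "WARNING from PE    31: Extreme surface sfc_state detected: "
    "i= 329 j= 194 lon=  48.500 lat=  26.524 x=  48.500 y=  26.524 "
    "D= 1.1806E+01 SSH= 1.0551E+01 SST= 2.5600E+01 SSS= 4.5001E+01 "
    "U-= 0.0000E+00 U+=-1.0853E-02 V-= 0.0000E+00 V+= 7.3564E-03"
)

# Matches "KEY= value" pairs such as "SSH= 1.0551E+01" or "U+=-1.0853E-02".
PAIR = re.compile(r"([A-Za-z]+[+-]?)=\s*(-?\d+(?:\.\d+)?(?:E[+-]\d+)?)")


def parse_sfc_state_warning(line):
    """Return the KEY= value pairs of one warning line as a dict of floats."""
    _, _, payload = line.partition("detected:")
    return {key: float(value) for key, value in PAIR.findall(payload)}


fields = parse_sfc_state_warning(WARNING)
print(fields["lon"], fields["lat"], fields["SSH"], fields["SSS"])
```

Parsing each report this way makes it straightforward to confirm that two crashes share the same `i`, `j` and differ only slightly in `SSH`.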

@aekiss
Contributor Author

aekiss commented Sep 22, 2023

Changing DTBT from -0.95 to -0.5 (roughly halving the barotropic timestep) makes no difference:

WARNING from PE    31: Extreme surface sfc_state detected: i= 329 j= 194 lon=  48.500 lat=  26.524 x=  48.500 y=  26.524 D= 1.1806E+01 SSH= 1.0589E+01 SST= 2.5610E+01 SSS= 4.5001E+01 U-= 0.0000E+00 U+=-1.0604E-02 V-= 0.0000E+00 V+= 6.4269E-03
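
On "roughly halving": as far as I understand the MOM6 parameter documentation, a negative DTBT is read as a fraction of the maximum stable barotropic timestep rather than as a value in seconds. A minimal sketch of that convention follows; `effective_dtbt` and `DTBT_MAX_STABLE` are hypothetical names, and the 40 s value is made up for illustration, not taken from this run.

```python
def effective_dtbt(dtbt, dtbt_max_stable):
    """Barotropic timestep in seconds implied by a DTBT setting.

    Assumed convention: a positive DTBT is a timestep in seconds, zero means
    "use the maximum stable timestep", and a negative DTBT is a fraction of
    the maximum stable timestep. Any further adjustment the model makes
    (for example to divide the baroclinic timestep evenly) is ignored here.
    """
    if dtbt > 0.0:
        return dtbt
    if dtbt == 0.0:
        return dtbt_max_stable
    return -dtbt * dtbt_max_stable


DTBT_MAX_STABLE = 40.0  # hypothetical value in seconds, for illustration only

before = effective_dtbt(-0.95, DTBT_MAX_STABLE)
after = effective_dtbt(-0.5, DTBT_MAX_STABLE)
print(f"ratio after/before = {after / before:.3f}")
```

Under this assumption the ratio is 0.5/0.95, about 0.53, independent of the actual stable limit.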

@aekiss
Contributor Author

aekiss commented Sep 22, 2023

Also crashes identically with the latest ACCESS-OM3 commit 377c1fc (unsurprising, as this just adds the GPTL timing library).

@aekiss
Contributor Author

aekiss commented Oct 30, 2023

The 1deg_jra55do_ryf and 1deg_jra55do_iaf configs of MOM6-CICE6 run happily with more lenient surface checks, using values from mom6-om4-025/MOM_input (RH column) instead of defaults (LH column):

| Variable | archive/output008/MOM_parameter_doc.all | archive/output009/MOM_parameter_doc.all |
| --- | --- | --- |
| bad_val_ssh_max | 20.0 | 50.0 |
| bad_val_sss_max | 45.0 | 75.0 |
| bad_val_sst_max | 45.0 | 55.0 |
| bad_val_sst_min | -2.1 | -3.0 |
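
A quick check of the values reported in the earlier warning against these limits suggests which bound is actually being hit. This is only a sketch: the comparison rules in `violations` (flag a value that reaches or exceeds a maximum, or falls below the minimum) are my assumption about how the check behaves, not copied from the MOM6 source.

```python
# Values reported in the 0001-10-12 warning earlier in this thread.
reported = {"SSH": 10.551, "SST": 25.600, "SSS": 45.001}

# Limits from the two MOM_parameter_doc.all columns in the table above.
defaults = {"bad_val_ssh_max": 20.0, "bad_val_sss_max": 45.0,
            "bad_val_sst_max": 45.0, "bad_val_sst_min": -2.1}
lenient = {"bad_val_ssh_max": 50.0, "bad_val_sss_max": 75.0,
           "bad_val_sst_max": 55.0, "bad_val_sst_min": -3.0}


def violations(state, limits):
    """List the limits a surface state breaks (assumed comparison rules)."""
    broken = []
    if abs(state["SSH"]) >= limits["bad_val_ssh_max"]:
        broken.append("bad_val_ssh_max")
    if state["SSS"] >= limits["bad_val_sss_max"]:
        broken.append("bad_val_sss_max")
    if state["SST"] >= limits["bad_val_sst_max"]:
        broken.append("bad_val_sst_max")
    if state["SST"] < limits["bad_val_sst_min"]:
        broken.append("bad_val_sst_min")
    return broken


print(violations(reported, defaults))  # ['bad_val_sss_max']
print(violations(reported, lenient))   # []
```

With these assumptions, the reported SSH of about 10.55 is well inside the default limit of 20.0, while the SSS of 45.001 sits just over the default limit of 45.0.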

aekiss added a commit to ACCESS-NRI/access-om3-configs that referenced this issue Oct 31, 2023
@ezhilsabareesh8
Contributor

The MOM6-CICE6-WWIII configuration crashes at the same location as MOM6-CICE6, after reaching MOM date 1/10/08 00:00:00. The more lenient SSH and SST limits mentioned above are not implemented in this configuration yet.

WARNING from PE 31: Extreme surface sfc_state detected: i= 329 j= 194 lon= 48.500 lat= 26.524 x= 48.500 y= 26.524 D= 1.1806E+01 SSH= 1.0461E+01 SST= 2.5886E+01 SSS= 4.5002E+01 U-= 0.0000E+00 U+=-4.0502E-02 V-= 0.0000E+00 V+=-2.2684E-02

aekiss added a commit to ACCESS-NRI/access-om3-configs that referenced this issue Feb 21, 2024
@aekiss
Contributor Author

aekiss commented Feb 22, 2024

Using more lenient checks from mom6-om4-025 allows the MOM6-CICE6 1° run to proceed for at least 2 years with no issues.

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/namelist-configuration-discussion-meeting/1917/9

@aekiss
Contributor Author

aekiss commented May 16, 2024

Maybe fixing this will help? #164

@ezhilsabareesh8
Contributor

ezhilsabareesh8 commented May 17, 2024

> Maybe fixing this will help? #164

Thanks @aekiss. I am currently seeing crashes in the MOM6-CICE6 1 deg IAF and RYF configs (main branch) after 3-4 months of runtime. Each failure appears to have a different cause (a few error logs are listed below).

Test experiment 1 - IAF 1deg

WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered
WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered
WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered

Image              PC                Routine            Line        Source
libpthread-2.28.s  00001540FACF5CF0  Unknown               Unknown  Unknown
access-om3-MOM6-C  00000000039D6EBC  diag_manager_mod_        3234  diag_manager.F90
access-om3-MOM6-C  00000000039BDF18  diag_manager_mod_        1466  diag_manager.F90
access-om3-MOM6-C  000000000350D06E  mom_diag_manager_         348  MOM_diag_manager_infra.F90
access-om3-MOM6-C  0000000003095BAE  mom_diag_mediator        1784  MOM_diag_mediator.F90
access-om3-MOM6-C  0000000003094202  mom_diag_mediator        1625  MOM_diag_mediator.F90
access-om3-MOM6-C  00000000035A1AAD  mom_dynamics_spli        1051  MOM_dynamics_split_RK2.F90
access-om3-MOM6-C  0000000002E49A33  mom_mp_step_mom_d        1173  MOM.F90
access-om3-MOM6-C  0000000002E4058B  mom_mp_step_mom_          853  MOM.F90
access-om3-MOM6-C  0000000002E1496D  mom_ocean_model_n         633  mom_ocean_model_nuopc.F90
access-om3-MOM6-C  0000000002D3505D  mom_cap_mod_mp_mo        1759  mom_cap.F90

Test experiment 2 - RYF 1 deg

WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered
WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered
WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread-2.28.s  00001482944AFCF0  Unknown               Unknown  Unknown
access-om3-MOM6-C  00000000037C6D16  mom_vert_friction        1713  MOM_vert_friction.F90
access-om3-MOM6-C  0000000003597CC4  mom_dynamics_spli         581  MOM_dynamics_split_RK2.F90
access-om3-MOM6-C  0000000002E49A33  mom_mp_step_mom_d        1173  MOM.F90
access-om3-MOM6-C  0000000002E4058B  mom_mp_step_mom_          853  MOM.F90
access-om3-MOM6-C  0000000002E1496D  mom_ocean_model_n         633  mom_ocean_model_nuopc.F90
access-om3-MOM6-C  0000000002D3505D  mom_cap_mod_mp_mo        1759  mom_cap.F90
access-om3-MOM6-C  00000000020A73BF  _ZNK5ESMCI13Metho         377  ESMCI_MethodTable.C
access-om3-MOM6-C  00000000020A7338  _ZN5ESMCI11Method         563  ESMCI_MethodTable.C
access-om3-MOM6-C  00000000020A5DBB  c_esmc_methodtabl         317  ESMCI_MethodTable.C
access-om3-MOM6-C  0000000000DFD539  esmf_attachmethod        1287  ESMF_AttachMethods.F90
access-om3-MOM6-C  0000000004B83C92  nuopc_modelbase_m        2212  NUOPC_ModelBase.F90
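
To compare tracebacks like these quickly, one option is to pull out the innermost frame that has a known source line. The sketch below is my own helper, not an Intel or MOM6 tool; the regular expression assumes the five-column `Image PC Routine Line Source` layout shown above, and the routine names stay truncated exactly as printed.

```python
import re

# The first three frames of each traceback above, copied verbatim.
IAF_TRACE = """\
libpthread-2.28.s  00001540FACF5CF0  Unknown               Unknown  Unknown
access-om3-MOM6-C  00000000039D6EBC  diag_manager_mod_        3234  diag_manager.F90
access-om3-MOM6-C  00000000039BDF18  diag_manager_mod_        1466  diag_manager.F90
"""
RYF_TRACE = """\
libpthread-2.28.s  00001482944AFCF0  Unknown               Unknown  Unknown
access-om3-MOM6-C  00000000037C6D16  mom_vert_friction        1713  MOM_vert_friction.F90
access-om3-MOM6-C  0000000003597CC4  mom_dynamics_spli         581  MOM_dynamics_split_RK2.F90
"""

# Columns: image, program counter, routine, line number (or "Unknown"), source.
FRAME = re.compile(r"^(\S+)\s+([0-9A-F]{16})\s+(\S+)\s+(\d+|Unknown)\s+(\S+)$")


def first_known_frame(trace):
    """Return (source, line, routine) of the innermost frame with a line number."""
    for raw in trace.splitlines():
        match = FRAME.match(raw.strip())
        if match and match.group(4) != "Unknown":
            return match.group(5), int(match.group(4)), match.group(3)
    return None


print(first_known_frame(IAF_TRACE))
print(first_known_frame(RYF_TRACE))
```

For the two runs above this points at `diag_manager.F90` and `MOM_vert_friction.F90` respectively. Note that a trace from a process killed by SIGTERM may only show where that rank happened to be when the job was terminated, so it is not necessarily the origin of the failure.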

@aekiss
Contributor Author

aekiss commented May 21, 2024

@ezhilsabareesh8 is this crashing even with more lenient checks?

@ezhilsabareesh8
Contributor

> @ezhilsabareesh8 is this crashing even with more lenient checks?

Thanks @aekiss. With the recent changes of setting Z_INIT_REMAP_GENERAL = True and MAX_DELTA_SRESTORE = 999.0, the 1-degree MOM6-CICE6 IAF configuration is now running for 3 years without crashing, even without lenient checks.
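
For readers unfamiliar with MAX_DELTA_SRESTORE: as I understand it, this parameter caps the magnitude of the salinity difference used in the surface salinity restoring term, so a value of 999.0 effectively removes the cap. A minimal sketch of that clamping under this assumption; the function name, the salinities, and the 0.5 cap are hypothetical illustration values, not settings from these configs.

```python
import math


def restoring_delta(sss_target, sss_model, max_delta_srestore):
    """Salinity difference fed to restoring, limited in magnitude (assumed behaviour)."""
    delta = sss_target - sss_model
    return math.copysign(min(abs(delta), max_delta_srestore), delta)


# Hypothetical case: model SSS has drifted 5 units above the restoring target.
print(restoring_delta(40.0, 45.0, 0.5))    # -0.5 (capped)
print(restoring_delta(40.0, 45.0, 999.0))  # -5.0 (effectively uncapped)
```

With a large cap, the restoring term responds to the full salinity error rather than a truncated one.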

@ezhilsabareesh8
Contributor

ezhilsabareesh8 commented May 22, 2024

> Test experiment 2 - RYF 1 deg
> forrtl: error (78): process killed (SIGTERM)
> Image PC Routine Line Source
> libpthread-2.28.s 00001482944AFCF0 Unknown Unknown Unknown
> access-om3-MOM6-C 00000000037C6D16 mom_vert_friction 1713 MOM_vert_friction.F90
> access-om3-MOM6-C 0000000003597CC4 mom_dynamics_spli 581 MOM_dynamics_split_RK2.F90

The RYF 1-degree MOM6-CICE6 configuration still crashes with the above error. However, there are significant differences between the MOM_input files of the IAF and RYF configurations, which may explain why RYF crashes but IAF does not. The IAF MOM_input is outdated and needs to be updated.

| variable | MOM_input_one_deg_RYF | MOM_input_one_deg_IAF |
| --- | --- | --- |
| adjust_net_srestore_to_zero | | TRUE |
| ah_vel_scale | | 0 |
| bbl_use_eos | | TRUE |
| bt_thick_scheme | | FROM_BT_CONT |
| cfc_bc_file | | cfc_atm_20230310.nc |
| coord_config | | none |
| debug | | FALSE |
| default_2018_answers | | FALSE |
| depth_scaled_khth | | FALSE |
| energysavedays | | 1 |
| eqn_of_state | | WRIGHT |
| fatal_unused_params | TRUE | |
| fix_ustar_gustless_bug | | TRUE |
| gill_equatorial_ld | | TRUE |
| grid_rotation_angle_bugs | | FALSE |
| hmix_min | | 2 |
| int_tide_decay_scale | | 300.3003003003003 |
| interp_type2 | | LMD94 |
| interpolate_res_fn | | FALSE |
| kappa_shear_all_layer_tke_bug | | FALSE |
| kappa_shear_iter_bug | | FALSE |
| kdml | | 0 |
| kh_vel_scale | | 0 |
| khth | | 0 |
| khth_max | | 0 |
| khtr_max | | 0 |
| mask_srestore_under_ice | | FALSE |
| max_ent_it | | 20 |
| max_rino_it | | 25 |
| maxtrunc | | 0 |
| min_salinity | | 0 |
| nihalo | | 4 |
| njhalo | | 4 |
| prandtl_turb | | 1 |
| remap_uv_using_old_alg | | FALSE |
| simple_tke_to_kd | | TRUE |
| smag_bi_const | | 0.06 |
| tolerance_ent | | 1e-05 |
| topo_file | | topog.nc |
| use_cfc_cap | | FALSE |
| use_contemp_abssal | | FALSE |
| use_gm_work_bug | | FALSE |
| use_land_mask_for_hvisc | | TRUE |
| use_psurf_in_eos | | TRUE |
| visc_res_scale_coef | | 0.4 |
| z_init_file_salt_var | | salt |
| z_init_remap_old_alg | | FALSE |
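
A comparison like the one tabulated above could be produced with a small script. The following is a rough sketch that assumes plain `NAME = value ! comment` lines; it does not handle `#override` prefixes, nested parameter blocks, or `!` characters inside quoted values. The helper names and the two short excerpts are made up for illustration and are not the real MOM_input files.

```python
def parse_mom_input(text):
    """Parse 'NAME = value ! comment' lines into {lowercase_name: value}."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip().lower()] = value.strip()
    return params


def diff_params(a, b):
    """Return {name: (value_in_a, value_in_b)} where they differ; None = absent."""
    return {name: (a.get(name), b.get(name))
            for name in sorted(set(a) | set(b))
            if a.get(name) != b.get(name)}


# Tiny made-up excerpts standing in for the two MOM_input files.
ryf = parse_mom_input("""
FATAL_UNUSED_PARAMS = True   ! stop on unused parameters
DT = 3600.0
""")
iaf = parse_mom_input("""
DT = 3600.0
EQN_OF_STATE = "WRIGHT"      ! equation of state
MAXTRUNC = 0
""")

for name, (in_ryf, in_iaf) in diff_params(ryf, iaf).items():
    print(f"{name:24s} {in_ryf!s:8s} {in_iaf!s}")
```

Parameters that appear in only one file show up with `None` in the other column, which corresponds to the empty cells in the table.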
