
Update damping and time step parameters #2363

Merged · 37 commits · Sep 13, 2024

Conversation

dpsarmie
Collaborator

@dpsarmie dpsarmie commented Jul 11, 2024

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR changes the damping and time step parameters to updated values that were developed for Global Workflow (NOAA-EMC/global-workflow#2575). These values should improve reliability and allow longer time steps to be used in the model. A new section was introduced in default_vars.sh because the parameter values are resolution dependent.
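As a rough illustration of what a resolution-dependent section in default_vars.sh could look like (a hypothetical sketch, not the actual file contents: the function name and the fallback branch are assumptions; the C96 values match those quoted later in this thread):

```shell
# Hypothetical sketch only -- not the actual default_vars.sh contents.
# Selects time step / dycore substepping defaults by cubed-sphere resolution.
# The C96 values (DELTIM=600, K_SPLIT=1, N_SPLIT=4) are taken from this
# discussion; the error branch is illustrative.
set_res_defaults() {
  case "$1" in
    C96)
      export DELTIM=600 K_SPLIT=1 N_SPLIT=4
      ;;
    *)
      echo "no defaults defined for resolution: $1" >&2
      return 1
      ;;
  esac
}
```

A test script would then call something like `set_res_defaults C96` before writing the namelist.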

Commit Message:

* UFSWM - Update default parameter values for damping based on work done for GW

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • PR Updates/Changes Baselines.

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

dpsarmie and others added 2 commits July 11, 2024 13:27
This commit changes the damping parameters in default_vars.sh
and global_control.nml.IN to reflect new optimal values that
are going to be used in the global workflow.
@dpsarmie added the "Baseline Updates" label (Current baselines will be updated.) on Jul 11, 2024
@dpsarmie
Collaborator Author

There are some reliability issues with testing on Hercules (conus13km_debug_2threads_intel was not generating files consistently). I will do more testing on Hercules and test on Hera to see whether there are similar issues.

I will also document changes to run times if significant.

K_SPLIT variable in default_vars will need to be moved/cleaned up.

Added quotes to fix SuperLinter issue.
@dpsarmie dpsarmie self-assigned this Jul 11, 2024
@XiaqiongZhou-NOAA
Contributor

Could you also add the corresponding changes related to NOAA-EMC/global-workflow#2751?

@CatherineThomas-NOAA

@XiaqiongZhou-NOAA : There have been problems with the omega PR #2327 in other applications and it's being reverted. Take a look at PR #2362 and Issue #346.

@XiaqiongZhou-NOAA
Contributor

XiaqiongZhou-NOAA commented Jul 15, 2024

> @XiaqiongZhou-NOAA : There have been problems with the omega PR #2327 in other applications and it's being reverted. Take a look at PR #2362 and Issue #346.

I assume PR #2362 will be merged soon, and Issue #346 already has a fix (NOAA-GFDL/GFDL_atmos_cubed_sphere#349).

@dpsarmie
Collaborator Author

@XiaqiongZhou-NOAA Yes, I can roll the omega calc update into this PR also. There are still a handful of tests that are failing with the new damping parameters that I need to look into. If I can't resolve it soon, then I'll just make a separate PR for the omega calc.

@CatherineThomas-NOAA

Thanks for the correction @XiaqiongZhou-NOAA, I misunderstood the state of things.

@dpsarmie
Collaborator Author

A quick update on this PR:

The new parameters are causing multiple issues with the regression tests. I'm compiling a list of the affected runs and will update it later today. While troubleshooting, I found that if I revert K_SPLIT to its original value, the crashes are resolved.

It also seems that some of the runs have longer runtimes with the new parameter values (using the default K_SPLIT), but I'm trying to replicate the issue to verify.

Either way, the omega calculation has been added to a separate PR (#2373) while this issue is addressed.

I'll post another update in here with a Google Doc documenting the affected tests.

@yangfanglin
Collaborator

New damping settings should only be applied to the 127-layer global model RTs.

@junwang-noaa
Collaborator

@yangfanglin The global tests that Dan is testing are running with L127 levels.

@yangfanglin
Collaborator

yangfanglin commented Jul 19, 2024

Which RTs are failing, and on what platforms? It might be worthwhile to run the global-workflow CI tests mentioned in NOAA-EMC/global-workflow#2575 and cross-check the namelist settings.

@dpsarmie
Collaborator Author

So these are the results I've compiled so far: Google Docs link

K_SPLIT and DELTIM seem to be the two variables causing the most issues.

The common error in the cpld cases is the following:
NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability

I'm going through the other test logs to see what the issues are, because some of the failures are caused by exceeding the allotted wall clock time.

@yangfanglin: Yes, I agree that we should look at the input.nml files to make sure that they are identical before diving deeper into the errors. Do you happen to have a copy of the GW CI test input.nml available? These tests were all run on Hera but I'm seeing similar issues on Hercules.

dpsarmie and others added 5 commits July 24, 2024 13:46
Adding parameters to default_vars.sh and changing two
RTs for testing.
This commit adds a new export_ugwpv1 function to default_vars.sh
that assigns the new parameter values to tests that are currently
using version 1 of UGWP.
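The helper described in this commit message might look roughly like the following (a minimal sketch; the specific variables an export_ugwpv1 function would set are not shown in this thread, so the flags below are assumptions for illustration only):

```shell
# Hypothetical sketch of an export_ugwpv1-style helper for default_vars.sh.
# The flags set here are assumptions, not the actual function body; the real
# function assigns the new parameter values to tests running UGWP version 1.
export_ugwpv1() {
  export DO_UGWP_V1=.true.    # assumed: enable UGWP v1
  export DO_UGWP_V0=.false.   # assumed: disable UGWP v0
}
```

A test configuration would call `export_ugwpv1` after the generic defaults so that the v1-specific values override them.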
@dpsarmie
Collaborator Author

dpsarmie commented Aug 9, 2024

I've added changes to default_vars.sh and am currently running one last test that cleans up some of the variable definitions in the regression tests (there were repeated variables). I am still unable to get a stable run with the reduced value of k_split. I've generated a namelist from GW and tried to emulate it, without success.

I would like to ask whether there is a namelist that uses these updated values for a resolution other than C48. For now, I've left the default k_split. From what I've read, k_split should only affect stability and doesn't change the core physics of the system.

We should also discuss how many tests we want to move over to v1. There are multiple tests using v0, others using new HR4 variables, etc. For the RTs, I think we should keep some older configurations as long as they are still supported by UFS.

Clean up some duplicated variables in the UGWPv1 RTs.
Other RTs will also need to be cleaned up.
@junwang-noaa
Collaborator

@yangfanglin Can your group help with this? The configurations were updated in the global workflow, but they fail in the UFS WM RTs.

@XiaqiongZhou-NOAA
Contributor

> I've added changes to default_vars.sh and am currently running one last test that cleans up some of the variable definitions in the regression tests (there were repeated variables). I am still unable to get a stable run with the reduced value of k_split. I've generated a namelist from GW and tried to emulate it, without success.
>
> I would like to ask whether there is a namelist that uses these updated values for a resolution other than C48. For now, I've left the default k_split. From what I've read, k_split should only affect stability and doesn't change the core physics of the system.
>
> We should also discuss how many tests we want to move over to v1. There are multiple tests using v0, others using new HR4 variables, etc. For the RTs, I think we should keep some older configurations as long as they are still supported by UFS.

Can you list all the n_split, k_split, and deltim values and the resolutions? Generally we cannot just increase k_split; we also have to consider deltim/(k_split*n_split).

@XiaqiongZhou-NOAA
Contributor

> @yangfanglin Can your group help with this? The configurations were updated in the global workflow, but they fail in the UFS WM RTs.

Fanglin is in training. I believe @KevinViner-NOAA suggested this update.

@dpsarmie
Collaborator Author

dpsarmie commented Aug 12, 2024

>> I've added changes to default_vars.sh and am currently running one last test that cleans up some of the variable definitions in the regression tests (there were repeated variables). I am still unable to get a stable run with the reduced value of k_split. I've generated a namelist from GW and tried to emulate it, without success.
>> I would like to ask whether there is a namelist that uses these updated values for a resolution other than C48. For now, I've left the default k_split. From what I've read, k_split should only affect stability and doesn't change the core physics of the system.
>> We should also discuss how many tests we want to move over to v1. There are multiple tests using v0, others using new HR4 variables, etc. For the RTs, I think we should keep some older configurations as long as they are still supported by UFS.
>
> Can you list all the n_split, k_split, and deltim values and the resolutions? Generally we cannot just increase k_split; we also have to consider deltim/(k_split*n_split).

We can use a C96 run as an example: n_split is 4, k_split is 1, and deltim is 600. That should match what is in the GW config (https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/gfs/config.ufs#L132) and what is being set in default_vars.sh (https://github.com/dpsarmie/ufs-weather-model/blob/fix/updateDampingParm/tests/default_vars.sh#L944).

I was able to get the C48 runs converted over to the GW parameters, and all other runs at C96 and finer are stable if k_split is set to its default value (e.g. k_split = 2 for C96 runs).

Thank you for any help or information you might be able to provide.

@XiaqiongZhou-NOAA
Contributor

XiaqiongZhou-NOAA commented Aug 12, 2024

>>> I've added changes to default_vars.sh and am currently running one last test that cleans up some of the variable definitions in the regression tests (there were repeated variables). I am still unable to get a stable run with the reduced value of k_split. I've generated a namelist from GW and tried to emulate it, without success.
>>> I would like to ask whether there is a namelist that uses these updated values for a resolution other than C48. For now, I've left the default k_split. From what I've read, k_split should only affect stability and doesn't change the core physics of the system.
>>> We should also discuss how many tests we want to move over to v1. There are multiple tests using v0, others using new HR4 variables, etc. For the RTs, I think we should keep some older configurations as long as they are still supported by UFS.
>>
>> Can you list all the n_split, k_split, and deltim values and the resolutions? Generally we cannot just increase k_split; we also have to consider deltim/(k_split*n_split).
>
> We can use a C96 run as an example: n_split is 4, k_split is 1, and deltim is 600. That should match what is in the GW config (https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/gfs/config.ufs#L132) and what is being set in default_vars.sh (https://github.com/dpsarmie/ufs-weather-model/blob/fix/updateDampingParm/tests/default_vars.sh#L944).
>
> I was able to get the C48 runs converted over to the GW parameters, and all other runs at C96 and finer are stable if k_split is set to its default value (e.g. k_split = 2 for C96 runs).
>
> Thank you for any help or information you might be able to provide.

If k_split=2, n_split=4, and deltim=600, this decreases the dynamic (acoustic) time step, deltim/(k_split*n_split), too much. The new setting (k_split=1, n_split=4, deltim=600) should be dynamically stable. The crash is probably due to other issues; simply decreasing the dynamic time step is not a good solution.
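The arithmetic behind this point: the acoustic (dynamic) time step is deltim/(k_split*n_split), so the two C96 settings discussed above work out as follows (the helper function is illustrative, integer seconds assumed):

```shell
# Acoustic time step = deltim / (k_split * n_split), per the discussion above.
# Usage: acoustic_step <deltim> <k_split> <n_split>
acoustic_step() { echo $(( $1 / ($2 * $3) )) ; }

acoustic_step 600 2 4   # default k_split=2 -> 75 s acoustic substeps
acoustic_step 600 1 4   # proposed k_split=1 -> 150 s acoustic substeps
```

So reverting k_split from 1 to 2 halves the acoustic substep, which masks instability at the cost of roughly doubling the dycore work per physics step.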

@dpsarmie
Collaborator Author

dpsarmie commented Aug 12, 2024

> If k_split=2, n_split=4, and deltim=600, this decreases the dynamic (acoustic) time step, deltim/(k_split*n_split), too much. The new setting (k_split=1, n_split=4, deltim=600) should be dynamically stable. The crash is probably due to other issues; simply decreasing the dynamic time step is not a good solution.

Agreed. I'm also working on unifying the namelist parameters between the RTs and GW, so having two different settings wouldn't be optimal.
Is there a namelist from a current run available for a resolution other than C48? I generated a C48/C96 GW namelist, but it might be beneficial to look at a working one in case I'm doing something wrong on my end. Also, which hash of the UFS model is GW currently working with?

@FernandoAndrade-NOAA
Collaborator

FernandoAndrade-NOAA commented Sep 10, 2024

> @FernandoAndrade-NOAA So you're able to run and generate the baselines for these restart tests, but then they fail on a subsequent check? I'm just making sure that this isn't a runtime issue.

No errors during runtime or baseline generation; this seems to be an issue with the results not being identical. Please refer to the following on Hera as an example:
/scratch1/NCEPDEV/stmp2/Fernando.Andrade-maldonado/FV3_RT/rt_815536/cpld_restart_gfsv17_intel/RESTART/

/scratch2/NAGAPE/epic/Fernando.Andrade-maldonado/regression-tests/wm/2363/ufs-weather-model/tests/logs/log_hera/rt_cpld_restart_gfsv17_intel.log

@zach1221
Collaborator

>> @FernandoAndrade-NOAA So you're able to run and generate the baselines for these restart tests, but then they fail on a subsequent check? I'm just making sure that this isn't a runtime issue.
>
> No errors during runtime or baseline generation; this seems to be an issue with the results not being identical. Please refer to the following on Hera as an example: /scratch1/NCEPDEV/stmp2/Fernando.Andrade-maldonado/FV3_RT/rt_815536/cpld_restart_gfsv17_intel/RESTART/
>
> /scratch2/NAGAPE/epic/Fernando.Andrade-maldonado/regression-tests/wm/2363/ufs-weather-model/tests/logs/log_hera/rt_cpld_restart_gfsv17_intel.log

I am also seeing this persistently on Orion/Hercules with cpld_restart_gfsv17_intel, cpld_restart_bmark_p8_intel, and cpld_restart_pdlib_p8_intel.

@BrianCurtis-NOAA
Collaborator

It could be the bug in the generation of the rt_temp.conf where a test dependency isn't handled correctly.

@dpsarmie
Collaborator Author

dpsarmie commented Sep 11, 2024

OK, I think I found the issue. I was able to fix the gfsv17 case by making K_SPLIT/N_SPLIT consistent between the control and the restart RT.
After doing this, I ran the cpld_restart_gfsv17 test and created new baselines, then reran it against the newly created baselines, and it passed.

I've pushed changes to the other RTs that were causing issues for you all. I'm currently testing the other cases and will let you all know once those pass.

I'm still not 100% sure why this is the fix, but I'll worry about that later.

@dpsarmie
Collaborator Author

@zach1221 @FernandoAndrade-NOAA Ok, all three restarts should be good to go. I made new baselines and checked them on Hera. Let me know if you run into any issues.

@BrianCurtis-NOAA
Collaborator

@dpsarmie my cpld_mpi_pdlib_p8 intel tests are failing comparisons on Acorn and WCOSS2

@zach1221
Collaborator

zach1221 commented Sep 12, 2024

> @dpsarmie my cpld_mpi_pdlib_p8 intel tests are failing comparisons on Acorn and WCOSS2

I'm having the exact same issue (failing to compare) on both Hercules and Orion. I can't say yet on Derecho, as it's been very slow.

@jkbk2004
Collaborator

It turns out this PR needs additional troubleshooting. @zach1221 @FernandoAndrade-NOAA @BrianCurtis-NOAA Can we schedule #2220 to commit today?

@BrianCurtis-NOAA
Collaborator

> It turns out this PR needs additional troubleshooting. @zach1221 @FernandoAndrade-NOAA @BrianCurtis-NOAA Can we schedule #2220 to commit today?

I think @dpsarmie has one last adjustment to make, and #2220 is still in draft status; I don't know why you're even considering it at this point.

@dpsarmie
Collaborator Author

I'm testing the changes that I just merged on Hercules to see if I get the same error. It should be a quick check and I'll report back once it's done.

@dpsarmie
Collaborator Author

Ok, cpld_mpi_pdlib and the other associated files passed on Hercules. @zach1221 @BrianCurtis-NOAA

RT.SH OPTIONS USED:
* (-m) - COMPARE AGAINST CREATED BASELINES
* (-e) - USE ECFLOW

PASS -- COMPILE 's2sw_pdlib_intel' [19:11, 17:39] ( 1 warnings 10 remarks )
PASS -- TEST 'cpld_control_pdlib_p8_intel' [16:26, 13:54](2070 MB)
PASS -- TEST 'cpld_restart_pdlib_p8_intel' [09:22, 06:52](1392 MB)
PASS -- TEST 'cpld_mpi_pdlib_p8_intel' [18:28, 15:30](1995 MB)

@zach1221
Collaborator

Ok @BrianCurtis-NOAA @jkbk2004 testing is complete. We're ready for review.

@zach1221 zach1221 self-requested a review September 13, 2024 14:01
@zach1221 zach1221 merged commit 7062191 into ufs-community:develop Sep 13, 2024
3 checks passed
Labels: Baseline Updates (Current baselines will be updated.) · Ready for Commit Queue (The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.)
Development

Successfully merging this pull request may close these issues.

Update damping and time step for global RT tests