Fix workflow on Cheyenne #672

Merged: 2 commits, Feb 4, 2022

mkavulich
Collaborator

@mkavulich mkavulich commented Feb 3, 2022

DESCRIPTION OF CHANGES:

This pull request supersedes #670

A couple of fixes to get the workflow running on Cheyenne.

  • Remove module purge from load_modules_run_task.sh. It no longer causes failures on Cheyenne, thanks to the intervening PR #650 ("Enhance ability to use template variables"), but it should be removed anyway because it could cause problems in the future.
  • Fix the number of processors passed to the mpirun command for the weather model on Cheyenne. I am honestly not sure how this ever worked, but this change fixes nearly all of the runtime failures currently seen on Cheyenne.
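
The PR text does not show the corrected processor count, so the following is only a sketch (in Python rather than the workflow's shell) of the kind of consistency the second fix is after: the count handed to mpirun should equal the number of MPI tasks the model's layout expects. The layout fields, the formula (compute tasks plus optional write-component tasks), and the executable name are all assumptions for illustration; none of them is taken from load_modules_run_task.sh or the run scripts touched by this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MpiLayout:
    """Hypothetical description of a forecast job's MPI decomposition."""
    layout_x: int                   # compute tasks along x (assumed name)
    layout_y: int                   # compute tasks along y (assumed name)
    quilting: bool = True           # whether a separate write component is used
    write_groups: int = 1           # assumed: number of write groups
    write_tasks_per_group: int = 0  # assumed: tasks in each write group


def total_mpi_tasks(layout: MpiLayout) -> int:
    """Return the task count the launcher should request for this layout."""
    if layout.layout_x < 1 or layout.layout_y < 1:
        raise ValueError("layout_x and layout_y must be positive")
    if layout.write_groups < 0 or layout.write_tasks_per_group < 0:
        raise ValueError("write component sizes must be non-negative")
    compute = layout.layout_x * layout.layout_y
    write = (
        layout.write_groups * layout.write_tasks_per_group
        if layout.quilting
        else 0
    )
    return compute + write


def build_mpirun_command(layout: MpiLayout, executable: str) -> List[str]:
    """Assemble an mpirun argument list whose -np matches the layout."""
    return ["mpirun", "-np", str(total_mpi_tasks(layout)), executable]


if __name__ == "__main__":
    # Placeholder layout and executable name, chosen only for the example.
    example = MpiLayout(layout_x=5, layout_y=2, write_groups=1,
                        write_tasks_per_group=2)
    print(" ".join(build_mpirun_command(example, "./forecast_model.exe")))
```

Deriving the count from a single place, rather than hard-coding it per platform, is one way to keep the launcher and the model configuration from drifting apart.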

TESTS CONDUCTED:

Cheyenne

Ran a set of WE2E tests on Cheyenne, chosen mostly at random to save core hours (though I made sure a variety of domains were run so that several different MPI layouts were tested). Most tasks succeeded, and every failure (aside from one walltime issue) occurred in a test that also fails on Hera with the current develop branch. See issue #673 for more details.

Successful tests:

  • grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_GSD_HRRR_AK_50km_ics_RAP_lbcs_RAP_suite_GSD_SAR
  • grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
  • grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
  • grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta

Unsuccessful tests:

  • All gfdlmp tests (grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp)
  • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
  • GST_release_public_v1
    • Hit walltime limit

Hera, Jet, and Orion

Ran the same set of tests on Hera, Jet, and Orion, with similar results:

  • Hera: the GST completed successfully, though it came close to the walltime limit.
  • Jet: three tests (grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR, grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR, grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta) failed due to missing initial and/or lateral boundary conditions.
  • Orion: four tests (grid_GSD_HRRR_AK_50km_ics_RAP_lbcs_RAP_suite_GSD_SAR, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16) failed due to missing ICs and LBCs.

To summarize, the only test failures were ones that are also seen in develop, and most of them were due to missing input files on the affected platforms.

DEPENDENCIES:

This will need to be merged prior to ufs-community/ufs-srweather-app#206

ISSUE:

#663 has technically already been resolved, but this will fully address that specific issue.

@mkavulich mkavulich added the bug, Tested on Hera, Tested on Cheyenne, Tested on Orion, and Tested on Jet labels Feb 3, 2022
@mkavulich mkavulich self-assigned this Feb 3, 2022
@mkavulich mkavulich changed the title from "Feature/fix cheyenne run" to "Fix workflow on Cheyenne" Feb 4, 2022
@mkavulich mkavulich merged commit f5f8158 into ufs-community:develop Feb 4, 2022
mkavulich added a commit to ufs-community/ufs-srweather-app that referenced this pull request Feb 4, 2022
…l_workflow hash (#206)

## DESCRIPTION OF CHANGES: 
Because of the changes in regional_workflow needed to fix some problems on the Cheyenne platform, the ufs-srweather-app hash needs to be updated, and a `module purge` command needs to be added to the beginning of the Orion platform's build module. This ensures a clean and consistent build environment and avoids tricky-to-solve, environment-specific build errors.

## TESTS CONDUCTED: 
Testing complete on Cheyenne, Orion, Hera, and Jet. All tests pass aside from those with pre-existing issues; see ufs-community/regional_workflow#672

## ISSUE: 
Related to issue ufs-community/regional_workflow#663