From c6d9937707ec8d5839b5045d0b00e18c84a3cd47 Mon Sep 17 00:00:00 2001
From: Michael Kavulich
Date: Fri, 4 Feb 2022 09:48:44 -0700
Subject: [PATCH] Fix workflow on Cheyenne (#672)

## DESCRIPTION OF CHANGES:
A couple of fixes to get the workflow running on Cheyenne.
- Remove `module purge` from load_modules_run_task.sh. Since PR #650 was merged, this no longer causes failures on Cheyenne, but it should be removed anyway because it can cause future issues.
- Fix the number of processors used in the mpirun command for the weather model on Cheyenne. I am honestly not sure how this was ever working, but this change fixes nearly all of the runtime failures currently seen on Cheyenne.

## TESTS CONDUCTED:
### Cheyenne
Ran a set of WE2E tests on Cheyenne, chosen mostly at random to save core hours (I did ensure that a variety of domains were run so that several different MPI layouts were tested). Most tasks succeed, and all failures (aside from one walltime issue) are also tests that fail on Hera with the current develop branch. See issue #673 for more details.
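
As background for the `RUN_CMD_FCST` change exercised by these tests, here is a minimal sketch of how a single-quoted run command defers variable expansion until it is used. The values of `nprocs` and `PE_MEMBER01`, the `expand_cmd` helper, and the use of `eval` are assumptions for illustration only; the actual workflow may expand the command differently.

```shell
#!/usr/bin/env bash
# Hypothetical values; the real workflow computes these elsewhere.
nprocs=4          # assumed: a process count that does not match the forecast layout
PE_MEMBER01=24    # assumed: total MPI ranks needed by the forecast model

# Single quotes keep the variable reference literal, as in ush/machine/cheyenne.sh.
RUN_CMD_FCST_OLD='mpirun -np $nprocs'
RUN_CMD_FCST_NEW='mpirun -np ${PE_MEMBER01}'

# expand_cmd is a hypothetical helper that expands a stored template at call time.
expand_cmd() {
  eval "echo \"$1\""
}

old_cmd=$(expand_cmd "$RUN_CMD_FCST_OLD")
new_cmd=$(expand_cmd "$RUN_CMD_FCST_NEW")

# Print the commands instead of running them, so no MPI installation is needed.
echo "old: $old_cmd"
echo "new: $new_cmd"
```

Because the template is expanded late, whichever variable it names must hold the correct rank count at the moment the forecast task runs. The change in this patch switches that name from `nprocs` to `PE_MEMBER01`.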
**Successful tests:**
- grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_GSD_HRRR_AK_50km_ics_RAP_lbcs_RAP_suite_GSD_SAR
- grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta

**Unsuccessful tests:**
- All gfdlmp tests (grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp)
- grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
- GST_release_public_v1 - Hit walltime limit

### Hera, Jet, and Orion
Ran the same set of tests on Hera, Jet, and Orion, with similar results. On Hera the GST successfully completed (though was close to reaching the walltime limit). On Jet, a few tests (grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR, grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR, grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta) failed due to missing initial and/or lateral boundary conditions. On Orion, even more tests failed due to missing ICs and LBCs (grid_GSD_HRRR_AK_50km_ics_RAP_lbcs_RAP_suite_GSD_SAR, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16).

**To summarize, the only test failures were those that were also seen in develop, and mostly due to missing input files on those platforms.**

## DEPENDENCIES:
This will need to be merged prior to https://github.com/ufs-community/ufs-srweather-app/pull/206

## ISSUE:
#663 has technically already been resolved, but this will fully address that specific issue.
---
 ush/load_modules_run_task.sh | 2 --
 ush/machine/cheyenne.sh      | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/ush/load_modules_run_task.sh b/ush/load_modules_run_task.sh
index 59e3e156f8..28b0f56471 100755
--- a/ush/load_modules_run_task.sh
+++ b/ush/load_modules_run_task.sh
@@ -132,8 +132,6 @@ jjob_fp="$2"
 #-----------------------------------------------------------------------
 #
-module purge
-
 machine=$(echo_lowercase $MACHINE)
 env_fp="${SR_WX_APP_TOP_DIR}/env/${BUILD_ENV_FN}"
 module use "${SR_WX_APP_TOP_DIR}/env"
diff --git a/ush/machine/cheyenne.sh b/ush/machine/cheyenne.sh
index d32b1f90da..a707901dbe 100755
--- a/ush/machine/cheyenne.sh
+++ b/ush/machine/cheyenne.sh
@@ -56,7 +56,7 @@ FIXLAM_NCO_BASEDIR=${FIXLAM_NCO_BASEDIR:-"/needs/to/be/specified"}
 RUN_CMD_SERIAL="time"
 RUN_CMD_UTILS='mpirun -np $nprocs'
-RUN_CMD_FCST='mpirun -np $nprocs'
+RUN_CMD_FCST='mpirun -np ${PE_MEMBER01}'
 RUN_CMD_POST='mpirun -np $nprocs'
 # MET Installation Locations