Add module purge to beginning of orion build module, update regional_workflow hash #206

Merged

@mkavulich mkavulich commented Feb 1, 2022

DESCRIPTION OF CHANGES:

Changes in regional_workflow that fix several problems on the Cheyenne platform require two updates here: the regional_workflow hash used by ufs-srweather-app needs to be updated, and a `module purge` command needs to be added to the beginning of the Orion platform's build module. The purge ensures a clean and consistent build environment and avoids tricky-to-solve environment-specific build errors.
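
As a rough illustration of the "purge first" convention described above, the following Python sketch checks whether the first real command in a build module is `module purge`. It assumes the build module is a shell-style text file in which comments start with `#`; the function names and the example file contents (including the `some_compiler` module) are hypothetical and not taken from this repository.

```python
def first_command(script_text: str) -> str:
    """Return the first line that is neither blank nor a comment."""
    for raw in script_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            return line
    return ""


def starts_with_module_purge(script_text: str) -> bool:
    """True if the first real command in the script is `module purge`."""
    return first_command(script_text).split() == ["module", "purge"]


# Hypothetical build-module contents, for illustration only.
example = """\
# Build environment for an HPC platform (illustrative)
module purge
module load some_compiler
"""

print(starts_with_module_purge(example))
```

A check like this could run in CI to catch a build module that loads packages on top of whatever the user's login environment already provides.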

TESTS CONDUCTED:

Testing complete on Cheyenne, Orion, Hera, and Jet. All tests pass aside from those with pre-existing issues; see the regional_workflow PR for more details.

DEPENDENCIES:

ISSUE (optional):

Related to issue ufs-community/regional_workflow#663

…to updated regional_workflow hash (needs to be updated after that PR is merged)
@mkavulich mkavulich self-assigned this Feb 1, 2022
@mkavulich mkavulich added the labels "Tested on Cheyenne" (successfully tested on Cheyenne machine), "Tested on Hera" (tested successfully on Hera machine), and "Tested on Jet" (successfully tested on Jet machine) Feb 3, 2022
@mkavulich mkavulich marked this pull request as ready for review February 4, 2022 15:37
mkavulich added a commit to ufs-community/regional_workflow that referenced this pull request Feb 4, 2022
## DESCRIPTION OF CHANGES: 

A couple of fixes to get the workflow running on Cheyenne.

 - Remove `module purge` from load_modules_run_task.sh. The purge no longer causes failures on Cheyenne thanks to the intervening PR #650, but it should be removed anyway because it can cause future issues.
 - Fix the number of processors used in the mpirun command for the weather model on Cheyenne. I am honestly not sure how this ever worked, but this change fixes nearly all of the runtime failures currently seen on Cheyenne.
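
The PR text does not say how the processor count was wrong, so the following is only a generic sketch of keeping the count passed to mpirun consistent with the model's MPI layout. It assumes the total rank count is the compute decomposition (x by y) plus any dedicated write tasks; the class, field names, executable name, and numbers are all hypothetical, and `-np` is the common mpirun flag for the number of processes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MpiLayout:
    """Hypothetical description of a model's MPI decomposition."""
    layout_x: int
    layout_y: int
    write_groups: int = 0
    write_tasks_per_group: int = 0

    def total_ranks(self) -> int:
        # Assumption: compute ranks plus dedicated write ranks.
        compute = self.layout_x * self.layout_y
        writers = self.write_groups * self.write_tasks_per_group
        return compute + writers


def mpirun_command(layout: MpiLayout, executable: str) -> List[str]:
    """Build an mpirun argument list whose -np matches the layout."""
    return ["mpirun", "-np", str(layout.total_ranks()), executable]


# Illustrative values only: a 5x2 compute layout with one write group of 2 tasks.
layout = MpiLayout(layout_x=5, layout_y=2, write_groups=1, write_tasks_per_group=2)
print(mpirun_command(layout, "./model.exe"))
# prints ['mpirun', '-np', '12', './model.exe']
```

Deriving the count from the layout in one place avoids the situation where the launcher and the model disagree about how many ranks exist.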

## TESTS CONDUCTED: 
### Cheyenne
Ran a set of WE2E tests on Cheyenne, chosen mostly at random to save core hours (I did ensure that a variety of domains were run so that several different MPI layouts were tested). Most tasks succeeded, and all failures (aside from one walltime issue) are in tests that also fail on Hera with the current develop branch. See issue #673 for more details.

**Successful tests:**
 - grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
 - grid_GSD_HRRR_AK_50km_ics_RAP_lbcs_RAP_suite_GSD_SAR
 - grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
 - grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
 - grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
 - grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
 - grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
 - grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
 - grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
 - grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta

**Unsuccessful tests:**
 - All gfdlmp tests (grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp)
 - grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
 - GST_release_public_v1
   - Hit walltime limit
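
The comparison behind the claim that the Cheyenne failures match the develop baseline can be sketched as a set difference. The test names come from the lists above; the baseline set is an assumption built from the statement that every failure other than the walltime case also fails on Hera with develop, and the function name is hypothetical.

```python
from typing import List, Set

# Failures observed on Cheyenne, as listed above.
cheyenne_failures: Set[str] = {
    "grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp",
    "grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional",
    "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp",
    "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16",
    "GST_release_public_v1",
}

# Assumed baseline: tests stated to fail on Hera with the current develop branch.
develop_failures: Set[str] = {
    "grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp",
    "grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional",
    "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp",
    "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16",
}


def new_failures(observed: Set[str], baseline: Set[str]) -> List[str]:
    """Return failures that are not already present in the baseline."""
    return sorted(observed - baseline)


print(new_failures(cheyenne_failures, develop_failures))
# prints ['GST_release_public_v1']
```

Only the walltime case remains after subtracting the baseline, which is the one exception called out in the text.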

### Hera, Jet, and Orion
Ran the same set of tests on Hera, Jet, and Orion, with similar results:

 - Hera: the GST completed successfully, though it came close to reaching the walltime limit.
 - Jet: a few tests (grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR, grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR, grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta) failed due to missing initial and/or lateral boundary conditions.
 - Orion: even more tests failed due to missing ICs and LBCs (grid_GSD_HRRR_AK_50km_ics_RAP_lbcs_RAP_suite_GSD_SAR, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_2017_gfdlmp, grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16).

**To summarize, the only test failures were those also seen in develop, and they were mostly due to missing input files on those platforms.**

## DEPENDENCIES:
This will need to be merged prior to ufs-community/ufs-srweather-app#206

## ISSUE: 
#663 has technically already been resolved, but this will fully address that specific issue.
@mkavulich mkavulich merged commit 318f258 into ufs-community:develop Feb 4, 2022
mkavulich added a commit to mkavulich/ufs-srweather-app that referenced this pull request Aug 26, 2022
mkavulich added a commit that referenced this pull request Sep 8, 2022
SamuelTrahanNOAA pushed a commit to SamuelTrahanNOAA/ufs-srweather-app that referenced this pull request Sep 15, 2023