
Fix modulefiles for Hera/Rocky8 OS. #2194

Merged
merged 10 commits into ufs-community:develop on Mar 22, 2024

Conversation

RatkoVasic-NOAA
Collaborator

@RatkoVasic-NOAA RatkoVasic-NOAA commented Mar 15, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on Hera, Derecho, or Hercules
  • Commit the 'test_changes.list' from the previous step

Description:

Hera is switching to a new OS. This update enables the ufs-weather-model to run on the Rocky8 OS.
The necessary changes have been made to the spack-stack libraries.
NOTE: Since a different version of openmpi is used, results change when using the GNU compiler.

Commit Message:

  • Update module paths in the Hera intel/gnu lua files for the Rocky8 OS spack-stack libraries.

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • PR updates/changes baselines (only GNU results are expected to change).

Input data Changes:

  • None.

Library Changes/Upgrades:

Library changes are included in this PR (spack-stack).


Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@RatkoVasic-NOAA
Collaborator Author

This should be tested on the Rocky 8 login nodes (hfe09-hfe12).
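A quick way to confirm you are on one of those nodes before testing is to check the hostname and the OS release (a sketch; `/etc/os-release` is standard on RHEL-family systems such as Rocky 8):

```shell
# Confirm the current login node and its OS before running tests.
hostname
grep -E '^(NAME|VERSION_ID)=' /etc/os-release
```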

@junwang-noaa
Collaborator

@RatkoVasic-NOAA With this module update we can only run the UFS weather model code on hfe09-hfe12. Can we run the model on the other Hera nodes once your PR is committed? Thanks

@RatkoVasic-NOAA
Collaborator Author

> @RatkoVasic-NOAA With this module update we can only run the UFS weather model code on hfe09-hfe12. Can we run the model on the other Hera nodes once your PR is committed? Thanks

@junwang-noaa, no, you cannot use both. But you can use the old modulefiles (ufs_hera.gnu.lua, ufs_hera.intel.lua), which are in use now. More and more resources are being moved from CentOS to Rocky8, so it might be good to switch sooner rather than later.

@jkbk2004
Collaborator

@RatkoVasic-NOAA can you continue to sync up the branch? We may need to schedule this PR tomorrow.

@RatkoVasic-NOAA
Collaborator Author

> @RatkoVasic-NOAA can you continue to sync up the branch? We may need to schedule this PR tomorrow.

Done.

@SamuelTrahanNOAA
Collaborator

The control_wam_debug_gnu test failed for me with a floating-point exception. I haven't tried resubmitting yet.

It is here:

HERA: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_3224016/control_wam_debug_gnu

Big long backtrace
srun: lua: This job was submitted from a host running Rocky 8. Assigning job to el8 reservation.
 21: 
 21: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
 21: 
 21: Backtrace for this error:
 45: 
 45: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
 45: 
 45: Backtrace for this error:
 21: #0  0x14c10f100b4f in ???
 21: #1  0x1ce24ec in __nh_utils_mod_MOD_sim1_solver
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/atmos_cubed_sphere/model/nh_utils.F90:1557
 21: #2  0x1d0f9c7 in __nh_utils_mod_MOD_riem_solver_c._omp_fn.0
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/atmos_cubed_sphere/model/nh_utils.F90:481
 21: #3  0x14c111078721 in GOMP_parallel
 21:    at /tmp/role.apps/spack-stage/spack-stage-gcc-9.2.0-ku6r4f5qa5obpfnqpa6pezhogxq6sp7h/spack-src/libgomp/parallel.c:171
 21: #4  0x1d02ae4 in __nh_utils_mod_MOD_riem_solver_c
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/atmos_cubed_sphere/model/nh_utils.F90:501
 21: #5  0x18392b8 in __dyn_core_mod_MOD_dyn_core
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/atmos_cubed_sphere/model/dyn_core.F90:636
 21: #6  0x18e743f in __fv_dynamics_mod_MOD_fv_dynamics
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/atmos_cubed_sphere/model/fv_dynamics.F90:691
 21: #7  0x178590a in __atmosphere_mod_MOD_atmosphere_dynamics
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:699
 21: #8  0x146d8c4 in __atmos_model_mod_MOD_update_atmos_model_dynamics
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/atmos_model.F90:854
 21: #9  0x12cde44 in fcst_run_phase_1
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/module_fcst_grid_comp.F90:1306
 21: #10  0x8dfb40 in ???
 21: #11  0x8dfeb4 in ???
 21: #12  0x7ee006 in ???
 21: #13  0x7f0ed8 in ???
 21: #14  0xde6522 in ???
 21: #15  0x8dde82 in ???
 21: #16  0x80c08c in ???
 21: #17  0xad748e in ???
 21: #18  0x12bbbea in modeladvance_phase1
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/fv3_cap.F90:1026
 21: #19  0x12bbf34 in modeladvance
 21:    at /scratch2/BMC/wrfruc/Samuel.Trahan/westwater/nested-workflow/sorc/ratko-rocky/FV3/fv3_cap.F90:975

@SamuelTrahanNOAA
Collaborator

The cpld_control_p8_gnu and cpld_debug_p8_gnu both fail with this message:

 21: --------------------------------------------------------------------------
 21: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
 21: Workarounds are to run on a single node, or to use a system with an RDMA
 21: capable network such as Infiniband.
 21: --------------------------------------------------------------------------
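Whether an OpenMPI build advertises MPI_THREAD_MULTIPLE support can be checked with `ompi_info` (a sketch; it assumes an openmpi module is loaded so `ompi_info` is on PATH, and prints a note otherwise):

```shell
# Look for the "Thread support" line in the OpenMPI build summary.
if command -v ompi_info >/dev/null 2>&1; then
  ompi_info | grep -i 'thread support' || echo "no 'Thread support' line found"
else
  echo "ompi_info not found; load an openmpi module first"
fi
```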

@jkbk2004
Collaborator

RegressionTests_hera.log

@jkbk2004
Collaborator

> /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_3224016/control_wam_debug_gnu

@SamuelTrahanNOAA all tests pass on my side. @zach1221 @FernandoAndrade-NOAA can you test gnu cases on hera/rocky8?

@SamuelTrahanNOAA
Collaborator

Did your tests pass on the first try or did you have to rerun them?

@jkbk2004
Collaborator

> Did your tests pass on the first try or did you have to rerun them?

It passed on the first try. A few other people are running the gnu cases now. We can confirm.

@zach1221
Collaborator

> The cpld_control_p8_gnu and cpld_debug_p8_gnu both fail with this message:
>
>  21: --------------------------------------------------------------------------
>  21: The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
>  21: Workarounds are to run on a single node, or to use a system with an RDMA
>  21: capable network such as Infiniband.
>  21: --------------------------------------------------------------------------

I am also receiving this error on Hera, for cpld_control_p8_gnu and cpld_debug_p8_gnu.

@jkbk2004
Collaborator

@RatkoVasic-NOAA it sounds like the results differ case by case. Are some nodes still heterogeneous? An openmpi or gcc version issue?

@SamuelTrahanNOAA
Collaborator

@zach1221 - Can you reproduce the error I saw with control_wam_debug_gnu?

It may have been caused by the job being sent to the wrong service (login) node.

@jkbk2004
Collaborator

I am not sure if we are triggering -mcmodel=medium on hera/gnu.

@BrianCurtis-NOAA
Collaborator

For ecflow, if used: I have a feeling the ECF_HOST env var on Hera isn't set properly with this transition. I logged into a Rocky 8 node, ran 'module load ecflow' and 'printenv | grep ECF', and only the _ROOT env var showed up. Try manually setting the Hera ecflow ECF_HOST var to (I think) hfe12 and see if that helps (if needed).
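As a sketch of that workaround (hfe12 is Brian's guess above, not a verified value):

```shell
# Manually point the ecFlow client at a Rocky 8 login node,
# then confirm the variable is visible in the environment.
export ECF_HOST=hfe12
printenv | grep '^ECF_'
```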

@zach1221
Collaborator

> @zach1221 - Can you reproduce the error I saw with control_wam_debug_gnu?
>
> It may have been caused by the job being sent to the wrong service (login) node.

With my first test, where cpld_control_p8_gnu and cpld_debug_p8_gnu failed, control_wam_debug_gnu actually passed. I'm retesting now with some changes to GNU.cmake.

@SamuelTrahanNOAA
Collaborator

I used Rocoto and saw those bugs, so the problem is not specific to ecFlow.

@jkbk2004
Collaborator

@climbfuji I am not sure about the OSC pt2pt issue. I vaguely remember a similar issue being seen with openmpi on Hercules. Do you remember? @RatkoVasic-NOAA @ulmononian Any comment?

@RatkoVasic-NOAA
Collaborator Author

@jkbk2004 I'm looking into this right now. I haven't seen this error message before.

@zach1221
Collaborator

zach1221 commented Mar 20, 2024

@jkbk2004 forcing the runs onto nodes 5-12 didn't work; they fail with the same OSC pt2pt error. The GNU.cmake update test timed out, so I'm running it again with a manually extended time limit.

Update: the cpld_control_p8_gnu test failed with the same error after adding -mcmodel=large and -mcmodel=medium to GNU.cmake.

@RatkoVasic-NOAA
Collaborator Author

> @jkbk2004 we've decided to proceed with testing on this PR, disable the failing gnu cases on Hera, and enable the one control_wam_debug_gnu case on Hercules. I will create a new issue for the failing cases on Hera; in the meantime @RatkoVasic-NOAA can work on installing a new spack-stack with gnu/13.2.0 and openmpi/4.1.6 on Hera. Given that may take some time, we will work only on PRs that do not change baselines until the new spack-stack installation is complete, and we can hopefully re-enable the Hera cpld gnu cases. @BrianCurtis-NOAA @DeniseWorthen please feel free to add anything I may have missed.

As @climbfuji explained, I'm not going to start working on GNU 13 until all packages work with that version. Though I will try it in my personal space in the meantime.

@SamuelTrahanNOAA
Collaborator

Do you have to use OpenMPI for this? Can't you use an MPICH derivative instead?

@climbfuji
Collaborator

> Do you have to use OpenMPI for this? Can't you use an MPICH derivative instead?

If you want to use [email protected] then you can't use mpich@4 - I don't remember when the bug fix in mapl that allows using mpich@4 was merged

@BrianCurtis-NOAA
Collaborator

> gnu@13 is very likely not going to work yet. I know for sure that packages like mapl didn't build; this was fixed only recently in the authoritative repo.
>
> [email protected] is recommended if you use it as the main compiler, but you can't use it as a backend for the Intel compilers - that needs to be anything between 9.2 and 11.x.
>
> See https://github.com/JCSDA/spack-stack/blob/develop/configs/sites/discover-scu16/compilers.yaml where the default OS gnu compiler is too old and we have to inject a newer gcc compiler for C++-17 support into the Intel compiler config.
>
> See https://github.com/JCSDA/spack-stack/blob/develop/configs/sites/hercules/compilers.yaml where this isn't necessary, because the default/OS GNU is 11.3.0.
>
> On Hera, the default gcc is 8.5.0 - too old.

Gnu 12.2 is fine with me. I wasn't sure how complicated it would make things on hera. Important to get this started ASAP.

@climbfuji
Collaborator

climbfuji commented Mar 21, 2024

> > Do you have to use OpenMPI for this? Can't you use an MPICH derivative instead?
>
> If you want to use [email protected] then you can't use mpich@4 - don't remember when the bug fix in mapl was merged that allows using mpich@4

[email protected] works with mpich@4 - https://github.com/GEOS-ESM/MAPL/releases/tag/v2.42.0

@RatkoVasic-NOAA
Collaborator Author

> Gnu 12.2 is fine with me. I wasn't sure how complicated it would make things on hera. Important to get this started ASAP.

We don't have 12.2 on Hera/Rocky. Only 9.2.0 and 13.2.0 (for now).

@zach1221 zach1221 added Baseline Updates Current baselines will be updated. Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. labels Mar 21, 2024
@BrianCurtis-NOAA
Collaborator

> > Gnu 12.2 is fine with me. I wasn't sure how complicated it would make things on hera. Important to get this started ASAP.
>
> We don't have 12.2 on Hera/Rocky. Only 9.2.0 and 13.2.0 (for now).

How long would it take their SAs to install 12.2? Would it be easier to wait for that than to try to get 13.2 working with spack-stack?

@RatkoVasic-NOAA
Collaborator Author

> > > Gnu 12.2 is fine with me. I wasn't sure how complicated it would make things on hera. Important to get this started ASAP.
> >
> > We don't have 12.2 on Hera/Rocky. Only 9.2.0 and 13.2.0 (for now).
>
> How long would it take their SA to install 12.2? Would it be easier to wait for that over trying to get 13.2 working with spack stack?

That is a good question, meaning: I don't know the answer ;-) I will first try with 13.2.0 and see what we need just for the WM.

@climbfuji
Collaborator

climbfuji commented Mar 21, 2024

> > > > Gnu 12.2 is fine with me. I wasn't sure how complicated it would make things on hera. Important to get this started ASAP.
> > >
> > > We don't have 12.2 on Hera/Rocky. Only 9.2.0 and 13.2.0 (for now).
> >
> > How long would it take their SA to install 12.2? Would it be easier to wait for that over trying to get 13.2 working with spack stack?
>
> That is good question, meaning: I don't know the answer ;-) I will first try with 13.2.0 and see what we need only for WM.

@RatkoVasic-NOAA It won't work, since mapl doesn't work with 13.2 and you need that for the UFSWM. The last change for gnu@13 in mapl was apparently merged last week, and there isn't even a release yet - GEOS-ESM/MAPL#2640. EDIT: this was for mapl@3.

I don't know which tag if any of mapl@2 works with gnu@13 - @mathomp4 probably knows.

@mathomp4

> > That is good question, meaning: I don't know the answer ;-) I will first try with 13.2.0 and see what we need only for WM.
>
> @RatkoVasic-NOAA It won't work since mapl doesn't work with 13.2 and you need that for the UFSWM. The last change for gnu@13 for mapl was apparently merged last week, there isn't even a release yet - GEOS-ESM/MAPL#2640. - EDIT this was for mapl@3.
>
> I don't know which tag if any of mapl@2 works with gnu@13 - @mathomp4 probably knows.

At the moment no official release of MAPL 2 works with GCC 13. But MAPL develop (and MAPL 3) does, thanks to @tclune. It would require the GFE v1.13 libraries, as those had workarounds for GCC 13.[1]

That said, if needed we could release MAPL 2.45 with those fixes... but note that at the moment MAPL 2.44+ doesn't build in spack. That is due to the ESMF::ESMF target business. ESMF has released, or will soon release, a beta snapshot of 8.6.1 that should have the fixes for that, though some upstream repos will need to update their FindESMF.cmake files (though maybe not for spack? confusing sometimes).

Footnotes

  1. I am currently building Baselibs and GEOS to test GCC 13 and the full model. My guess is GEOSgcm should be fine as long as MAPL is. MAPL is where the fancy Fortran is.

@BrianCurtis-NOAA
Collaborator

I didn't think we needed to run all systems. No code touches any other system. I thought Hercules was a special case because of the changes to the cpld tests.

@zach1221
Collaborator

> I didn't think we needed to run all systems. No code touches any other system. I thought Hercules was a special case because of the changes to the cpld tests.

Ok, I was just being safe. We don't have to finish Jet and wcoss2/Acorn if you don't think it's necessary. But yes, you're correct, only Hercules/Hera had changes.

@zach1221
Collaborator

@BrianCurtis-NOAA @DeniseWorthen @jkbk2004 testing is complete. Feel free to provide final review.

@zach1221 zach1221 merged commit 7fdb58c into ufs-community:develop Mar 22, 2024
@jkbk2004
Collaborator

@aerorahul We moved to Rocky8. FYI, we will revisit the gnu/openmpi issue on Rocky8.

Successfully merging this pull request may close these issues.

Need new modules for ufs build on hera rocky nodes, jet rocky nodes, and orion rocky nodes