
Version 2.2 Release

@gdicker1 released this 03 Jun 20:43

Version 2.2 Functional Release

The EarthWorks Version 2.2 release introduces these new features:

Multi-Platform Support: Version 2.2 is the first EarthWorks release with multi-platform support. We have added GH1, a Grace-Hopper system at the Texas Advanced Computing Center (TACC). GH1 consists of two NVIDIA Grace-Hopper nodes: one compute node and one login/compile node. The Grace (CPU) component is a 72-core Arm v9 processor, and the Hopper component is an NVIDIA H100 GPU. We expect a substantially larger multi-node test system called Vista to replace GH1. Caveats for GH1 in this release (a brief case-setup sketch follows this list):

  • GH1 testing has been performed with the NVHPC 24.1 compiler only.
  • GH1 Grace (CPU) testing has been performed for the FHS94, FKESSLER, and QPC6 (Aquaplanet) compsets only.
  • GH1 Hopper (GPU) offload testing has been performed for the FHS94 (Held-Suarez) test case only. NVIDIA MPS has been verified to work for this GPU offload case.
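
For orientation, the sketch below shows one way to create and build such a case on GH1 through the standard CIME workflow. It is a hedged example rather than a command transcript from this release: the machine name gh1, the case path, and the grid alias are assumptions and may differ in your checkout.

    cd cime/scripts
    # Create an FHS94 (Held-Suarez) case on the 120 km MPAS mesh with the NVHPC compiler.
    # The machine name "gh1" and the grid alias are assumed, not verbatim from this release.
    ./create_newcase --case ~/cases/fhs94_gh1_test --compset FHS94 --res mpasa120_mpasa120 \
        --machine gh1 --compiler nvhpc --run-unsupported
    cd ~/cases/fhs94_gh1_test
    ./case.setup
    ./case.build
    ./case.submit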

NSF NCAR’s Derecho supercomputer remains the principal system supported in the EarthWorks release.

Multi-Component GPU Offload: Version 2.2 is the first functional release of a multi-component GPU offload capability, including the MPAS dynamical core, PUMAS microphysics (pumas_cam-release_v1.36), and the RRTMGP radiative transfer code. The release comes with the following caveats:

  • Basic functionality and correctness of the multi-component GPU offload have been tested on the F2000devEW compset only. We plan a more complete matrix of correctness tests covering other compsets and resolutions in a later release.
  • The performance of the GPU offload version (particularly of the physics) has not been fully optimized.
  • We have not confirmed multi-component GPU offload on the Grace-Hopper platform.

Defining Compsets & Enabling create_test: The new approach provides a more “CESM-like” create/build/run test environment. This includes definitions of tests to be used with CIME’s create_test workflow, adjustments to default values (coupling intervals and component timesteps), and definitions of some commonly used EarthWorks-specific compsets. These additions will make testing EarthWorks simpler in the future and will allow generating baselines and comparing against them.
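
As an illustration of the intended workflow (a sketch assuming the categories described below are registered in the EarthWorks test lists; the baseline name is made up), these test categories can be driven through CIME’s create_test on Derecho roughly as follows:

    cd cime/scripts
    # Run the pull-request category and generate baselines under an arbitrary tag
    ./create_test --xml-category ew-pr --xml-machine derecho --generate ew-v2.2-baseline
    # Re-run the same category later and compare against those baselines
    ./create_test --xml-category ew-pr --xml-machine derecho --compare ew-v2.2-baseline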

The newly added compsets include:

  • F2000climoEW: An analogue to the F2000climo compset in CAM, but with the CICE prescribed mode swapped for the MPAS-SI prescribed mode.
  • F2000devEW: An analogue to the F2000dev compset in CAM, again with MPAS-SI prescribed mode instead of CICE.
  • FullyCoupledEW: A compset that has been mentioned in other releases, formalized here. It uses active MPAS components for the atmosphere, ocean, and seaice.
  • CHAOS2000: The Coupled Hexagonal Atmosphere, Ocean, and Seaice compset. Like FullyCoupledEW, but with an active river-runoff (MOSART) component as well.
  • CHAOS2000dev: Like CHAOS2000, but uses “cam_dev” physics by default instead of “CAM6” physics.

The tests are defined for Derecho and grouped into the following categories:

  • ew-pr: contains tests that are expected to be run when creating a PR, to catch bugs, regressions, or changes that may affect EarthWorks. These tests are meant to consume a small number of core-hours, so they are not exhaustive. In this release they are 5-day “smoke tests” (forward run only) at 120 km, for each supported compset, with various compilers.
  • ew-ver: contains tests that can be run to verify the correctness of EarthWorks (especially versus CESM). In this release the only test described is a 1200-day “smoke test” of FHS94 to match what’s described in https://www.cesm.ucar.edu/models/simple/held-suarez. This group will be expanded in future releases.
  • ew-rel: contains a broader range of test cases that the EarthWorks team expects to pass (along with ew-pr) before creating a release. In this release we tested the CHAOS2000dev compset using an 11-day “exact restart” test, for a few resolutions, and for both the Intel and NVHPC compilers. These are a starting point and will be expanded in future releases. An example CIME test name is sketched below.
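
Individual tests in these categories follow the usual CIME naming convention, TESTTYPE_MODIFIERS.GRID.COMPSET.MACHINE_COMPILER, so a single test can also be run directly. The names below are illustrative only; the exact grid aliases in the EarthWorks test lists may differ.

    # A 5-day smoke test of FHS94 on the 120 km mesh with NVHPC (ew-pr style); grid alias assumed
    ./create_test SMS_Ld5.mpasa120_mpasa120.FHS94.derecho_nvhpc
    # An 11-day exact-restart test (ew-rel style) would use the ERS test type,
    # e.g. ERS_Ld11.<grid>.CHAOS2000dev.derecho_intel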

New Documentation: As we create more releases, we hope to grow the community around EarthWorks. These documents help set some ground rules, start guiding potential contributors, and define the development practices already in place. These guides include:

Description of Model Configurations (Compsets)

See EarthWorks Supported Configurations in the GitHub wiki for more details.

Testing

Tested Systems

NSF NCAR’s Derecho Supercomputer

The majority of tests occurred on Derecho.

CPU-only hardware: Derecho’s CPU-only nodes consist of dual-socket, 64-core, 3rd Gen AMD EPYC™ 7763 Milan processors with 256 GB of DDR4 memory.

CPU/GPU hybrid hardware: Derecho’s GPU nodes consist of a single-socket, 64-core, 3rd Gen AMD EPYC™ 7763 Milan processor with 512 GB of DDR4 memory plus 4 NVIDIA A100 GPUs, each with 40 GB of onboard memory.

TACC’s GH1 Test System

CPU/GPU hybrid hardware: GH1 has one login/compile node and one compute node with the same hardware on each. The Grace (CPU) component is a 72-core Arm v9 processor, and the Hopper component is an NVIDIA H100 GPU.

Tested Software Stacks

Compiler Versions

Derecho:

  • ifort (Intel Classic compiler version 2023.2.1)
  • ifx (Intel oneAPI compiler version 2023.2.1)
  • nvfortran (NVHPC Fortran compiler version 24.3)
  • gfortran (GNU Fortran compiler version 12.2.0)

GH1:

  • nvfortran (NVHPC Fortran compiler version 24.1)

Libraries

Derecho:

  • MPI (Cray MPICH version 8.1.27)
  • Parallel-NetCDF (version 1.12.3)
  • PIO2 (version 2.6.2)
  • ESMF (version 8.6.0)

GH1:

  • MPI (OpenMPI version 4.1.7a1)
  • Parallel-NetCDF (version 1.12.3)
  • PIO2 (version 2.6.2)
  • ESMF (version v8.7.0b05)

Testing Results

Derecho create_test Results

To test this release, CPU-only tests were carried out on Derecho using the ew-pr and ew-rel categories as described above.

5-Day Smoke Tests (ew-pr)

  • 120km FHS94 with GNU, Intel-OneAPI, and NVHPC (Overall: PASS)
  • 120km FKESSLER with NVHPC (Overall: PASS)
  • 120km QPC6 with NVHPC (Overall: PASS)
  • 120km F2000climoEW with NVHPC (Overall: FAIL)
    • This test failed because the wrong resolution (mpasa120_mpasa120) was requested. Since this test uses MPASSI%PRES mode, it must use an oQU120 grid for the ocean and sea ice. This has been corrected, but the fix is untested in this release.
  • 120km FullyCoupledEW with NVHPC (Overall: FAIL)
    • Same issue as with CHAOS2000dev below.
  • 120km CHAOS2000dev with NVHPC, Intel, and GNU (Overall: FAIL)
    • These tests ran through every step successfully except the final short-term archiving step. The archiver does not know which files to copy for the MPAS-O and MPAS-SI components; this will be resolved in a future release. Message: ERROR: No archive entry found for components: ['ICE', 'OCN']

11-Day Exact Restart Tests (ew-rel)

  • 120km, 32L CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
    • Failed due to errors accumulating in CLUBB routines, leading to a segmentation fault. Run logs show Infinity and NaN values in the array invrs_tau_xp2_zm and the message “Error calling advance_xp2_xpyp”.
  • 120km, 58L CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
  • 60km CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
  • 30km CHAOS2000dev with NVHPC and Intel (Overall: FAIL)
  • 15km, 58L CHAOS2000dev with NVHPC and Intel (Overall: FAIL)

Known Issues

  • NVHPC compilers (tested nvfortran version 23.X, from previous releases): Initializing from restart fails.
    • Additional details: Any configuration (tested with QPC6, F2000climoEW, FullyCoupledEW) that attempts to restart from a previous run will fail in the CAM subroutine dyn_init.
    • Resolutions affected: all supported resolution/level combinations.
    • Workaround: Run without restarts (a brief sketch follows this list).
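
A minimal sketch of that workaround in a CIME case directory, using standard CIME XML variables: rather than restarting, run the full desired length as a single segment. The run length shown is only an example.

    # Do not continue from a previous run, and do not resubmit
    ./xmlchange CONTINUE_RUN=FALSE,RESUBMIT=0
    # Run the entire simulation as one segment (30 days here is only an example)
    ./xmlchange STOP_OPTION=ndays,STOP_N=30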

Known issues by compset/compiler/resolution (CPU-only):

See the Derecho create_test Results above

Known issues by compset/compiler/resolution (Hybrid CPU-GPU)

  • Drastic effect of Physics Columns (PCOLS) on GPU performance
    • Additional details: PCOLS sets the number of physics columns an MPI rank processes during the run. When running multiple physics packages on GPUs, PCOLS must be set to a larger value than the CPU default to get good GPU performance.
    • Resolutions affected: all supported resolution/level combinations that use a combination of GPUs, cam_dev physics, and rrtmgp_gpu radiation.
    • Workaround: Change PCOLS using xmlchange while setting up a case. E.g., for a case just created, use this command to request rrtmgp_gpu and set a valid PCOLS value: ./xmlchange --append CAM_CONFIG_OPTS="-rad rrtmgp_gpu -pcols 2048" (a fuller sketch follows this list).
    • NOTE: 2048 is the maximum PCOLS value we can use on Derecho with NVHPC. Any number greater than 2048 causes a build error, and numbers below 2048 result in worse performance in our test cases.
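
For context, a hedged end-to-end sketch of applying this workaround (the case directory name is illustrative; the xmlchange line is the one given above):

    # Starting from a freshly created F2000devEW case directory on Derecho (name illustrative)
    cd ~/cases/my_f2000devEW_gpu_case
    # Request the GPU radiation scheme and a large PCOLS value before setup
    ./xmlchange --append CAM_CONFIG_OPTS="-rad rrtmgp_gpu -pcols 2048"
    ./case.setup
    ./case.build
    ./case.submit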