What's Changed
Important
All HPC VM images based on CentOS 7 have been deprecated. As a result,
references to the "hpc-centos-7" family in the "cloud-hpc-image-public"
project will fail. We recommend migrating to the "hpc-rocky-linux-8" family,
which is now the default throughout the Toolkit. If CentOS 7 is truly needed,
the final HPC CentOS 7 image can still be referenced by name: "hpc-centos-7-v20240712".
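As a rough sketch of the migration, a blueprint module that selects its VM image could switch from the deprecated family to the new default as shown below. This assumes the module exposes an `instance_image` setting that accepts either a `family` or a `name` together with a `project`; check the README of the module you use, since setting names can differ between modules.

```yaml
# Fragment of a module's settings in a blueprint (illustrative only).
# Assumption: the module accepts an `instance_image` setting.
settings:
  # Recommended: use the Rocky Linux 8 HPC image family.
  instance_image:
    family: hpc-rocky-linux-8
    project: cloud-hpc-image-public

  # Only if CentOS 7 is truly needed: pin the final image by name
  # instead of by family (the "hpc-centos-7" family no longer resolves).
  # instance_image:
  #   name: hpc-centos-7-v20240712
  #   project: cloud-hpc-image-public
```

Pinning by name means the image will not receive further updates, so the family-based form is preferable wherever the workload allows it.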
Key New Features 🎉
- GKE A3 High blueprint and GKE A3 Mega blueprint with automated GPU networking performance enhancements
- Add enable-maintenance-reservation flag in Slurm to control reservation for scheduled maintenance by @harshthakkar01 in #2987
- Add documentation for versioned blueprint feature by @RachaelSTamakloe in #3055
- Add unit test for versioned blueprint caching mechanism by @RachaelSTamakloe in #3052
New Modules 🧱
- Implement kubectl-apply module by @sharabiani in #2980
Module Improvements 🔨
- Default to zonal bulkInsert by @mr0re1 in #3005
- Add machine type availability checks by @annuay-google in #3003
- Add support for enabling TCPX/TCPXO on A3 and A3 Mega VMs, and provide a script for injecting the RxDM sidecar and other required components into user workloads by @chengcongdu in #3012
- Support ghpc_stage function in kubectl-apply module by @sharabiani in #3036
- Validate Reservations in GKE Blueprints by @arajmane-g in #3024
- Fix multivpc missing region by @wiktorn in #3046
- Add initial_node_count support to gke-node-pool by @sharabiani in #3068
Improvements 🛠
- Update gVNIC driver in A3 Mega solution by @tpdownes in #2957
- Implement udev-based approach to mounting aperture devices by @tpdownes in #2955
- Update Debian 12 image in A3 Mega solution by @tpdownes in #2958
- adding module cache to prevent repeated module downloads during modul… by @RachaelSTamakloe in #3010
- Add additional VPC validation for A3/A3 Mega machines by @chengcongdu in #3049
- Add option to install Kueue/JobSet on a GKE cluster via blueprints by @ankitkinra in #3017
- Update README for GPUDirect by @chengcongdu in #3059
Deprecations 💤
- Slurm-GCP v6: remove CentOS 7 image support by @mr0re1 in #3038
- Remove deprecated Spack setup variables by @RachaelSTamakloe in #3040
- Remove deprecated Ramble setup variables by @RachaelSTamakloe in #3041
Version Updates ⏫
- Update NeMo 23.11 to 24.07 by @akiki-liang0 in #3090
Bug fixes 🐞
- Retry mounting DAOS container by @harshthakkar01 in #3045
- Add argparse dependency to Cloud Build by @chengcongdu in #3057
- Allow users to provide a commit hash instead of git tag for Spack and Ramble installations by @rohitramu in #3073
- Resolve error when var.initial_node_count is null by @RachaelSTamakloe in #3081
- A3 High blueprint prolog solution updates by @tpdownes in #3088
Other changes
- Add NeMo README instructions for preloading the GPT-2 tokenizer by @koallison in #3075
New Contributors
- @koallison made their first contribution in #3075
- @akiki-liang0 made their first contribution in #3090
Full Changelog: v1.39.0...v1.40.0