cluster-toolkit/community/modules/compute/schedmd-slurm-gcp-v5-partition at main · GoogleCloudPlatform/cluster-toolkit

Description

This module creates a compute partition that can be used as input to the schedmd-slurm-gcp-v5-controller.

The partition module is designed to work alongside the schedmd-slurm-gcp-v5-node-group module. A partition can be made up of one or more node groups, provided either through use (preferred) or defined manually in the node_groups variable.

Warning: updating a partition and running terraform apply will not cause the slurm controller to update its own configurations (slurm.conf) unless enable_reconfigure is set to true in the partition and controller modules.
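
As a minimal sketch of the warning above, the blueprint fragment below sets enable_reconfigure on both the partition and the controller so that partition changes are propagated to slurm.conf on the next terraform apply. The module IDs (network1, node_group_1, compute_partition, slurm_controller) are illustrative, and the controller source path assumes the module lives under community/modules/scheduler in the same repository.

```yaml
- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - node_group_1
  settings:
    partition_name: compute
    # Must match the controller setting below for reconfiguration to take effect.
    enable_reconfigure: true

- id: slurm_controller
  # Assumed path of the controller module; adjust to your checkout.
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use:
  - network1
  - compute_partition
  settings:
    enable_reconfigure: true
```

Note that, per the enable_reconfigure input description, toggling this setting destroys deployed compute nodes and requeues their jobs.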

Example

The following code snippet creates a partition module with:

- 2 node groups added via use.
  - The first node group is made up of machines of type c2-standard-30.
  - The second node group is made up of machines of type c2-standard-60.
  - Both node groups have a maximum count of 200 dynamically created nodes.
- partition name of "compute".
- connected to the network1 module via use.
- nodes mounted to homefs via use.

- id: node_group_1
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    name: c30
    node_count_dynamic_max: 200
    machine_type: c2-standard-30

- id: node_group_2
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    name: c60
    node_count_dynamic_max: 200
    machine_type: c2-standard-60

- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - homefs
  - node_group_1
  - node_group_2
  settings:
    partition_name: compute

For a complete example using this module, see slurm-gcp-v5-cluster.yaml.

Compute VM Zone Policies

The Slurm on GCP partition module allows you to specify additional zones in which to create VMs through bulk creation. This is valuable when a partition uses a popular VM family and you want access to more compute resources across zones.

WARNING: Lenient zone policies can lead to additional egress costs when moving large amounts of data between zones in the same region. Examples include traffic between VMs and traffic from VMs to shared filesystems such as Filestore. For more information on egress fees, see the Network Pricing Google Cloud documentation.

To avoid egress charges, ensure your compute nodes are created in a single zone by setting var.zone and leaving var.zones at its default value, the empty list.

NOTE: If a new zone is added to the region while the cluster is active, nodes in the partition may be created in that zone. In this case, the partition may need to be redeployed (possible via enable_reconfigure if set) to ensure the newly added zone is denied.

In the zonal example below, the partition's zone implicitly defaults to the deployment variable vars.zone:

vars:
  zone: us-central1-f

- id: zonal-partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition

In the example below, we enable creation in additional zones:

vars:
  zone: us-central1-f

- id: multi-zonal-partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  settings:
    zones:
    - us-central1-a
    - us-central1-b

Support

The Cluster Toolkit team maintains the wrapper around the slurm-on-gcp terraform modules. For support with the underlying modules, see the instructions in the slurm-gcp README.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

Name	Version
terraform	>= 0.13.0
google	>= 5.11

Providers

Name	Version
google	>= 5.11

Modules

Name	Source	Version
slurm_partition	github.com/GoogleCloudPlatform/slurm-gcp.git//terraform/slurm_cluster/modules/slurm_partition	5.12.0

Resources

Name	Type
google_compute_reservation.reservation	data source
google_compute_zones.available	data source

Inputs

Name	Description	Type	Default	Required
deployment_name	Name of the deployment.	`string`	n/a	yes
enable_placement	Enable placement groups.	`bool`	`true`	no
enable_reconfigure	Enables automatic Slurm reconfiguration when the Slurm configuration changes (e.g. slurm.conf.tpl, partition details). Compute instances and resource policies (e.g. placement groups) will be destroyed to align with the new configuration. NOTE: Requires Python and Google Pub/Sub API. WARNING: Toggling this will impact the running workload. Deployed compute nodes will be destroyed and their jobs will be requeued.	`bool`	`false`	no
exclusive	Exclusive job access to nodes.	`bool`	`true`	no
is_default	Sets this partition as the default partition by updating the partition_conf. If "Default" is already set in partition_conf, this variable will have no effect.	`bool`	`false`	no
network_storage	An array of network attached storage mounts to be configured on the partition compute nodes.	list(object({ server_ip = string, remote_mount = string, local_mount = string, fs_type = string, mount_options = string, client_install_runner = map(string) mount_runner = map(string) }))	`[]`	no
node_groups	A list of node groups associated with this partition. See schedmd-slurm-gcp-v5-node-group for more information on defining a node group in a blueprint.	list(object({ node_count_static = number node_count_dynamic_max = number group_name = string node_conf = map(string) access_config = list(object({ nat_ip = string network_tier = string })) additional_disks = list(object({ disk_name = string device_name = string disk_size_gb = number disk_type = string disk_labels = map(string) auto_delete = bool boot = bool })) additional_networks = list(object({ network = string subnetwork = string subnetwork_project = string network_ip = string nic_type = string stack_type = string queue_count = number access_config = list(object({ nat_ip = string network_tier = string })) ipv6_access_config = list(object({ network_tier = string })) alias_ip_range = list(object({ ip_cidr_range = string subnetwork_range_name = string })) })) bandwidth_tier = string can_ip_forward = bool disable_smt = bool disk_auto_delete = bool disk_labels = map(string) disk_size_gb = number disk_type = string enable_confidential_vm = bool enable_oslogin = bool enable_shielded_vm = bool enable_spot_vm = bool gpu = object({ count = number type = string }) instance_template = string labels = map(string) machine_type = string maintenance_interval = string metadata = map(string) min_cpu_platform = string on_host_maintenance = string preemptible = bool reservation_name = string service_account = object({ email = string scopes = list(string) }) shielded_instance_config = object({ enable_integrity_monitoring = bool enable_secure_boot = bool enable_vtpm = bool }) spot_instance_config = object({ termination_action = string }) source_image_family = string source_image_project = string source_image = string tags = list(string) }))	`[]`	no
partition_conf	Slurm partition configuration as a map. See https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION	`map(string)`	`{}`	no
partition_name	The name of the slurm partition.	`string`	n/a	yes
partition_startup_scripts_timeout	The timeout (seconds) applied to the partition startup script. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled.	`number`	`300`	no
project_id	Project in which the HPC deployment will be created.	`string`	n/a	yes
region	The default region for Cloud resources.	`string`	n/a	yes
slurm_cluster_name	Cluster name, used for resource naming and slurm accounting. If not provided it will default to the first 8 characters of the deployment name (removing any invalid characters).	`string`	`null`	no
startup_script	Startup script that will be used by the partition VMs.	`string`	`""`	no
subnetwork_project	The project the subnetwork belongs to.	`string`	`""`	no
subnetwork_self_link	Subnet to deploy to.	`string`	`null`	no
zone	Zone in which to create compute VMs. Additional zones in the same region can be specified in var.zones.	`string`	n/a	yes
zone_target_shape	Strategy for distributing VMs across zones in a region. ANY: GCE picks zones for creating VM instances to fulfill the requested number of VMs within present resource constraints and to maximize utilization of unused zonal reservations. ANY_SINGLE_ZONE (default): GCE always selects a single zone for all the VMs, optimizing for resource quotas, available reservations and general capacity. BALANCED: GCE prioritizes acquisition of resources, scheduling VMs in zones where resources are available while distributing VMs as evenly as possible across allowed zones to minimize the impact of zonal failure.	`string`	`"ANY_SINGLE_ZONE"`	no
zones	Additional zones in which to allow creation of partition nodes. Google Cloud will find a zone based on availability, quota and reservations.	`set(string)`	`[]`	no
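
Several of the optional inputs above can be combined in a single partition definition. The fragment below is a hedged sketch: the module IDs network1 and node_group_1 are assumed to be defined elsewhere in the blueprint, the timeout value is arbitrary, and MaxTime is used only as an example of a standard Slurm partition parameter passed through partition_conf.

```yaml
- id: default_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  use:
  - network1
  - node_group_1
  settings:
    partition_name: default
    # Mark this partition as the Slurm default; has no effect if
    # "Default" is already set in partition_conf.
    is_default: true
    # Allow startup scripts up to 10 minutes (module default is 300 seconds).
    partition_startup_scripts_timeout: 600
    # partition_conf is a map(string), so values are quoted.
    partition_conf:
      MaxTime: "08:00:00"
```

Any key accepted in the partition section of slurm.conf can in principle be supplied through partition_conf; consult the Slurm documentation linked in the table for valid parameters.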

Outputs

Name	Description
partition	Details of a slurm partition
