slurm_cluster

Copyright (C) SchedMD LLC.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Requirements

| Name | Version |
|------|---------|
| terraform | ~> 1.3 |
| google | >= 3.53, < 5.0 |
| random | ~> 3.0 |
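
The constraints above could be pinned in a root configuration roughly as follows. This is a minimal sketch: the version strings come from the table, while the registry source addresses (`hashicorp/google`, `hashicorp/random`) are assumptions, since the table does not state them.

```hcl
terraform {
  # Version constraint taken from the Requirements table.
  required_version = "~> 1.3"

  required_providers {
    google = {
      source  = "hashicorp/google" # assumed registry address
      version = ">= 3.53, < 5.0"
    }
    random = {
      source  = "hashicorp/random" # assumed registry address
      version = "~> 3.0"
    }
  }
}
```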

Providers

| Name | Version |
|------|---------|
| google | >= 3.53, < 5.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| bucket | terraform-google-modules/cloud-storage/google | ~> 3.0 |
| slurm_controller_hybrid | ./modules/slurm_controller_hybrid | n/a |
| slurm_controller_instance | ./modules/slurm_controller_instance | n/a |
| slurm_controller_template | ./modules/slurm_instance_template | n/a |
| slurm_files | ./modules/slurm_files | n/a |
| slurm_login_instance | ./modules/slurm_login_instance | n/a |
| slurm_login_template | ./modules/slurm_instance_template | n/a |
| slurm_nodeset | ./modules/slurm_nodeset | n/a |
| slurm_nodeset_dyn | ./modules/slurm_nodeset_dyn | n/a |
| slurm_nodeset_template | ./modules/slurm_instance_template | n/a |
| slurm_nodeset_tpu | ./modules/slurm_nodeset_tpu | n/a |
| slurm_partition | ./modules/slurm_partition | n/a |

Resources

| Name | Type |
|------|------|
| google_storage_bucket_iam_binding.legacyReaders | resource |
| google_storage_bucket_iam_binding.viewers | resource |
| google_compute_subnetwork.nodeset_subnetwork | data source |

Inputs

Name Description Type Default Required
bucket_dir Bucket directory for cluster files to be put into. If not specified, then one will be chosen based on slurm_cluster_name. string null no
bucket_name Name of GCS bucket.
Ignored when 'create_bucket' is true.
string null no
cgroup_conf_tpl Slurm cgroup.conf template file path. string null no
cloud_parameters cloud.conf options.
object({
no_comma_params = optional(bool, false)
resume_rate = optional(number, 0)
resume_timeout = optional(number, 300)
suspend_rate = optional(number, 0)
suspend_timeout = optional(number, 300)
})
{} no
cloudsql Use this database instead of the one on the controller.
* server_ip : Address of the database server.
* user : The user to access the database as.
* password : The password for the given user to access the database. (sensitive)
* db_name : The database to access.
object({
server_ip = string
user = string
password = string # sensitive
db_name = string
})
null no
compute_startup_scripts List of scripts to be run on compute VM startup.
list(object({
filename = string
content = string
}))
[] no
compute_startup_scripts_timeout The timeout (seconds) applied to each script in compute_startup_scripts. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
controller_hybrid_config Creates a hybrid controller with given configuration.
See 'main.tf' for valid keys.
object({
google_app_cred_path = optional(string)
slurm_control_host = optional(string)
slurm_control_host_port = optional(string)
slurm_control_addr = optional(string)
slurm_bin_dir = optional(string)
slurm_log_dir = optional(string)
output_dir = optional(string)
install_dir = optional(string)
munge_mount = optional(object({
server_ip = optional(string)
remote_mount = optional(string, "/etc/munge")
fs_type = optional(string, "nfs")
mount_options = optional(string)
}), {})
})
{} no
controller_instance_config Creates a controller instance with given configuration.
object({
additional_disks = optional(list(object({
disk_name = optional(string)
device_name = optional(string)
disk_size_gb = optional(number)
disk_type = optional(string)
disk_labels = optional(map(string), {})
auto_delete = optional(bool, true)
boot = optional(bool, false)
})), [])
bandwidth_tier = optional(string, "platform_default")
can_ip_forward = optional(bool, false)
disable_smt = optional(bool, false)
disk_auto_delete = optional(bool, true)
disk_labels = optional(map(string), {})
disk_size_gb = optional(number)
disk_type = optional(string, "n1-standard-1")
enable_confidential_vm = optional(bool, false)
enable_public_ip = optional(bool, false)
enable_oslogin = optional(bool, true)
enable_shielded_vm = optional(bool, false)
gpu = optional(object({
count = number
type = string
}))
instance_template = optional(string)
labels = optional(map(string), {})
machine_type = optional(string)
metadata = optional(map(string), {})
min_cpu_platform = optional(string)
network_ip = optional(string)
network_tier = optional(string, "STANDARD")
on_host_maintenance = optional(string)
preemptible = optional(bool, false)
region = optional(string)
service_account = optional(object({
email = optional(string)
scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])
}))
shielded_instance_config = optional(object({
enable_integrity_monitoring = optional(bool, true)
enable_secure_boot = optional(bool, true)
enable_vtpm = optional(bool, true)
}))
source_image_family = optional(string)
source_image_project = optional(string)
source_image = optional(string)
spot = optional(bool, false)
static_ip = optional(string)
subnetwork_project = optional(string)
subnetwork = optional(string)
tags = optional(list(string), [])
termination_action = optional(string)
zone = optional(string)
})
{} no
controller_startup_scripts List of scripts to be run on controller VM startup.
list(object({
filename = string
content = string
}))
[] no
controller_startup_scripts_timeout The timeout (seconds) applied to each script in controller_startup_scripts. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
create_bucket Create GCS bucket instead of using an existing one. bool true no
disable_default_mounts Disable the default global network storage mounts from the controller:
* /usr/local/etc/slurm
* /etc/munge
* /home
* /apps
If these are disabled, the slurm etc and munge dirs must be added manually,
or some other mechanism must be used to synchronize the slurm conf files
and the munge key across the cluster.
bool false no
enable_bigquery_load Enables loading of cluster job usage data into BigQuery.

NOTE: Requires the Google BigQuery API.
bool false no
enable_cleanup_compute Enables automatic cleanup of compute nodes and resource policies (e.g.
placement groups) managed by this module when the cluster is destroyed.

NOTE: Requires Python and script dependencies.

WARNING: Toggling this may impact the running workload. Deployed compute nodes
may be destroyed and their jobs will be requeued.
bool false no
enable_debug_logging Enables debug logging mode. Not for production use. bool false no
enable_devel Enables development mode. Not for production use. bool false no
enable_hybrid Enables use of hybrid controller mode. When true, controller_hybrid_config is
used instead of controller_instance_config, and login instances are disabled.
bool false no
enable_login Enables the creation of login nodes and instance templates. bool true no
enable_slurm_gcp_plugins Enables calling hooks in scripts/slurm_gcp_plugins during cluster resume and suspend. any false no
epilog_scripts List of scripts to be used for Epilog. Programs for the slurmd to execute
on every node when a user's job completes.
See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog.
list(object({
filename = string
content = string
}))
[] no
extra_logging_flags The list of extra flags for the logging system to use. See the logging_flags variable in scripts/util.py to get the list of supported log flags. map(bool) {} no
login_network_storage Storage to be mounted on login and controller instances.
* server_ip : Address of the storage server.
* remote_mount : The location in the remote instance filesystem to mount from.
* local_mount : The location on the instance filesystem to mount to.
* fs_type : Filesystem type (e.g. "nfs").
* mount_options : Options to mount with.
list(object({
server_ip = string
remote_mount = string
local_mount = string
fs_type = string
mount_options = string
}))
[] no
login_nodes List of slurm login instance definitions.
list(object({
additional_disks = optional(list(object({
disk_name = optional(string)
device_name = optional(string)
disk_size_gb = optional(number)
disk_type = optional(string)
disk_labels = optional(map(string), {})
auto_delete = optional(bool, true)
boot = optional(bool, false)
})), [])
bandwidth_tier = optional(string, "platform_default")
can_ip_forward = optional(bool, false)
disable_smt = optional(bool, false)
disk_auto_delete = optional(bool, true)
disk_labels = optional(map(string), {})
disk_size_gb = optional(number)
disk_type = optional(string, "n1-standard-1")
enable_confidential_vm = optional(bool, false)
enable_public_ip = optional(bool, false)
enable_oslogin = optional(bool, true)
enable_shielded_vm = optional(bool, false)
gpu = optional(object({
count = number
type = string
}))
group_name = string
instance_template = optional(string)
labels = optional(map(string), {})
machine_type = optional(string)
metadata = optional(map(string), {})
min_cpu_platform = optional(string)
network_tier = optional(string, "STANDARD")
num_instances = optional(number, 1)
on_host_maintenance = optional(string)
preemptible = optional(bool, false)
region = optional(string)
service_account = optional(object({
email = optional(string)
scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])
}))
shielded_instance_config = optional(object({
enable_integrity_monitoring = optional(bool, true)
enable_secure_boot = optional(bool, true)
enable_vtpm = optional(bool, true)
}))
source_image_family = optional(string)
source_image_project = optional(string)
source_image = optional(string)
static_ips = optional(list(string), [])
subnetwork_project = optional(string)
subnetwork = optional(string)
spot = optional(bool, false)
tags = optional(list(string), [])
zone = optional(string)
termination_action = optional(string)
}))
[] no
login_startup_scripts List of scripts to be run on login VM startup.
list(object({
filename = string
content = string
}))
[] no
login_startup_scripts_timeout The timeout (seconds) applied to each script in login_startup_scripts. If
any script exceeds this timeout, then the instance setup process is considered
failed and handled accordingly.

NOTE: When set to 0, the timeout is considered infinite and thus disabled.
number 300 no
network_storage Storage to be mounted on all instances.
* server_ip : Address of the storage server.
* remote_mount : The location in the remote instance filesystem to mount from.
* local_mount : The location on the instance filesystem to mount to.
* fs_type : Filesystem type (e.g. "nfs").
* mount_options : Options to mount with.
list(object({
server_ip = string
remote_mount = string
local_mount = string
fs_type = string
mount_options = string
}))
[] no
nodeset Define nodesets, as a list.
list(object({
node_count_static = optional(number, 0)
node_count_dynamic_max = optional(number, 1)
node_conf = optional(map(string), {})
nodeset_name = string
additional_disks = optional(list(object({
disk_name = optional(string)
device_name = optional(string)
disk_size_gb = optional(number)
disk_type = optional(string)
disk_labels = optional(map(string), {})
auto_delete = optional(bool, true)
boot = optional(bool, false)
})), [])
bandwidth_tier = optional(string, "platform_default")
can_ip_forward = optional(bool, false)
disable_smt = optional(bool, false)
disk_auto_delete = optional(bool, true)
disk_labels = optional(map(string), {})
disk_size_gb = optional(number)
disk_type = optional(string)
enable_confidential_vm = optional(bool, false)
enable_placement = optional(bool, false)
enable_public_ip = optional(bool, false)
enable_oslogin = optional(bool, true)
enable_shielded_vm = optional(bool, false)
gpu = optional(object({
count = number
type = string
}))
instance_template = optional(string)
labels = optional(map(string), {})
machine_type = optional(string)
metadata = optional(map(string), {})
min_cpu_platform = optional(string)
network_tier = optional(string, "STANDARD")
on_host_maintenance = optional(string)
preemptible = optional(bool, false)
region = optional(string)
reservation_name = optional(string)
service_account = optional(object({
email = optional(string)
scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])
}))
shielded_instance_config = optional(object({
enable_integrity_monitoring = optional(bool, true)
enable_secure_boot = optional(bool, true)
enable_vtpm = optional(bool, true)
}))
source_image_family = optional(string)
source_image_project = optional(string)
source_image = optional(string)
subnetwork_project = optional(string)
subnetwork = optional(string)
spot = optional(bool, false)
tags = optional(list(string), [])
termination_action = optional(string)
zones = optional(list(string), [])
zone_target_shape = optional(string, "ANY_SINGLE_ZONE")
}))
[] no
nodeset_dyn Define dynamic nodesets, as a list.
list(object({
nodeset_name = string
nodeset_feature = string
}))
[] no
nodeset_tpu Define TPU nodesets, as a list.
list(object({
node_count_static = optional(number, 0)
node_count_dynamic_max = optional(number, 1)
nodeset_name = string
enable_public_ip = optional(bool, false)
node_type = optional(string)
accelerator_config = optional(object({
topology = string
version = string
}), {
topology = ""
version = ""
})
tf_version = string
preemptible = optional(bool, false)
preserve_tpu = optional(bool, true)
zone = string
data_disks = optional(list(string), [])
docker_image = optional(string, "")
subnetwork = optional(string, "")
service_account = optional(object({
email = optional(string)
scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])
}))
}))
[] no
partitions Cluster partitions as a list. See module slurm_partition.
list(object({
default = optional(bool, false)
enable_job_exclusive = optional(bool, false)
network_storage = optional(list(object({
server_ip = string
remote_mount = string
local_mount = string
fs_type = string
mount_options = string
})), [])
partition_conf = optional(map(string), {})
partition_name = string
partition_nodeset = optional(list(string), [])
partition_nodeset_dyn = optional(list(string), [])
partition_nodeset_tpu = optional(list(string), [])
resume_timeout = optional(number)
suspend_time = optional(number, 300)
suspend_timeout = optional(number)
}))
n/a yes
project_id Project ID to create resources in. string n/a yes
prolog_scripts List of scripts to be used for Prolog. Programs for the slurmd to execute
whenever it is asked to run a job step from a new job allocation.
See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog.
list(object({
filename = string
content = string
}))
[] no
region The default region to place resources in. string n/a yes
slurm_cluster_name Cluster name, used for resource naming and slurm accounting. string n/a yes
slurm_conf_tpl Slurm slurm.conf template file path. string null no
slurmdbd_conf_tpl Slurm slurmdbd.conf template file path. string null no
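
As a rough illustration of the inputs marked as required above (`partitions`, `project_id`, `region`, `slurm_cluster_name`), a module call might look like the sketch below. The `source` path, project ID, region, names, machine type, and subnetwork are all placeholder assumptions, and it is assumed that `partition_nodeset` refers to nodesets by their `nodeset_name`. A real deployment may need further inputs (for example, network settings in `controller_instance_config`).

```hcl
module "slurm_cluster" {
  # Assumption: local path to this module; adjust to where it actually lives.
  source = "./slurm_cluster"

  # Required inputs (placeholder values).
  project_id         = "my-project-id"
  region             = "us-central1"
  slurm_cluster_name = "demo"

  # One nodeset with no static nodes and up to four dynamic nodes.
  nodeset = [
    {
      nodeset_name           = "n1"
      node_count_static      = 0
      node_count_dynamic_max = 4
      machine_type           = "n1-standard-2" # placeholder
      subnetwork             = "default"       # placeholder
    },
  ]

  # One default partition that uses the nodeset defined above.
  partitions = [
    {
      partition_name    = "debug"
      partition_nodeset = ["n1"] # assumed to match nodeset_name
      default           = true
    },
  ]
}
```

Attributes declared with `optional(...)` in the type signatures can be omitted, in which case Terraform fills in the listed default (or `null` when none is given).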

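Some of the optional inputs could be set along these lines. The fragment is written in `.tfvars` style and assumes the root module declares variables with the same names and passes them through to this module; the bucket name, IP address, paths, and script contents are hypothetical.

```hcl
# Use an existing bucket instead of creating one.
# bucket_name is ignored when create_bucket is true.
create_bucket = false
bucket_name   = "my-existing-bucket" # hypothetical bucket

# Run one script on compute VM startup, with a 10-minute timeout per script.
compute_startup_scripts = [
  {
    filename = "hello.sh"
    content  = <<-EOT
      #!/bin/bash
      echo "compute node setup complete"
    EOT
  },
]
compute_startup_scripts_timeout = 600

# Mount an NFS export on all instances. Every attribute is required by the type.
network_storage = [
  {
    server_ip     = "10.0.0.2"      # hypothetical NFS server
    remote_mount  = "/exports/data" # hypothetical export
    local_mount   = "/data"
    fs_type       = "nfs"
    mount_options = "defaults"
  },
]
```
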
Outputs

| Name | Description |
|------|-------------|
| cloud_logging_filter | Cloud Logging filter to find startup errors. |
| cluster_config | Slurm partition details. |
| slurm_bucket_path | Bucket path used by cluster. |
| slurm_cluster_name | Slurm cluster name. |
| slurm_controller_instance_details | Slurm controller instance details. |
| slurm_controller_instance_self_links | Slurm controller instance self_link. |
| slurm_controller_instances | Slurm controller instance object details. |
| slurm_login_instance_details | Slurm login instance details. |
| slurm_login_instance_self_links | Slurm login instance self_link. |
| slurm_nodeset | Slurm nodeset details. |
| slurm_nodeset_dyn | Slurm nodeset (dynamic) details. |
| slurm_partition | Slurm partition details. |
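
Outputs from this module can be re-exported or consumed by a calling configuration. A minimal sketch, assuming the module was instantiated under the (hypothetical) name `slurm_cluster`:

```hcl
# Re-export the bucket path so it appears in `terraform output`.
output "slurm_bucket_path" {
  description = "Bucket path used by the cluster."
  value       = module.slurm_cluster.slurm_bucket_path
}

# Expose the Cloud Logging filter for locating startup errors.
output "cloud_logging_filter" {
  description = "Cloud Logging filter to find startup errors."
  value       = module.slurm_cluster.cloud_logging_filter
}
```

If Terraform reports that one of these values is sensitive, the re-exporting `output` block would also need `sensitive = true`.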