Cluster Toolkit is open-source software from Google Cloud that makes it easy for customers to deploy HPC environments on Google Cloud.
In this tutorial you will use the Cluster Toolkit to:
- Deploy a Slurm HPC cluster on Google Cloud
- Use Spack to install the OpenFOAM application and all of its dependencies
- Run an OpenFOAM job on your newly provisioned cluster
- Tear down the cluster
Estimated time to complete: about 3 hours, of which roughly 2.5 hours is spent installing software (without a Spack cache).
NOTE: With a complete Spack cache, the tutorial takes about 30 minutes.
Select a project in which to deploy an HPC cluster on Google Cloud.
Once you have selected a project, click START.
In a new Google Cloud project there are several APIs that must be enabled before
you can deploy your HPC cluster. Missing APIs will be flagged when you run ./gcluster create,
but you can save time by enabling them now:
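A typical set of services for a Slurm-based deployment is sketched below; the exact list depends on the blueprint, so treat these service names as an assumption and check your blueprint's requirements:

```shell
# Enable APIs commonly required by a Cluster Toolkit Slurm deployment.
# NOTE: this service list is an assumption, not the definitive set for
# this blueprint; ./gcluster create will report any that are missing.
gcloud services enable --project <walkthrough-project-id/> \
  compute.googleapis.com \
  serviceusage.googleapis.com \
  secretmanager.googleapis.com \
  file.googleapis.com
```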
We also need to grant the default compute service account editor access to the project so that the Slurm controller can perform actions such as auto-scaling.
PROJECT_NUMBER=$(gcloud projects list --filter=<walkthrough-project-id/> --format='value(PROJECT_NUMBER)')
echo "granting roles/editor to ${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
gcloud iam service-accounts enable --project <walkthrough-project-id/> ${PROJECT_NUMBER}-compute@developer.gserviceaccount.com
gcloud projects add-iam-policy-binding <walkthrough-project-id/> --member=serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com --role=roles/editor
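You can optionally confirm that the binding took effect; the query below is one way to list the members holding roles/editor (this check is a convenience, not part of the official tutorial):

```shell
# List all members bound to roles/editor on the project; the default
# compute service account should appear in the output.
gcloud projects get-iam-policy <walkthrough-project-id/> \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/editor" \
  --format="value(bindings.members)"
```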
To build the Cluster Toolkit binary from source, run:
make
You should now have a binary named gcluster in the current directory. To verify the build run:
./gcluster --version
This should show you the version of the Cluster Toolkit you are using.
This tutorial will use the blueprint docs/tutorials/openfoam/spack-openfoam.yaml, which should be open in the Cloud Shell Editor (on the left).
This file describes the cluster you will deploy. It defines:
- a VPC network
- a monitoring dashboard with metrics on your cluster
- a definition of a custom Spack installation
- a startup script that:
  - installs Ansible
  - installs Spack & OpenFOAM using the definition above
  - sets up a Spack environment, including downloading an example input deck
  - places a submission script on a shared drive
- a Slurm cluster with:
  - a Slurm login node
  - a Slurm controller
  - an auto-scaling Slurm partition
After you have inspected the file, use the gcluster binary to create a deployment folder by running:
./gcluster create docs/tutorials/openfoam/spack-openfoam.yaml --vars project_id=<walkthrough-project-id/>
NOTE: The `--vars` argument is used to override `project_id` in the deployment variables.

This will create a deployment directory named `spack-openfoam/`, which contains the Terraform needed to deploy your cluster.
Use the following command to deploy your cluster:
./gcluster deploy spack-openfoam
You can also use the following commands to generate and apply a plan describing the Google Cloud resources that will be deployed:
terraform -chdir=spack-openfoam/primary init
terraform -chdir=spack-openfoam/primary apply
Apply complete! Resources: xx added, 0 changed, 0 destroyed.
Although the cluster has been successfully deployed, the startup scripts that install Spack and OpenFOAM take additional time to complete. When run without a Spack cache, this installation takes about 2.5 hours (or about 6 minutes with a complete cache).
The following command will print logging from the startup script running on the controller. This command can be used to view progress and check for completion of the startup script:
gcloud compute instances get-serial-port-output --port 1 --zone us-central1-c --project <walkthrough-project-id/> spackopenf-controller | grep google_metadata_script_runner
When the startup script has finished running you will see the following line as the final output from the above command:
spackopenf-controller google_metadata_script_runner: Finished running startup scripts.
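Rather than re-running the command by hand, you can poll the serial port output until that line appears. A minimal sketch, assuming the same instance name, zone, and project as above (this loop is a convenience, not part of the official tutorial):

```shell
# Poll the controller's serial port output every 60 seconds until the
# startup scripts report completion.
until gcloud compute instances get-serial-port-output spackopenf-controller \
    --port 1 --zone us-central1-c --project <walkthrough-project-id/> \
    | grep -q "Finished running startup scripts"; do
  echo "Startup scripts still running; checking again in 60 s..."
  sleep 60
done
echo "Startup scripts finished."
```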
Optionally, while you wait, you can see your deployed VMs in the Google Cloud Console. Open the link below in a new window and look for `spackopenf-controller` and `spackopenf-login-login-001`. If you don't see your VMs, make sure you have the correct project selected (top left).
https://console.cloud.google.com/compute?project=<walkthrough-project-id/>
Once the startup script has completed, connect to the login node.
Use the following command to SSH into the login node from Cloud Shell:
gcloud compute ssh spackopenf-login-login-001 --zone us-central1-c --project <walkthrough-project-id/>
You may be prompted to set up SSH. If so, follow the prompts, and if asked for a password, just press [enter], leaving the input blank.
If the above command succeeded (and you see a Slurm printout in the console) then continue to the next page.
In some organizations you will not be able to SSH from Cloud Shell. If the above command fails, you can SSH into the VM through the Cloud Console UI using the following instructions:
- Open the following URL in a new tab. This will take you to Compute Engine > VM instances in the Google Cloud Console: https://console.cloud.google.com/compute?project=<walkthrough-project-id/>
- Click on the SSH button associated with the `spackopenf-login-login-001` instance. This will open a separate pop-up window with a terminal into your newly created Slurm login VM.
The commands below should be run on the Slurm login node.
We will use the submission script (see line 122 of the blueprint) to submit an OpenFOAM job.
- Make a directory in which to run the job:
mkdir test_run && cd test_run
- Submit the job to Slurm to be scheduled:
sbatch /opt/apps/openfoam/submit_openfoam.sh
- Once submitted, you can watch the job's progress by repeatedly calling the following command:
squeue
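Instead of calling `squeue` repeatedly by hand, you can have it refresh automatically; `watch` is typically available on the login node image, though that is an assumption about this setup:

```shell
# Re-run squeue every 10 seconds until interrupted with Ctrl-C.
watch -n 10 squeue
```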
The `sbatch` command triggers Slurm to auto-scale up several nodes to run the job.
You can refresh the Compute Engine > VM instances page and see that additional VMs are being/have been created. These will be named something like `spackopenf-comput-0`.

When running `squeue`, observe the job status start as `CF` (configuring), change to `R` (running) once the compute VMs have been created, and finally `CG` (completing) when the job has finished and the nodes are spooling down.

When `squeue` no longer shows any jobs, the job has finished. The whole job takes about 5 minutes to run.

NOTE: If the allocation fails, the message `salloc: PrologSlurmctld failed, job killed` most likely indicates that your project does not have sufficient quota for C2 instances in your region.
NOTE: If the Slurm controller is shut down before the auto-scale nodes are destroyed then they will be left running.
Several files will have been generated in the `test_run/` folder you created.
The `slurm-1.out` file has information on the run, such as performance. You can view this file by running the following command on the login node:
cat slurm-*.out
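If you only want the timing figures, OpenFOAM solvers typically print per-iteration `ExecutionTime`/`ClockTime` lines; assuming that solver output was captured in the Slurm log (which may vary by setup), you can filter for them:

```shell
# Pull just the OpenFOAM timing lines from the Slurm output, if present.
grep "ExecutionTime" slurm-*.out
```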
To view the monitoring dashboard containing metrics on your cluster, open the following URL in a new tab and click on the dashboard named `Cluster Toolkit Dashboard: spack-openfoam`.
https://console.cloud.google.com/monitoring/dashboards?project=<walkthrough-project-id/>
To avoid incurring ongoing charges, we will destroy the cluster.
First, return to your Cloud Shell terminal and run `exit` to close the SSH connection to the login node:

NOTE: If you are accessing the login node terminal via a separate pop-up window, make sure to run `exit` in that window.
exit
Run the following command in the cloud shell terminal to destroy the cluster:
./gcluster destroy spack-openfoam
When complete you should see something like:
Destroy complete! Resources: xx destroyed.
NOTE: If destroy is run before Slurm has shut down the auto-scale nodes, they will be left behind and the destroy may fail. In this case you can delete the VMs manually and rerun the destroy command above.