This is a sample solution based on Grafana for monitoring various components of an HPC cluster built with AWS ParallelCluster. There are 7 dashboards that can be used as they are or customized as you need.
- ParallelCluster Summary - this is the main dashboard that shows general monitoring info and metrics for the whole cluster. It includes Slurm metrics and Storage performance metrics.
- Master Node Details - this dashboard shows detailed metrics for the Master node, including CPU, Memory, Network and Storage usage.
- Compute Node List - this dashboard shows the list of the available compute nodes. Each entry is a link to a more detailed page.
- Compute Node Details - similar to the Master Node Details, this dashboard shows the same metrics for the compute nodes.
- GPU Nodes Details - This dashboard shows GPU-related metrics collected using the nvidia-dcgm container.
- Cluster Logs - This dashboard shows all the logs of your HPC Cluster. The logs are pushed by AWS ParallelCluster to Amazon CloudWatch Logs and finally reported here.
- Cluster Costs (beta / in development) - This dashboard shows the costs associated with the AWS services used by your cluster. It includes: EC2, EBS, FSx, S3, EFS.
Create a cluster using AWS ParallelCluster and include the following configuration:
[cluster yourcluster]
...
post_install = https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-monitoring/main/post-install.sh
post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
tags = {"Grafana" : "true"}
...
AWS ParallelCluster is an AWS supported Open Source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters in the AWS cloud. It automatically sets up the required compute resources and a shared filesystem and offers a variety of batch schedulers such as AWS Batch, SGE, Torque, and Slurm.
- More info on: https://aws.amazon.com/hpc/parallelcluster/
- Source Code on GitHub: https://github.com/aws/aws-parallelcluster
- Official Documentation: https://docs.aws.amazon.com/parallelcluster/
This project is built with the following components:
- Grafana is an open-source platform for monitoring and observability. Grafana allows you to query, visualize, alert on and understand your metrics as well as create, explore, and share dashboards fostering a data driven culture.
- Prometheus is an open-source project for systems and service monitoring from the Cloud Native Computing Foundation. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts if some condition is observed to be true.
- The Prometheus Pushgateway is an open-source tool that allows ephemeral and batch jobs to expose their metrics to Prometheus.
- Nginx is an HTTP and reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy server.
- Prometheus-Slurm-Exporter is a Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.
- Node_exporter is a Prometheus exporter for hardware and OS metrics exposed by *NIX kernels, written in Go with pluggable metric collectors.
Note: almost all components are released under the Apache2 license; only Prometheus-Slurm-Exporter is licensed under GPLv3. Make sure you are aware of this and accept the license terms before installing this component.
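As an illustration of how a batch job could expose a metric through the Pushgateway, here is a minimal sketch. It assumes the Pushgateway is reachable at http://localhost:9091 (its default port; this solution may expose it differently). The function name `push_metric`, the `PUSHGATEWAY_URL` variable, and the job and metric names are hypothetical and not part of this project.

```shell
#!/bin/bash
# Minimal sketch: push one sample to a Prometheus Pushgateway.
# Assumption: the gateway listens on localhost:9091 unless PUSHGATEWAY_URL is set.
push_metric() {
  job="$1"      # grouping key used by the Pushgateway
  name="$2"     # metric name
  value="$3"    # numeric sample value
  # The request body uses the Prometheus text exposition format: "<name> <value>"
  printf '%s %s\n' "$name" "$value" |
    curl --silent --show-error --fail --data-binary @- \
      "${PUSHGATEWAY_URL:-http://localhost:9091}/metrics/job/${job}"
}
```

A job script could then call, for example, `push_metric my_batch_job job_duration_seconds 42` once it finishes.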
You can simply use the post-install script that you can find here as it is, or customize it as you need. For instance, you might want to change your Grafana password to something more secure and meaningful for you, or you might want to customize some dashboards by adding additional components to monitor.
#Load AWS Parallelcluster environment variables
. /etc/parallelcluster/cfnconfig
#get GitHub repo to clone and the installation script
monitoring_url=$(echo ${cfn_postinstall_args}| cut -d ',' -f 1 )
monitoring_dir_name=$(echo ${cfn_postinstall_args}| cut -d ',' -f 2 )
monitoring_tarball="${monitoring_dir_name}.tar.gz"
setup_command=$(echo ${cfn_postinstall_args}| cut -d ',' -f 3 )
monitoring_home="/home/${cfn_cluster_user}/${monitoring_dir_name}"
case ${cfn_node_type} in
    MasterServer)
        wget ${monitoring_url} -O ${monitoring_tarball}
        mkdir -p ${monitoring_home}
        tar xvf ${monitoring_tarball} -C ${monitoring_home} --strip-components 1
    ;;
    ComputeFleet)
    ;;
esac
#Execute the monitoring installation script
bash -x "${monitoring_home}/parallelcluster-setup/${setup_command}" >/tmp/monitoring-setup.log 2>&1
exit $?
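To make the argument handling above easier to follow, the sketch below runs the same `cut` calls against a sample value. The values assigned to `cfn_postinstall_args` and `cfn_cluster_user` are assumptions for illustration only; on a real node they are loaded from /etc/parallelcluster/cfnconfig.

```shell
#!/bin/bash
# Hypothetical sample values; on a cluster node these come from /etc/parallelcluster/cfnconfig.
cfn_postinstall_args="https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh"
cfn_cluster_user="ec2-user"   # assumption: default user on Amazon Linux 2

# Same parsing as the post-install script: three comma-separated fields
monitoring_url=$(echo "${cfn_postinstall_args}" | cut -d ',' -f 1)
monitoring_dir_name=$(echo "${cfn_postinstall_args}" | cut -d ',' -f 2)
setup_command=$(echo "${cfn_postinstall_args}" | cut -d ',' -f 3)
monitoring_tarball="${monitoring_dir_name}.tar.gz"
monitoring_home="/home/${cfn_cluster_user}/${monitoring_dir_name}"

# Show how each field is used by the script
echo "tarball URL   : ${monitoring_url}"
echo "local tarball : ${monitoring_tarball}"
echo "install dir   : ${monitoring_home}"
echo "setup script  : ${monitoring_home}/parallelcluster-setup/${setup_command}"
```

The order of the three fields in post_install_args therefore matters: tarball URL first, then the directory name, then the setup script name.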
The proposed post-install script takes care of installing and configuring everything for you through the install-monitoring.sh script. However, a few additional parameters are needed in the AWS ParallelCluster config file: post_install_args, additional IAM policies, a security group, and a tag. You can find an AWS ParallelCluster template here. Please note that, at the moment, the installation script has only been tested on Amazon Linux 2.
base_os = alinux2
post_install = s3://<my-bucket-name>/post-install.sh
post_install_args = https://github.com/aws-samples/aws-parallelcluster-monitoring/tarball/main,aws-parallelcluster-monitoring,install-monitoring.sh
additional_iam_policies = arn:aws:iam::aws:policy/CloudWatchFullAccess,arn:aws:iam::aws:policy/AWSPriceListServiceFullAccess,arn:aws:iam::aws:policy/AmazonSSMFullAccess,arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
tags = {"Grafana" : "true"}
Make sure that port 80 and port 443 of your master node are accessible from the internet (or from your network). You can achieve this by creating the appropriate security group via the AWS Web Console or via the CLI; see an example below:
aws ec2 create-security-group --group-name my-grafana-sg --description "Open HTTP/HTTPS ports" --vpc-id vpc-1a2b3c4d
aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-12345 --protocol tcp --port 80 --cidr 0.0.0.0/0
More information on how to create your security groups here.
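The commands above use a placeholder group ID (sg-12345). As a hedged sketch, the same steps can be wrapped in a function that captures the real group ID returned by `create-security-group` and reuses it for both ingress rules. The function name `open_grafana_ports` is hypothetical, and passing a CIDR narrower than 0.0.0.0/0 is a suggestion for the "from your network" case, not a requirement of this solution.

```shell
#!/bin/bash
# Sketch: create a security group for Grafana and open ports 80 and 443.
# Prints the new group ID on success. Function and group names are placeholders.
open_grafana_ports() {
  vpc_id="$1"   # e.g. vpc-1a2b3c4d
  cidr="$2"     # e.g. your own network range instead of 0.0.0.0/0
  # --query/--output text return only the GroupId field of the response
  sg_id=$(aws ec2 create-security-group \
    --group-name my-grafana-sg \
    --description "Open HTTP/HTTPS ports" \
    --vpc-id "$vpc_id" \
    --query 'GroupId' --output text) || return 1
  for port in 80 443; do
    aws ec2 authorize-security-group-ingress \
      --group-id "$sg_id" --protocol tcp --port "$port" --cidr "$cidr" \
      >/dev/null || return 1
  done
  echo "$sg_id"
}
```

For example, `open_grafana_ports vpc-1a2b3c4d 203.0.113.0/24` would print the ID of the new group, which is the value to use for additional_sg.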
Finally, set the additional_sg parameter in the [VPC] section of your ParallelCluster config file.
After your cluster is created, open a web browser and connect to https://your_public_ip or http://your_public_ip (all http connections are automatically redirected to https). A landing page will be presented with links to the Prometheus database service and the Grafana dashboards.
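If you prefer to check the redirect from a terminal before opening a browser, a small sketch like the following could help. The function name `redirects_to_https` is hypothetical; it only inspects the redirect target reported by curl for a plain http request and does not validate the certificate or the landing page itself.

```shell
#!/bin/bash
# Sketch: verify that http://<host>/ answers with a redirect to an https:// URL.
redirects_to_https() {
  host="$1"   # public IP or DNS name of the master node
  # %{redirect_url} is the URL curl would follow for a redirect response (empty if none)
  target=$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{redirect_url}' "http://${host}/")
  case "$target" in
    https://*) echo "redirects to ${target}"; return 0 ;;
    *)         echo "no HTTPS redirect from http://${host}/"; return 1 ;;
  esac
}
```

Calling `redirects_to_https your_public_ip` returns a zero exit status only when the redirect points to an https URL.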
Note: Because the compute nodes continuously push metrics to the master node, network traffic on the master node is higher than usual. If you expect to run a large-scale cluster (hundreds of instances), we recommend using a slightly larger instance type for your master node than you originally planned.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.