Param benchmark is a PyTorch benchmark for computation (GEMM, MLP, EmbeddingBag), communication (NCCL, DLRMs, TraceReplay), workloads (DLRM, training). This guide only addresses communications but serves as an template for other tests.
This guide assumes that you have the following:
- A functional Slurm cluster on AWS.
- Docker, Pyxis and Enroot installed.
- A shared directory mounted on
/apps
It is recommended that you use the templates in the architectures directory
The AWS Deep learning containers is used as a base for this project and the Param benchmark is built on top of it following the steps below. It is assumed that you copied the assets (Dockerfile
and sbatch
file) to your cluster.
- Login to AWS ECR with command bellow
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
- Build the container image with the command below
docker build -t param-benchmark -f 0.param-benchmark.Dockerfile .
- Convert the container image to a squash file via Enroot
The file will be stored in the
enroot import -o /apps/param-benchmark.sqsh dockerd://param-benchmark:latest
/apps
directory.
Note: We use specific commit due to lack of versioning of param benchmark, which would otherwise be used to pin the dependency.
Ensure that the submission file 1.param-benchmark.sbatch
has been copied to your cluster and your shell points to the directory where this file is present. Run the Param benchmark on 2 nodes as follows:
sbatch -N 2 --export=NONE 1.param-benchmark-comms.sbatch
The command will return a job ID. If you stay in the same directory where sbatch
was called, assuming that your job is executing (or has executed) you will find a output file for your job names param_benchmark_<ID>.out
where ID corresponds to your job ID.
We add --export=NONE
parameter to make sure any conflicting environment variable from AMI is not exported to container.
Note: the number of nodes used for the job is defined via the command line with the option and argument
-N 2
. An alternative is to set it in thesbatch
file as the directive#SBATCH -N 2
.
You will see NCCL test outputs in the logs (in your local directory), refer to NCCL-tests documentation to understand the difference between AlbBW
and BusBw
. Review the Param documentation for other CLI parameters and other benchmarks in the repository for more communication and computation tests (example here or here).
Param contains several tests that you can use to evaluate your system. Run the command below to execute single node compute tests for GEMM (Matrix Multiply), EmbeddingBag and MLP.
sbatch -N 1 --export=NONE 2.param-benchmark-compute.sbatch
You should see an output similar to the sample below (MatMult):
0: Measuring the performance of gemm on device = gpu
0: Steps = 100 warmups = 10
0: with matrix dataset A , Data type: float32
0:
0: ----------------------------------------------------------------
0: M N K Time(s) Rate(TF/s)
0: ----------------------------------------------------------------
0: 128, 4096, 4096, 0.000326 13.158
0: 256, 4096, 4096, 0.000610 14.077
0: 512, 4096, 4096, 0.001197 14.347
0: 1024, 4096, 4096, 0.001845 18.624
0: 128, 1024, 1024, 0.000030 8.803
0: 256, 1024, 1024, 0.000048 11.107
0: 512, 1024, 1024, 0.000078 13.700
0: 1024, 1024, 1024, 0.000130 16.558
0: 4096, 4096, 128, 0.000238 18.022
0: 4096, 4096, 256, 0.000462 18.584
0: 4096, 4096, 512, 0.000912 18.829
0: 4096, 4096, 1024, 0.001814 18.942
0: 1024, 1024, 128, 0.000026 10.282
0: 1024, 1024, 256, 0.000040 13.291
0: 1024, 1024, 512, 0.000071 15.208
- [A] Uros Lipovsek - lipovsek@
- [R] Pierre-Yves Aquilanti - pierreya@