AWS Batch distributed training architectures

This architecture serves as an example for running distributed training jobs on p4d.24xlarge instances, but it can easily be modified to accommodate other instance types (Trn or other P instances).

Important: this architecture assumes that you have deployed the VPC template 2.vpc-one-az.yaml, because the Batch template automatically fetches the EFA security group (SG) ID and the subnet ID to set up the AWS Batch Compute Environment. Both the SG and the subnet are exported values from the VPC template.

This architecture consists of the following resources:

Template

This template deploys AWS Batch and EC2 resources. It can be deployed via the console or the AWS CLI. Regardless of the deployment method, it is assumed that you deployed the VPC template 2.vpc-one-az.yaml before deploying this one.
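Before deploying, it can help to confirm that the VPC stack exists and exposes the security group and subnet values as outputs. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the function name `show_vpc_stack_outputs` is hypothetical, and the exact output and export names depend on the VPC template, so none are assumed here.

```shell
# Hypothetical helper: print the outputs (and their export names) of the VPC stack.
# $1 is the name you gave the VPC stack when deploying 2.vpc-one-az.yaml.
show_vpc_stack_outputs() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[].[OutputKey,ExportName,OutputValue]" \
    --output table
}
```

For example, `show_vpc_stack_outputs aws-batch-vpc` would list the outputs of a VPC stack named `aws-batch-vpc` (a placeholder name). If the command fails, the stack likely does not exist in the current region or account.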

Quick Create


 1-Click Deploy 🚀 

List of Parameters

The template takes both mandatory and optional parameters; see the table below for details.

| Name | Type | Details |
|------|------|---------|
| VPCStackParameter | Required | Name of the VPC stack in CloudFormation. |
| AMI | Optional | ID of the AMI if using a custom one; otherwise leave blank. |
| CapacityReservationId | Optional | Use this or CapacityReservationResourceGroup to refer to an EC2 capacity reservation. |
| CapacityReservationResourceGroup | Optional | Use this or CapacityReservationId. |
| EC2KeyPair | Optional | EC2 key pair to use if you want to connect through SSH for debugging. |
| CreatePlacementGroup | Optional | Create a placement group for the instances. |

Deploy with the AWS CLI

If you'd like to deploy through the AWS CLI instead of the quick create link above, use the command shown below. Edit the parameter values to match your own configuration.

aws cloudformation create-stack --stack-name aws-batch-p5 \
                                --template-body file://0.aws-batch-distributed-training-p5.yaml \
                                --parameters ParameterKey=VPCStackParameter,ParameterValue="aws-batch-vpc" \
                                             ParameterKey=CapacityReservationId,ParameterValue="cr-1234567890" \
                                --capabilities CAPABILITY_NAMED_IAM
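Since `create-stack` returns as soon as the request is accepted, you may want to wait for creation to finish before using the Batch resources. A minimal sketch, assuming the AWS CLI is configured; the function name `wait_for_stack` is hypothetical and not part of the template:

```shell
# Hypothetical helper: block until stack creation finishes.
# $1 is the stack name that was passed to create-stack.
wait_for_stack() {
  if aws cloudformation wait stack-create-complete --stack-name "$1"; then
    echo "stack $1 created"
  else
    # The waiter fails on rollback or timeout; print the status for debugging.
    aws cloudformation describe-stacks --stack-name "$1" \
      --query "Stacks[0].StackStatus" --output text
    return 1
  fi
}
```

With the stack name used in the command above, this would be called as `wait_for_stack aws-batch-p5`.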

Gotchas

There are a few things to know as you evaluate this architecture:

  • EFA interfaces need to be declared explicitly in the EC2 Launch Template, and you need to provide the security group used for EFA.
  • The Compute Environment must retrieve the list of private subnets, which is exported by the VPC template.
  • The Batch Job Definition assumes you are pushing a container with stress-ng and is pre-configured as such.
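To check the first point after deployment, you can inspect which interface types the launch template declares. This is a sketch only: the function name `show_interface_types` is hypothetical, and the launch template name is not given in this document, so it is passed as an argument.

```shell
# Hypothetical helper: list the interface types declared in a launch template.
# $1 is the name of the launch template created by the stack (look it up in
# the EC2 console or the stack resources; it is not assumed here).
show_interface_types() {
  aws ec2 describe-launch-template-versions \
    --launch-template-name "$1" \
    --versions '$Latest' \
    --query "LaunchTemplateVersions[0].LaunchTemplateData.NetworkInterfaces[].InterfaceType" \
    --output text
}
```

Interfaces declared as EFA should be reported with the type `efa`; any interface reported with a different type was not declared as an EFA interface.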

Architecture Diagram