This repo contains scripts and instructions to create your own Cloud Layer for efficiently running and scaling your DL training jobs.
Recommended setup:
- Use the sharded versions of CIFAR10 and ImageNet that are already available at the locations below (see the listing example after this list).
  - CIFAR10: http://storage.googleapis.com/lpr-demo
  - ImageNet: http://storage.googleapis.com/lpr-imagenet
- Create a Kubernetes cluster that meets your needs: GPU type (K80, P100, V100) and number of GPUs per node (1, 2, 4, 8).
- Create a Docker image of your trainer.
- Use the deploy script to launch your trainer (explained later).
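
If you want to peek at the available shards before wiring up your trainer, you can list the public buckets with `gsutil` (bundled with the Google Cloud SDK). The bucket names follow from the URLs above; this assumes the buckets allow anonymous listing.

```sh
# List the dataset shards in the public buckets.
gsutil ls gs://lpr-demo        # CIFAR10 shards
gsutil ls gs://lpr-imagenet    # ImageNet shards
```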
Repository contents:

- `dltrainer`: contains the Dockerfile and Kubernetes deployment file for the PyTorch trainer.
- `cache-server`: contains the Dockerfile and Kubernetes deployment file for the NGINX cache server.
- `ku`: script containing the commands to set up a Kubernetes cluster on Google Cloud.
- `kube-cluster-config.sh`: script containing the parameters that can be configured to customize the Kubernetes setup.
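
As a rough illustration of what you can customize, a config script of this kind might look like the sketch below. The variable names here are assumptions for illustration only (apart from `GCLOUD_PROJECT_NAME`, which is referenced later in this README); consult `kube-cluster-config.sh` in the repo for the actual parameters.

```sh
# Hypothetical sketch of kube-cluster-config.sh; the real parameter names
# live in the script itself.
GCLOUD_PROJECT_NAME=my-gcp-project   # GCP project that will host the cluster
CLUSTER_NAME=dl-training-cluster     # name of the GKE cluster to create
ZONE=us-central1-a                   # zone with the desired GPU availability
GPU_TYPE=nvidia-tesla-k80            # K80, P100, or V100
GPUS_PER_NODE=4                      # 1, 2, 4, or 8
NUM_NODES=2                          # number of nodes in the GPU node pool
```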
Before proceeding, please ensure you have the following packages installed locally (follow the installation instructions available online):
- Docker
- Google Cloud SDK: after installing the Google Cloud SDK, run `gcloud init`
- Set the parameters in `kube-cluster-config.sh`.
- Run `./ku init` and do not interrupt it midway. You can sanity-check the result with the commands shown after this list. The command does the following:
  - Creates a Kubernetes cluster
  - Creates a GPU node pool, with each node containing the requested number of GPUs
  - Installs the cache-server and deploys it into the cluster
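
Once `./ku init` finishes, you can optionally verify the setup with standard `gcloud` and `kubectl` commands (these checks are not part of the `ku` script itself):

```sh
gcloud container clusters list      # the new cluster should show as RUNNING
kubectl get nodes                   # the GPU nodes should be Ready
kubectl get pods --all-namespaces   # the cache-server pod should be Running
```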
Change to the `dltrainer` folder.
- The three important files are:
  - `train.py`: the PyTorch trainer
  - `model.py`: the neural network model
  - `dataset.py`: the dataset loader; this file also serves as an example of fetching data with the dlinputs library
- Build the Docker image:

  ```sh
  docker build -t name_of_your_docker_image .
  ```
- Tag the Docker image:

  ```sh
  docker tag name_of_your_docker_image gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1
  ```
- Upload the Docker image to a cloud repository:

  ```sh
  gcloud docker -- push gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1
  ```
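
  To confirm the push worked, you can query the registry with standard `gcloud` commands (shown here as an optional check):

  ```sh
  gcloud container images list --repository=gcr.io/$GCLOUD_PROJECT_NAME
  gcloud container images list-tags gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image
  ```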
- Configure the location of your image in `trainer-deploy.yml`:

  ```yaml
  spec:
    template:
      spec:
        containers:
        - name: imagenet-training
          image: gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1
  ```
- Configure the other parameters required for your training job (type of GPUs, number of GPUs) in `trainer-deploy.yml`:
  ```yaml
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: trainer-job
  spec:
    template:
      spec:
        containers:
        - name: imagenet-training
          image: gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1 # TODO Put location of your image on cloud repository
          command: ["python"]
          args:
          - "train.py"
          - "--devices"
          - "1" # TODO Set the number of GPUs required by your Job
          resources:
            limits:
              nvidia.com/gpu: 1 # TODO Set this number to the same number of GPUs required by your Job
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-tesla-k80
        restartPolicy: Never
    backoffLimit: 4
  ```
- Deploy your job:

  ```sh
  kubectl create -f trainer-job.yml
  ```
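
  After submission you can follow the job with standard `kubectl` commands; the Job controller labels its pods with `job-name=trainer-job` (the name set in the manifest above):

  ```sh
  kubectl get jobs                             # overall job status
  kubectl get pods -l job-name=trainer-job     # pods created for the job
  kubectl logs -f -l job-name=trainer-job      # stream the trainer's output
  ```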
The base Docker image also comes with a built-in profiler, which tracks GPU, CPU, and network utilization every 30 seconds.
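
For intuition only, a sampling loop in the spirit of that profiler might look like the sketch below; this is an assumption about the mechanism, not the actual implementation shipped in the base image.

```sh
# Illustrative sketch of a 30-second utilization sampler; the real profiler
# in the base image may be implemented differently.
while true; do
  date +%s
  nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader
  top -bn1 | head -n 5               # CPU and load-average summary
  tail -n +3 /proc/net/dev           # cumulative per-interface byte counters
  sleep 30
done
```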
Below is a sample dashboard from training a network across 4 GPUs.