This repo contains scripts and instructions to create your own Cloud Layer for efficiently running and scaling your DL training jobs.
Recommended setup:
- Use the sharded versions of CIFAR10 and ImageNet that are already available at the locations below (see the listing example after this list).
  - CIFAR10: http://storage.googleapis.com/lpr-demo
  - ImageNet: http://storage.googleapis.com/lpr-imagenet
- Create a Kubernetes cluster that meets your needs: GPU type (K80, P100, V100) and number of GPUs per node (1, 2, 4, 8).
- Create a Docker image of your trainer.
- Use the deploy script to launch your trainer (explained later).
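
If you want to peek at the available shards before wiring up your trainer, you can list the public buckets with `gsutil` (bundled with the Google Cloud SDK). The bucket names follow from the URLs above; this assumes the buckets allow anonymous listing.

```sh
# List the dataset shards in the public buckets.
gsutil ls gs://lpr-demo        # CIFAR10 shards
gsutil ls gs://lpr-imagenet    # ImageNet shards
```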
Repository contents:

- `dltrainer`: contains the Dockerfile and Kubernetes deployment file for the PyTorch trainer.
- `cache-server`: contains the Dockerfile and Kubernetes deployment file for the NGINX cache server.
- `ku`: script containing the commands to set up a Kubernetes cluster on Google Cloud.
- `kube-cluster-config.sh`: script containing the parameters that can be configured to customize the Kubernetes setup.
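
As a rough illustration of what you can customize, a config script of this kind might look like the sketch below. The variable names here are assumptions for illustration only (apart from `GCLOUD_PROJECT_NAME`, which is referenced later in this README); consult `kube-cluster-config.sh` in the repo for the actual parameters.

```sh
# Hypothetical sketch of kube-cluster-config.sh; the real parameter names
# live in the script itself.
GCLOUD_PROJECT_NAME=my-gcp-project   # GCP project that will host the cluster
CLUSTER_NAME=dl-training-cluster     # name of the GKE cluster to create
ZONE=us-central1-a                   # zone with the desired GPU availability
GPU_TYPE=nvidia-tesla-k80            # K80, P100, or V100
GPUS_PER_NODE=4                      # 1, 2, 4, or 8
NUM_NODES=2                          # number of nodes in the GPU node pool
```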
Before proceeding, please ensure you have the following packages installed locally (follow the installation instructions available online):
- Docker
- Google Cloud SDK: after installing the Google Cloud SDK, run `gcloud init`
- Set the parameters in `kube-cluster-config.sh`.
- Run `./ku init` and do not interrupt it midway. You can sanity-check the result with the commands shown after this list. The command does the following:
  - Creates a Kubernetes cluster
  - Creates a GPU node pool, with each node containing the requested number of GPUs
  - Installs the cache-server and deploys it into the cluster
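
Once `./ku init` finishes, you can optionally verify the setup with standard `gcloud` and `kubectl` commands (these checks are not part of the `ku` script itself):

```sh
gcloud container clusters list      # the new cluster should show as RUNNING
kubectl get nodes                   # the GPU nodes should be Ready
kubectl get pods --all-namespaces   # the cache-server pod should be Running
```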
Change to the `dltrainer` folder.
- The three important files are:
  - `train.py`: the PyTorch trainer
  - `model.py`: the neural network model
  - `dataset.py`: the dataset loader; this file also serves as an example of fetching data with the dlinputs library
- Build the Docker image:

  ```sh
  docker build -t name_of_your_docker_image .
  ```
- Tag the Docker image:

  ```sh
  docker tag name_of_your_docker_image gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1
  ```
- Upload the Docker image to a cloud repository:

  ```sh
  gcloud docker -- push gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1
  ```
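
  To confirm the push worked, you can query the registry with standard `gcloud` commands (shown here as an optional check):

  ```sh
  gcloud container images list --repository=gcr.io/$GCLOUD_PROJECT_NAME
  gcloud container images list-tags gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image
  ```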
- Configure the location of your image in `trainer-deploy.yml`:

  ```yaml
  spec:
    template:
      spec:
        containers:
        - name: imagenet-training
          image: gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1
  ```
- Configure the other parameters required for your training job (type of GPUs, number of GPUs) in `trainer-deploy.yml`:
  ```yaml
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: trainer-job
  spec:
    template:
      spec:
        containers:
        - name: imagenet-training
          image: gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1 # TODO Put location of your image on cloud repository
          command: ["python"]
          args:
          - "train.py"
          - "--devices"
          - "1" # TODO Set the number of GPUs required by your Job
          resources:
            limits:
              nvidia.com/gpu: 1 # TODO Set this number to the same number of GPUs required by your Job
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-tesla-k80
        restartPolicy: Never
    backoffLimit: 4
  ```
- Deploy your job:

  ```sh
  kubectl create -f trainer-job.yml
  ```
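
  After submission you can follow the job with standard `kubectl` commands; the Job controller labels its pods with `job-name=trainer-job` (the name set in the manifest above):

  ```sh
  kubectl get jobs                             # overall job status
  kubectl get pods -l job-name=trainer-job     # pods created for the job
  kubectl logs -f -l job-name=trainer-job      # stream the trainer's output
  ```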
The base Docker image also comes with a built-in profiler, which tracks GPU, CPU, and network utilization every 30 seconds.
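
For intuition only, a sampling loop in the spirit of that profiler might look like the sketch below; this is an assumption about the mechanism, not the actual implementation shipped in the base image.

```sh
# Illustrative sketch of a 30-second utilization sampler; the real profiler
# in the base image may be implemented differently.
while true; do
  date +%s
  nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader
  top -bn1 | head -n 5               # CPU and load-average summary
  tail -n +3 /proc/net/dev           # cumulative per-interface byte counters
  sleep 30
done
```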
Below is a sample dashboard from training a network across 4 GPUs.