- dltrainer: Contains the Dockerfile and kubernetes deployment file for the PyTorch trainer.
- cache-server: Contains the Dockerfile and kubernetes deployment file for the NGINX cache-server.
- ku script: Contains commands to set up a kubernetes cluster on Google Cloud.
- kube-cluster-config.sh script: Contains parameters that can be configured to customize the kubernetes setup.
Before proceeding, please ensure you have the following packages installed locally; follow the instructions available online.
- Docker
- Google Cloud SDK: After installing the Google Cloud SDK, run `gcloud init`.
- Set parameters in `kube-cluster-config.sh`.
- Call `./ku init`. Please don't kill the execution in between. The command does the following (the kind of gcloud commands involved is sketched after this list):
  - Creates a kubernetes cluster
  - Creates a GPU node-pool with each node containing the requested number of GPUs
  - Installs the cache-server and deploys it into the cluster
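For orientation, the sketch below shows the kind of gcloud commands a cluster-setup script like `ku` typically wraps. The cluster name, zone, GPU type, and counts are placeholders; the actual script takes its values from `kube-cluster-config.sh` and may differ in structure.

```sh
# Rough sketch only: illustrative commands for a GKE cluster with a GPU
# node-pool. Names, zone, GPU type, and counts below are placeholders.

# Create the base kubernetes cluster.
gcloud container clusters create my-cluster --zone us-west1-b --num-nodes 1

# Point kubectl at the new cluster.
gcloud container clusters get-credentials my-cluster --zone us-west1-b

# Add a node-pool whose nodes each carry the requested GPUs.
gcloud container node-pools create gpu-pool \
    --cluster my-cluster --zone us-west1-b \
    --accelerator type=nvidia-tesla-k80,count=2 \
    --num-nodes 1

# On GKE, NVIDIA drivers are installed by applying Google's driver-installer
# DaemonSet before GPU pods can be scheduled.
```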
Change to the dltrainer folder.
- Three important files are:
  - train.py: PyTorch trainer
  - model.py: Neural network model
  - dataset.py: Dataset loader. This file also serves as an example of fetching data using the dlinputs library.
- Build the docker image: `docker build -t name_of_your_docker_image .`
- Tag the docker image: `docker tag name_of_your_docker_image gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1`
- Upload the docker image to a cloud repository: `gcloud docker -- push gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1`
- Configure the location of your image in `trainer-job.yml` (see the sketch at the end of this list).
- Configure other parameters required for your training job in `trainer-deploy.yml`: type of GPUs, number of GPUs (see the sketch at the end of this list).
- Deploy your job: `kubectl create -f trainer-job.yml`
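For orientation, below is a minimal sketch of the two settings that typically need editing in such a Job manifest: the container image location, and the GPU type and count. It is submitted inline via a heredoc; the job name, project, image, and GPU type are placeholders, and the actual trainer-job.yml / trainer-deploy.yml in this repo may be organized differently.

```sh
# Sketch only: shows where the image and GPU settings usually live in a
# kubernetes Job manifest. All names below are placeholders.
kubectl create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: dltrainer
spec:
  template:
    spec:
      restartPolicy: Never
      # GPU type: on GKE, nodes of the GPU node-pool carry this label.
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-k80
      containers:
      - name: dltrainer
        # Location of the image pushed to the container registry above.
        image: gcr.io/my-project/name_of_your_docker_image:v1
        # Number of GPUs requested by the trainer pod.
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```

Once the job is created, `kubectl get pods` and `kubectl logs -f <pod-name>` can be used to follow the training run.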