- dltrainer: Contains the Dockerfile and kubernetes deployment file for the PyTorch trainer.
- cache-server: Contains the Dockerfile and kubernetes deployment file for the NGINX cache-server.
- ku script: Contains commands to set up a kubernetes cluster on Google Cloud.
- kube-cluster-config.sh script: Contains parameters that can be configured to customize the kubernetes setup.
Before proceeding, please ensure you have the following packages installed locally; follow the instructions available online.
- Docker
- Google Cloud SDK: After installing the Google Cloud SDK, run `gcloud init`.
- Set parameters in `kube-cluster-config.sh`.
- Call `./ku init`. Please don't kill the execution in between. The command does the following (the kind of gcloud commands involved is sketched after this list):
  - Creates a kubernetes cluster
  - Creates a GPU node-pool with each node containing the requested number of GPUs
  - Installs the cache-server and deploys it into the cluster
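For orientation, the sketch below shows the kind of gcloud commands a cluster-setup script like `ku` typically wraps. The cluster name, zone, GPU type, and counts are placeholders; the actual script takes its values from `kube-cluster-config.sh` and may differ in structure.

```sh
# Rough sketch only: illustrative commands for a GKE cluster with a GPU
# node-pool. Names, zone, GPU type, and counts below are placeholders.

# Create the base kubernetes cluster.
gcloud container clusters create my-cluster --zone us-west1-b --num-nodes 1

# Point kubectl at the new cluster.
gcloud container clusters get-credentials my-cluster --zone us-west1-b

# Add a node-pool whose nodes each carry the requested GPUs.
gcloud container node-pools create gpu-pool \
    --cluster my-cluster --zone us-west1-b \
    --accelerator type=nvidia-tesla-k80,count=2 \
    --num-nodes 1

# On GKE, NVIDIA drivers are installed by applying Google's driver-installer
# DaemonSet before GPU pods can be scheduled.
```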
Change to the dltrainer folder.
- Three important files are:
  - train.py: PyTorch trainer
  - model.py: Neural network model
  - dataset.py: Dataset loader. This file also serves as an example of fetching data using the dlinputs library.
- Build the docker image: `docker build -t name_of_your_docker_image .`
- Tag the docker image: `docker tag name_of_your_docker_image gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1`
- Upload the docker image to a cloud repository: `gcloud docker -- push gcr.io/$GCLOUD_PROJECT_NAME/name_of_your_docker_image:v1`
- Configure the location of your image in `trainer-job.yml` (see the sketch at the end of this list).
- Configure other parameters required for your training job in `trainer-deploy.yml`: type of GPUs, number of GPUs (see the sketch at the end of this list).
- Deploy your job: `kubectl create -f trainer-job.yml`
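For orientation, below is a minimal sketch of the two settings that typically need editing in such a Job manifest: the container image location, and the GPU type and count. It is submitted inline via a heredoc; the job name, project, image, and GPU type are placeholders, and the actual trainer-job.yml / trainer-deploy.yml in this repo may be organized differently.

```sh
# Sketch only: shows where the image and GPU settings usually live in a
# kubernetes Job manifest. All names below are placeholders.
kubectl create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: dltrainer
spec:
  template:
    spec:
      restartPolicy: Never
      # GPU type: on GKE, nodes of the GPU node-pool carry this label.
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-k80
      containers:
      - name: dltrainer
        # Location of the image pushed to the container registry above.
        image: gcr.io/my-project/name_of_your_docker_image:v1
        # Number of GPUs requested by the trainer pod.
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```

Once the job is created, `kubectl get pods` and `kubectl logs -f <pod-name>` can be used to follow the training run.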