This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

Shell script to create env variables in case of cloud shell timeout #55

Open · wants to merge 19 commits into `master`
25 changes: 25 additions & 0 deletions 2-kubernetes/README.md
@@ -233,6 +233,31 @@ spec:
restartPolicy: OnFailure # restart the pod if it fails
```

For non-GPU clusters:
```yaml
apiVersion: batch/v1
kind: Job # Our training should be a Job since it is supposed to terminate at some point
metadata:
  name: module2-ex1 # Name of our job
spec:
  template: # Template of the Pod that is going to be run by the Job
    metadata:
      name: module2-ex1 # Name of the pod
    spec:
      containers: # List of containers that should run inside the pod, in our case there is only one.
        - name: tensorflow
          image: ${DOCKER_USERNAME}/tf-mnist:cpu # The image to run; you can replace it with your own.
          args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
      restartPolicy: OnFailure # restart the pod if it fails
```

Save this template somewhere and deploy it with:

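For example, assuming you saved the manifest as `module2-ex1.yaml` (the filename is just an example):

```console
kubectl create -f module2-ex1.yaml
```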
31 changes: 30 additions & 1 deletion 4-kubeflow/README.md
@@ -31,6 +31,25 @@ Kubeflow uses [`ksonnet`](https://github.com/ksonnet/ksonnet) templates as a way

First, install ksonnet version [0.13.1](https://ksonnet.io/#get-started), or you can [download a prebuilt binary](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1) for your OS.

Pull down ksonnet in Cloud Shell:

```bash
wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz
```

Untar the archive:

```bash
tar -zxvf ks_0.13.1_linux_amd64.tar.gz
```

Add the ksonnet CLI to the Cloud Shell path:

```bash
PATH=$PATH:~/ks_0.13.1_linux_amd64/
```
> NOTE: You may have to run this again if your Cloud Shell times out, as it will not persist across sessions.
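To avoid re-running this after a timeout (a minimal sketch, assuming the tarball was extracted into your home directory), you can append the export to `~/.bashrc`, which Cloud Shell normally persists across sessions:

```bash
# append the ksonnet directory to PATH for future sessions (assumes ~/ks_0.13.1_linux_amd64/)
echo 'export PATH=$PATH:~/ks_0.13.1_linux_amd64/' >> ~/.bashrc
source ~/.bashrc
```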


Then run the following commands to download Kubeflow:

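A sketch of the v0.4-era download flow that the `kfctl.sh` commands below assume (`KUBEFLOW_SRC` and `KUBEFLOW_TAG` are placeholder values you choose; adjust the tag to the release the lab targets):

```bash
export KUBEFLOW_SRC=~/kubeflow   # directory that will hold the Kubeflow source (placeholder)
export KUBEFLOW_TAG=v0.4.1       # Kubeflow release tag (placeholder)
mkdir -p ${KUBEFLOW_SRC} && cd ${KUBEFLOW_SRC}
curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
```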
@@ -65,6 +84,16 @@ ${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s

`kubectl get pods -n kubeflow`

To make the `kubeflow` namespace the default for your context, first list your contexts:
```bash
kubectl config get-contexts
```
Find the name of your existing context, which is the name of the cluster, then set its default namespace:
```bash
kubectl config set-context aks-ejv --namespace kubeflow
```
Use your own cluster name instead of `aks-ejv`.
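To double-check that the default namespace took effect:

```bash
kubectl config view --minify --output 'jsonpath={..namespace}'
```

This should print `kubeflow`.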

`kubectl get pods -n kubeflow` should return something like this:

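An abridged, illustrative listing (pod names, hashes, and ages will differ on your cluster):

```
NAME                                       READY   STATUS    RESTARTS   AGE
jupyter-0                                  1/1     Running   0          5m
tf-job-operator-v1beta1-5949f668f7-j5zrn   1/1     Running   0          5m
workflow-controller-cf79dfbff-lv7jk        1/1     Running   0          5m
```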
@@ -100,7 +129,7 @@ kubeflow workflow-controller-cf79dfbff-lv7jk 1/1

The most important components for the purpose of this lab are `jupyter-0`, which is the JupyterHub spawner running on your cluster, and `tf-job-operator-v1beta1-5949f668f7-j5zrn`, which is a controller that monitors your cluster for new TensorFlow training job specifications (called `TFJobs`) and manages the training. We will look at these two components later.

### Remove Kubeflow (only if you are done with the labs!)

If you want to remove the Kubeflow deployment, you can run the following to remove the namespace and installed components:

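If you deployed with the `kfctl.sh` flow above, removal typically looks like this (a sketch; `KUBEFLOW_SRC` is the directory you downloaded Kubeflow into):

```bash
${KUBEFLOW_SRC}/scripts/kfctl.sh delete k8s
```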
10 changes: 7 additions & 3 deletions 5-jupyterhub/README.md
@@ -68,15 +68,19 @@ Then navigate to JupyterHub: http://localhost:8080/hub
To update the default service created for JupyterHub, run the following commands to change the service to type LoadBalancer:

```bash
cd kubeflow/mykubeflowapp/ks_app
ks param set jupyter serviceType LoadBalancer
cd ..
~/kubeflow/scripts/kfctl.sh apply k8s
```
Wait for the public IP of the Jupyter service:
```bash
kubectl get svc -w
```
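Once an external IP shows up, you can also read it directly with a jsonpath query (this assumes the LoadBalancer service is named `tf-hub-lb`, the name referenced below; substitute the name shown by `kubectl get svc` if yours differs):

```bash
kubectl get svc tf-hub-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```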

Create a new Jupyter Notebook instance:

- open `http://<PublicIP_OF_JUPYTER_SVC>/hub/` in your browser (this is the public IP of the `tf-hub-lb` service)
- log in using any username and password
- click the "Start My Server" button to spawn a new Jupyter notebook
- from the image dropdown, select a tensorflow image for your notebook
138 changes: 138 additions & 0 deletions 6-tfjob/README.md
@@ -102,6 +102,24 @@ spec:
restartPolicy: OnFailure
```

For Non-GPU clusters:
```yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: module6-ex1
spec:
  tfReplicaSpecs:
    MASTER:
      replicas: 1
      template:
        spec:
          containers:
            - image: <DOCKER_USERNAME>/tf-mnist:cpu # From module 1
              name: tensorflow
          restartPolicy: OnFailure
```

Save the template that applies to you in a file, and create the `TFJob`:

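For example, assuming the template you picked was saved as `module6-ex1.yaml` (the filename is just an example):

```console
kubectl create -f module6-ex1.yaml
```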
@@ -183,6 +201,78 @@ Be aware of a few details first:
- PVC are namespaced so be sure to create it on the same namespace that is launching the TFJob objects
- If you are using RBAC, you might need to create the cluster role and binding: [see docs here](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv#create-a-cluster-role-and-binding)

Create an `azurefiles-rbac.yaml` file:
```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:azure-cloud-provider
rules:
  - apiGroups: ['']
    resources: ['secrets']
    verbs: ['get', 'create']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:azure-cloud-provider
roleRef:
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
  name: system:azure-cloud-provider
subjects:
  - kind: ServiceAccount
    name: persistent-volume-binder
    namespace: kube-system
```

Apply the RBAC manifest:
```bash
kubectl apply -f azurefiles-rbac.yaml
```

Create an `azurefiles-class.yaml` file:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=1000
  - gid=1000
parameters:
  skuName: Standard_LRS
```

Apply the storage class to the cluster:
```bash
kubectl apply -f azurefiles-class.yaml
```
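To confirm the storage class was registered:

```bash
kubectl get storageclass azurefile
```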

Create an `azurefiles-pvc.yaml` file:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile
  resources:
    requests:
      storage: 5Gi
```

Apply the PVC to the cluster:
```bash
kubectl apply -f azurefiles-pvc.yaml
```
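Provisioning the underlying Azure file share can take a minute; check that the claim is bound before moving on:

```bash
kubectl get pvc azurefile
```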

Once you have completed all the steps, run:

@@ -219,6 +309,22 @@ Turns out mounting an Azure File share into a container is really easy, we simpl
claimName: azurefile
```

For non-GPU Clusters:
```yaml
[...]
containers:
  - image: <IMAGE>
    name: tensorflow
    volumeMounts:
      - name: azurefile
        subPath: module6-ex2
        mountPath: /tmp/tensorflow
volumes:
  - name: azurefile
    persistentVolumeClaim:
      claimName: azurefile
```

Update your template from exercise 1 to mount the Azure File share into your container, and create your new job.

Once the container starts running, if you go to the Azure Portal, into your storage account, and browse your `tensorflow` file share, you should see something like this:
@@ -265,6 +371,38 @@ spec:
claimName: azurefile
```

For non-GPU clusters:
```yaml
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: module6-ex2
spec:
  tfReplicaSpecs:
    MASTER:
      replicas: 1
      template:
        spec:
          containers:
            - image: <DOCKER_USERNAME>/tf-mnist:cpu
              name: tensorflow
              volumeMounts:
                # By default our classifier saves the summaries in /tmp/tensorflow,
                # so that's where we want to mount our Azure File Share.
                - name: azurefile
                  # The subPath allows us to mount a subdirectory within the azure file share instead of root,
                  # this is useful so that we can save the logs for each run in a different subdirectory
                  # instead of overwriting what was done before.
                  subPath: module6-ex2
                  mountPath: /tmp/tensorflow
          restartPolicy: OnFailure
          volumes:
            - name: azurefile
              persistentVolumeClaim:
                claimName: azurefile
```
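To run it, assuming you saved this manifest as `module6-ex2.yaml` (any filename works):

```console
kubectl create -f module6-ex2.yaml
```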


</details>

## Next Step