Migrating from the ai-on-gke repository (#2)

* Migration from ai-on-gke repository
* Renamed repository
* Restructured repository
* Added ml-platform use case labels

Co-authored-by: Kent Hua <[email protected]>
Co-authored-by: Kavitha Rajendran <[email protected]>
Co-authored-by: Ali Zaidi <[email protected]>
Co-authored-by: Shobhit Gupta <[email protected]>
Co-authored-by: Xiang Shen <[email protected]>
Co-authored-by: Ishmeet Mehta <[email protected]>
Co-authored-by: Laurent Grangeau <[email protected]>
Co-authored-by: Jun Sheng <[email protected]>

Commit 73971a5 (parent 0cff600): 257 changed files with 182,422 additions and 1 deletion.

.gitignore

@@ -1,2 +1,16 @@
# IDEs
*.code-workspace

# Python
__pycache__/
.venv/
venv/

# Terraform
*.terraform/
*.terraform-*/
*.terraform.lock.hcl

# Test
test/log/*.log
test/scripts/environment_files/*

README.md

@@ -1 +1,6 @@
-# Google Cloud AI/ML Platforms
+# Google Cloud Accelerated Platform Reference Architectures
+
+This repository is a collection of accelerated platform reference architectures and use cases for Google Cloud.
+
+- [GKE AI/ML Platform for enabling AI/ML Ops](/docs/platforms/gke-aiml/README.md)
+- [Model Fine Tuning Pipeline](/docs/use-cases/model-fine-tuning-pipeline/README.md)

docs/guides/packaging-jupyter-notebooks/README.md

@@ -0,0 +1,95 @@
# Packaging Jupyter notebook as deployable code

Jupyter notebooks are widely used by data scientists and machine learning experts in their day-to-day work to develop interactively and iteratively. However, the `ipynb` format is typically not used as a deployable or packageable artifact. There are two common scenarios in which notebooks are converted to deployable/packageable artifacts:

1. Model training tasks that need to be converted to batch jobs to scale up with more computational resources
1. Model inference tasks that need to be converted to an API server to serve end-user requests

In this guide we showcase two different tools that can help convert your notebook to a deployable/packageable raw Python file.

This process can also be automated using Continuous Integration (CI) tools such as [Cloud Build](https://cloud.google.com/build/).

## Use jupytext to convert the notebook to raw Python and containerize it

1. Update the notebook to `Pair Notebook with Percent Format`

   Jupytext comes with recent versions of Jupyter Notebook and JupyterLab. In addition to converting from `ipynb` to Python, it can pair the two formats, so updates made to the `ipynb` file are propagated to the Python file and vice versa.

   To pair the notebook, use the pair function in the File menu:

   ![jupyter-pairing](images/jupyter-pairing.png)

   In this example we use the file [gpt-j-online.ipynb](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/examples/notebooks/gpt-j-online.ipynb):

   ![jupyter-gpt-j-online-ipynb](images/jupyter-gpt-j-online-ipynb.png)

1. After pairing, we get the generated raw Python:

   ![jupyter-gpt-j-online-py](images/jupyter-gpt-j-online-py.png)

   **NOTE**: This conversion can also be performed via the `jupytext` CLI with the following command:

   ```sh
   jupytext --set-formats ipynb,py:percent \
     --to py gpt-j-online.ipynb
   ```

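   Whether it is produced from the File menu or the CLI, the paired file uses jupytext's percent format, in which each notebook cell is delimited by a `# %%` marker. As a minimal illustrative sketch (not the actual contents of the GPT-J notebook), the generated file looks roughly like this:

   ```python
   # ---
   # jupyter:
   #   jupytext:
   #     formats: ipynb,py:percent
   #   kernelspec:
   #     display_name: Python 3
   #     language: python
   #     name: python3
   # ---

   # %% [markdown]
   # Markdown cells are carried over as commented markdown.

   # %%
   # Code cells are copied verbatim under a "# %%" marker; the import below is illustrative.
   import ray
   ```
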
1. Extract the module dependencies

   In the notebook environment, users typically install the required Python modules using `pip install` commands, but in the container environment these dependencies need to be installed into the image before the Python file is executed.

   We can use the `pipreqs` tool to generate the dependencies. Add the following snippet in a new cell of your notebook and run it:

   ```sh
   !pip install pipreqs
   !pipreqs --scan-notebooks
   ```

   The following is an example of the output:

   ![jupyter-generate-requirements](images/jupyter-generate-requirements.png)

   **NOTE**: The `!cat requirements.txt` line in the screenshot shows the contents of the generated `requirements.txt`.

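   Because the screenshot may not render here, the shape of a generated `requirements.txt` is shown below. The package names and pinned versions are purely illustrative placeholders; your output depends on the imports `pipreqs` finds in your notebook.

   ```text
   # Illustrative placeholders only; pipreqs pins the versions found in your environment.
   ray==x.y.z
   torch==x.y.z
   transformers==x.y.z
   ```
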
1. Create the Dockerfile

   To build the container image for your generated raw Python, we need to create a `Dockerfile`; below is an example. Replace `_THE_GENERATED_PYTHON_FILE_` with your generated Python file:

   ```Dockerfile
   FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

   RUN apt-get update && \
       apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
       rm -rf /var/lib/apt/lists/*

   # Copy both the requirements file and the generated Python file into the image root.
   COPY requirements.txt _THE_GENERATED_PYTHON_FILE_ /

   RUN pip3 install --no-cache-dir -r requirements.txt

   ENV PYTHONUNBUFFERED=1

   CMD python3 /_THE_GENERATED_PYTHON_FILE_
   ```

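   With the `Dockerfile` in place, the image can be built and pushed to a registry so it can run as a batch job or serving workload on GKE. The following is a minimal sketch; the Artifact Registry path (`us-docker.pkg.dev/PROJECT_ID/REPO/notebook-job`) is a hypothetical placeholder, so substitute your own project, repository, and tag:

   ```sh
   # Build the image locally and push it to Artifact Registry (the path is a placeholder).
   docker build -t us-docker.pkg.dev/PROJECT_ID/REPO/notebook-job:v1 .
   docker push us-docker.pkg.dev/PROJECT_ID/REPO/notebook-job:v1

   # Alternatively, let Cloud Build build and push the image in one step.
   gcloud builds submit --tag us-docker.pkg.dev/PROJECT_ID/REPO/notebook-job:v1 .
   ```
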
1. [Optional] Lint and remove unused code

   Using `pylint` to validate the generated code is a good practice. Pylint can detect unordered `import` statements and unused code, and it provides code readability suggestions.

   To use `pylint`, create a new cell in your notebook, run the code below, and replace `_THE_GENERATED_PYTHON_FILE_` with your filename:

   ```sh
   !pip install pylint
   !pylint _THE_GENERATED_PYTHON_FILE_
   ```

## Use nbconvert to convert the notebook to raw Python

We can convert a Jupyter notebook to a Python script using the nbconvert tool.
The nbconvert tool is available inside your Jupyter notebook environment in Google Colab Enterprise. If you are in another environment and it is not available, it can be installed from [PyPI](https://pypi.org/project/nbconvert/).

1. Run the nbconvert command in your notebook. In this example, we use `gsutil` to copy the notebook into the Colab Enterprise environment (a sketch of the copy command is shown at the end of this step) and then convert it:

   ```sh
   !jupyter nbconvert --to python Fine-tune-Llama-Google-Colab.ipynb
   ```

   Below is an example of the commands:

   ![jupyter-nbconvert](images/jupyter-nbconvert.png)

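   The `gsutil` copy mentioned above might look like the following; the bucket and object path are hypothetical placeholders:

   ```sh
   # Copy the notebook from Cloud Storage into the Colab Enterprise runtime
   # (replace the bucket and path with your own).
   !gsutil cp gs://YOUR_BUCKET/notebooks/Fine-tune-Llama-Google-Colab.ipynb .
   ```
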
Binary file added (+226 KB): docs/guides/packaging-jupyter-notebooks/images/jupyter-generate-requirements.png
Binary file added (+496 KB): docs/guides/packaging-jupyter-notebooks/images/jupyter-gpt-j-online-ipynb.png
Binary file added (+204 KB): docs/guides/packaging-jupyter-notebooks/images/jupyter-gpt-j-online-py.png

docs/platforms/gke-aiml/README.md

@@ -0,0 +1,71 @@
# GKE AI/ML Platform reference architecture for enabling Machine Learning Operations (MLOps)

## Platform Principles

This reference architecture demonstrates how to build a GKE platform that facilitates machine learning. The reference architecture is based on the following principles:

- The platform admin will create the GKE platform using an IaC tool like [Terraform](https://www.terraform.io/). The IaC will come with reusable modules that can be referenced to create more resources as demand grows.
- The platform will be based on [GitOps](https://about.gitlab.com/topics/gitops/).
- After the GKE platform has been created, cluster-scoped resources on it will be created by the admins through [Config Sync](https://cloud.google.com/anthos-config-management/docs/config-sync-overview).
- Platform admins will create a namespace per application and provide the application team members full access to it.
- The namespace-scoped resources will be created by the application/ML teams either via Config Sync or through a deployment tool like [Cloud Deploy](https://cloud.google.com/deploy).

For an outline of products and features used in the platform, see the [Platform Products and Features](products-and-features.md) document.

## Critical User Journeys (CUJs)

### Persona: Platform Admin

- Offer a platform that incorporates established best practices.
- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads.
- Establish secure channels for end users to interact seamlessly with the platform.
- Empower the enforcement of robust security policies across the platform.

### Persona: Machine Learning Engineer

- Deploy the model with ease and make the endpoints available only to the intended audience
- Continuously monitor the model performance and resource utilization
- Troubleshoot any performance or integration issues
- Ability to version, store, and access the models and model artifacts:
  - To debug and troubleshoot in production and trace back to the specific model version and associated training data
  - To roll back quickly and in a controlled way to a previous, more stable version
- Implement the feedback loop to adapt to changing data and business needs:
  - Ability to retrain / fine-tune the model
  - Ability to split traffic between models (A/B testing)
  - Ability to switch between models without breaking the inference system for end users
- Ability to scale the infrastructure up and down to accommodate changing needs
- Ability to share insights and findings with stakeholders to make data-driven decisions

### Persona: Machine Learning Operator

- Provide and maintain software required by the end users of the platform.
- Operationalize experimental workloads by providing guidance and best practices for running them on the platform.
- Deploy the workloads on the platform.
- Assist with enabling observability and monitoring for the workloads to ensure smooth operations.

## Prerequisites

- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com), which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and other tools that are required to complete this tutorial.
- Familiarity with the following:
  - [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine)
  - [Terraform](https://www.terraform.io/)
  - [git](https://git-scm.com/)
  - [Google Configuration Management root-sync](https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields)
  - [Google Configuration Management repo-sync](https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields)
  - [GitHub](https://github.com/)

## Deploy the platform

[Playground Reference Architecture](/platforms/gke-aiml/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.

## Use cases

- [Model Fine Tuning Pipeline](/docs/use-cases/model-fine-tuning-pipeline/README.md)
  - [Distributed Data Processing with Ray](/use-cases/model-fine-tuning-pipeline/data-processing/ray/README.md): Run a distributed data processing job using Ray.
  - [Dataset Preparation for Fine Tuning Gemma IT With Gemini Flash](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md): Generate prompts for fine tuning the Gemma Instruction Tuned model with Vertex AI Gemini Flash.
  - [Fine Tuning Gemma2 9B IT model With FSDP](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md): Fine tune the Gemma2 9B IT model with PyTorch FSDP.
  - [Model evaluation and validation](/use-cases/model-fine-tuning-pipeline/model-eval/README.md): Evaluation and validation of the fine-tuned Gemma2 9B IT model.

## Resources

- [Packaging Jupyter notebooks](/docs/guides/packaging-jupyter-notebooks/README.md): Patterns and tools to get your `ipynb` files ready for deployment in a container runtime.

@@ -0,0 +1,46 @@
# Playground AI/ML Platform on GKE: Architecture

![Playground Architecture](/docs/platforms/gke-aiml/playground/images/architecture.svg)

## Platform

- [Google Cloud Project](https://console.cloud.google.com/cloud-resource-manager)
  - Environment project
  - Service APIs
- [Cloud Storage](https://console.cloud.google.com/storage/browser)
  - Terraform bucket
- [VPC networks](https://console.cloud.google.com/networking/networks/list)
  - VPC network
  - Subnet
- [Cloud Router](https://console.cloud.google.com/hybrid/routers/list)
  - Cloud NAT gateway
- [Google Kubernetes Engine (GKE)](https://console.cloud.google.com/kubernetes/list/overview)
  - Standard Cluster
    - CPU on-demand node pool
    - CPU system node pool
    - GPU on-demand node pool
    - GPU spot node pool
- [Google Kubernetes Engine (GKE) Enterprise](https://cloud.google.com/kubernetes-engine/enterprise/docs)
  - Configuration Management
    - Config Sync
    - Policy Controller
  - Connect gateway
  - Fleet
  - Security posture dashboard
  - Threat detection
- Git repository
  - Config Sync

### Each namespace

- [Load Balancer](https://console.cloud.google.com/net-services/loadbalancing/list/loadBalancers)
  - Gateway External Load Balancer
- [Classic SSL Certificate](https://console.cloud.google.com/security/ccm/list/lbCertificates)
  - Gateway SSL Certificate
    - Ray dashboard
- [Identity-Aware Proxy (IAP)](https://cloud.google.com/iap/docs/concepts-overview)
  - Ray head Backend Service
- [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccount)
  - Default
  - Ray head
  - Ray worker

docs/platforms/gke-aiml/playground/images/architecture.svg: 263 additions & 0 deletions
docs/platforms/gke-aiml/playground/images/architecture_full.svg: 339 additions & 0 deletions