Migrating from the ai-on-gke repository (#2)

* Migration from ai-on-gke repository
* Renamed repository
* Restructured repository
* Added ml-platform use case labels

Co-authored-by: Kent Hua <[email protected]>
Co-authored-by: Kavitha Rajendran <[email protected]>
Co-authored-by: Ali Zaidi <[email protected]>
Co-authored-by: Shobhit Gupta <[email protected]>
Co-authored-by: Xiang Shen <[email protected]>
Co-authored-by: Ishmeet Mehta <[email protected]>
Co-authored-by: Laurent Grangeau <[email protected]>
Co-authored-by: Jun Sheng <[email protected]>

Commit 73971a5 (parent 0cff600): 257 changed files with 182,422 additions and 1 deletion.

.gitignore

@@ -1,2 +1,16 @@
# IDEs
*.code-workspace

# Python
__pycache__/
.venv/
venv/

# Terraform
*.terraform/
*.terraform-*/
*.terraform.lock.hcl

# Test
test/log/*.log
test/scripts/environment_files/*

README.md

@@ -1 +1,6 @@
-# Google Cloud AI/ML Platforms
+# Google Cloud Accelerated Platform Reference Architectures
+
+This repository is a collection of accelerated platform reference architectures and use cases for Google Cloud.
+
+- [GKE AI/ML Platform for enabling AI/ML Ops](/docs/platforms/gke-aiml/README.md)
+- [Model Fine Tuning Pipeline](/docs/use-cases/model-fine-tuning-pipeline/README.md)

docs/guides/packaging-jupyter-notebooks/README.md

@@ -0,0 +1,95 @@
# Packaging Jupyter notebook as deployable code

Jupyter notebooks are widely used by data scientists and machine learning experts in their day-to-day work to develop interactively and iteratively. However, the `ipynb` format is typically not used as a deployable or packageable artifact. There are two common scenarios in which notebooks are converted to deployable/packageable artifacts:

1. Model training tasks that need to be converted to batch jobs to scale up with more computational resources
1. Model inference tasks that need to be converted to an API server to serve end-user requests

In this guide we showcase two different tools that can help convert your notebook to a deployable/packageable raw Python file.

This process can also be automated using Continuous Integration (CI) tools such as [Cloud Build](https://cloud.google.com/build/).

## Use jupytext to convert the notebook to raw Python and containerize it

1. Update the notebook to `Pair Notebook with Percent Format`

   Jupytext comes with recent versions of Jupyter Notebook and JupyterLab. In addition to converting from `ipynb` to Python, it can pair the two formats, so updates made to the `ipynb` file are propagated to the Python file and vice versa.

   To pair the notebook, use the pair function in the File menu:

   ![jupyter-pairing](images/jupyter-pairing.png)

   In this example we use the file [gpt-j-online.ipynb](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/examples/notebooks/gpt-j-online.ipynb):

   ![jupyter-gpt-j-online-ipynb](images/jupyter-gpt-j-online-ipynb.png)

1. After pairing, we get the generated raw Python:

   ![jupyter-gpt-j-online-py](images/jupyter-gpt-j-online-py.png)

   **NOTE**: This conversion can also be performed via the `jupytext` CLI with the following command:

   ```sh
   jupytext --set-formats ipynb,py:percent \
     --to py gpt-j-online.ipynb
   ```

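   Whether it is produced from the File menu or the CLI, the paired file uses jupytext's percent format, in which each notebook cell is delimited by a `# %%` marker. As a minimal illustrative sketch (not the actual contents of the GPT-J notebook), the generated file looks roughly like this:

   ```python
   # ---
   # jupyter:
   #   jupytext:
   #     formats: ipynb,py:percent
   #   kernelspec:
   #     display_name: Python 3
   #     language: python
   #     name: python3
   # ---

   # %% [markdown]
   # Markdown cells are carried over as commented markdown.

   # %%
   # Code cells are copied verbatim under a "# %%" marker; the import below is illustrative.
   import ray
   ```
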
1. Extract the module dependencies

   In the notebook environment, users typically install the required Python modules using `pip install` commands, but in the container environment these dependencies need to be installed into the image before the Python file is executed.

   We can use the `pipreqs` tool to generate the dependencies. Add the following snippet in a new cell of your notebook and run it:

   ```sh
   !pip install pipreqs
   !pipreqs --scan-notebooks
   ```

   The following is an example of the output:

   ![jupyter-generate-requirements](images/jupyter-generate-requirements.png)

   **NOTE**: The `!cat requirements.txt` line in the screenshot shows the contents of the generated `requirements.txt`.

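   Because the screenshot may not render here, the shape of a generated `requirements.txt` is shown below. The package names and pinned versions are purely illustrative placeholders; your output depends on the imports `pipreqs` finds in your notebook.

   ```text
   # Illustrative placeholders only; pipreqs pins the versions found in your environment.
   ray==x.y.z
   torch==x.y.z
   transformers==x.y.z
   ```
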
1. Create the Dockerfile

   To build the container image for your generated raw Python, we need to create a `Dockerfile`; below is an example. Replace `_THE_GENERATED_PYTHON_FILE_` with your generated Python file:

   ```Dockerfile
   FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

   RUN apt-get update && \
       apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
       rm -rf /var/lib/apt/lists/*

   # Copy both the requirements file and the generated Python file into the image root.
   COPY requirements.txt _THE_GENERATED_PYTHON_FILE_ /

   RUN pip3 install --no-cache-dir -r requirements.txt

   ENV PYTHONUNBUFFERED=1

   CMD python3 /_THE_GENERATED_PYTHON_FILE_
   ```

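   With the `Dockerfile` in place, the image can be built and pushed to a registry so it can run as a batch job or serving workload on GKE. The following is a minimal sketch; the Artifact Registry path (`us-docker.pkg.dev/PROJECT_ID/REPO/notebook-job`) is a hypothetical placeholder, so substitute your own project, repository, and tag:

   ```sh
   # Build the image locally and push it to Artifact Registry (the path is a placeholder).
   docker build -t us-docker.pkg.dev/PROJECT_ID/REPO/notebook-job:v1 .
   docker push us-docker.pkg.dev/PROJECT_ID/REPO/notebook-job:v1

   # Alternatively, let Cloud Build build and push the image in one step.
   gcloud builds submit --tag us-docker.pkg.dev/PROJECT_ID/REPO/notebook-job:v1 .
   ```
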
1. [Optional] Lint and remove unused code

   Using `pylint` to validate the generated code is a good practice. Pylint can detect unordered `import` statements and unused code, and it provides code readability suggestions.

   To use `pylint`, create a new cell in your notebook, run the code below, and replace `_THE_GENERATED_PYTHON_FILE_` with your filename:

   ```sh
   !pip install pylint
   !pylint _THE_GENERATED_PYTHON_FILE_
   ```

## Use nbconvert to convert the notebook to raw Python

We can convert a Jupyter notebook to a Python script using the nbconvert tool.
The nbconvert tool is available inside your Jupyter notebook environment in Google Colab Enterprise. If you are in another environment and it is not available, it can be installed from [PyPI](https://pypi.org/project/nbconvert/).

1. Run the nbconvert command in your notebook. In this example, we use `gsutil` to copy the notebook into the Colab Enterprise environment (a sketch of the copy command is shown at the end of this step) and then convert it:

   ```sh
   !jupyter nbconvert --to python Fine-tune-Llama-Google-Colab.ipynb
   ```

   Below is an example of the commands:

   ![jupyter-nbconvert](images/jupyter-nbconvert.png)

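   The `gsutil` copy mentioned above might look like the following; the bucket and object path are hypothetical placeholders:

   ```sh
   # Copy the notebook from Cloud Storage into the Colab Enterprise runtime
   # (replace the bucket and path with your own).
   !gsutil cp gs://YOUR_BUCKET/notebooks/Fine-tune-Llama-Google-Colab.ipynb .
   ```
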
Binary file added (+226 KB): docs/guides/packaging-jupyter-notebooks/images/jupyter-generate-requirements.png
Binary file added (+496 KB): docs/guides/packaging-jupyter-notebooks/images/jupyter-gpt-j-online-ipynb.png
Binary file added (+204 KB): docs/guides/packaging-jupyter-notebooks/images/jupyter-gpt-j-online-py.png

docs/platforms/gke-aiml/README.md

@@ -0,0 +1,71 @@
# GKE AI/ML Platform reference architecture for enabling Machine Learning Operations (MLOps)

## Platform Principles

This reference architecture demonstrates how to build a GKE platform that facilitates machine learning. The reference architecture is based on the following principles:

- The platform admin will create the GKE platform using an IaC tool like [Terraform](https://www.terraform.io/). The IaC will come with reusable modules that can be referenced to create more resources as demand grows.
- The platform will be based on [GitOps](https://about.gitlab.com/topics/gitops/).
- After the GKE platform has been created, cluster-scoped resources on it will be created by the admins through [Config Sync](https://cloud.google.com/anthos-config-management/docs/config-sync-overview).
- Platform admins will create a namespace per application and provide the application team members full access to it.
- The namespace-scoped resources will be created by the application/ML teams either via Config Sync or through a deployment tool like [Cloud Deploy](https://cloud.google.com/deploy).

For an outline of products and features used in the platform, see the [Platform Products and Features](products-and-features.md) document.

## Critical User Journeys (CUJs)

### Persona: Platform Admin

- Offer a platform that incorporates established best practices.
- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads.
- Establish secure channels for end users to interact seamlessly with the platform.
- Empower the enforcement of robust security policies across the platform.

### Persona: Machine Learning Engineer

- Deploy the model with ease and make the endpoints available only to the intended audience
- Continuously monitor the model performance and resource utilization
- Troubleshoot any performance or integration issues
- Ability to version, store, and access the models and model artifacts:
  - To debug and troubleshoot in production and trace back to the specific model version and associated training data
  - To roll back quickly and in a controlled way to a previous, more stable version
- Implement the feedback loop to adapt to changing data and business needs:
  - Ability to retrain / fine-tune the model
  - Ability to split traffic between models (A/B testing)
  - Ability to switch between models without breaking the inference system for end users
- Ability to scale the infrastructure up and down to accommodate changing needs
- Ability to share insights and findings with stakeholders to make data-driven decisions

### Persona: Machine Learning Operator

- Provide and maintain software required by the end users of the platform.
- Operationalize experimental workloads by providing guidance and best practices for running them on the platform.
- Deploy the workloads on the platform.
- Assist with enabling observability and monitoring for the workloads to ensure smooth operations.

## Prerequisites

- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com), which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and other tools that are required to complete this tutorial.
- Familiarity with the following:
  - [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine)
  - [Terraform](https://www.terraform.io/)
  - [git](https://git-scm.com/)
  - [Google Configuration Management root-sync](https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields)
  - [Google Configuration Management repo-sync](https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields)
  - [GitHub](https://github.com/)

## Deploy the platform

[Playground Reference Architecture](/platforms/gke-aiml/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.

## Use cases

- [Model Fine Tuning Pipeline](/docs/use-cases/model-fine-tuning-pipeline/README.md)
  - [Distributed Data Processing with Ray](/use-cases/model-fine-tuning-pipeline/data-processing/ray/README.md): Run a distributed data processing job using Ray.
  - [Dataset Preparation for Fine Tuning Gemma IT With Gemini Flash](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md): Generate prompts for fine tuning the Gemma Instruction Tuned model with Vertex AI Gemini Flash.
  - [Fine Tuning Gemma2 9B IT model With FSDP](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md): Fine tune the Gemma2 9B IT model with PyTorch FSDP.
  - [Model evaluation and validation](/use-cases/model-fine-tuning-pipeline/model-eval/README.md): Evaluation and validation of the fine-tuned Gemma2 9B IT model.

## Resources

- [Packaging Jupyter notebooks](/docs/guides/packaging-jupyter-notebooks/README.md): Patterns and tools to get your `ipynb` files ready for deployment in a container runtime.

@@ -0,0 +1,46 @@
# Playground AI/ML Platform on GKE: Architecture

![Playground Architecture](/docs/platforms/gke-aiml/playground/images/architecture.svg)

## Platform

- [Google Cloud Project](https://console.cloud.google.com/cloud-resource-manager)
  - Environment project
  - Service APIs
- [Cloud Storage](https://console.cloud.google.com/storage/browser)
  - Terraform bucket
- [VPC networks](https://console.cloud.google.com/networking/networks/list)
  - VPC network
  - Subnet
- [Cloud Router](https://console.cloud.google.com/hybrid/routers/list)
  - Cloud NAT gateway
- [Google Kubernetes Engine (GKE)](https://console.cloud.google.com/kubernetes/list/overview)
  - Standard Cluster
    - CPU on-demand node pool
    - CPU system node pool
    - GPU on-demand node pool
    - GPU spot node pool
- [Google Kubernetes Engine (GKE) Enterprise](https://cloud.google.com/kubernetes-engine/enterprise/docs)
  - Configuration Management
    - Config Sync
    - Policy Controller
  - Connect gateway
  - Fleet
  - Security posture dashboard
  - Threat detection
- Git repository
  - Config Sync

### Each namespace

- [Load Balancer](https://console.cloud.google.com/net-services/loadbalancing/list/loadBalancers)
  - Gateway External Load Balancer
- [Classic SSL Certificate](https://console.cloud.google.com/security/ccm/list/lbCertificates)
  - Gateway SSL Certificate
    - Ray dashboard
- [Identity-Aware Proxy (IAP)](https://cloud.google.com/iap/docs/concepts-overview)
  - Ray head Backend Service
- [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccount)
  - Default
  - Ray head
  - Ray worker

docs/platforms/gke-aiml/playground/images/architecture.svg: 263 additions & 0 deletions
docs/platforms/gke-aiml/playground/images/architecture_full.svg: 339 additions & 0 deletions