Commit

Restructured repository

arueth committed Sep 17, 2024
1 parent 36b66dc commit 065038f
Showing 168 changed files with 76 additions and 86 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -1,5 +1,6 @@
-# Google Cloud AI/ML Platform Reference Architectures
+# Google Cloud Accelerated Platform Reference Architectures

-This repository is collection of AI/ML platform reference architectures and use cases for Google Cloud.
+This repository is collection of accelerated platform reference architectures and use cases for Google Cloud.

-- [GKE ML Platform for enabling ML Ops](/docs/gke-ml-platform.md)
+- [GKE AI/ML Platform for enabling AI/ML Ops](/docs/platforms/gke-aiml/README.md)
+- [Model Fine Tuning Pipeline](/docs/use-cases/model-fine-tuning-pipeline/README.md)
@@ -17,13 +17,13 @@ This process can also be automated utilizing Continuous Integration (CI) tools s

To pair the notebook, simply use the pair function in the File menu:

-![jupyter-pairing](../images/notebook/jupyter-pairing.png)
+![jupyter-pairing](images/jupyter-pairing.png)

-In this example we use the file [gpt-j-online.ipynb](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/examples/notebooks/gpt-j-online.ipynb):![jupyter-gpt-j-online-ipynb](/docs/images/notebook/jupyter-gpt-j-online-ipynb.png)
+In this example we use the file [gpt-j-online.ipynb](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/examples/notebooks/gpt-j-online.ipynb):![jupyter-gpt-j-online-ipynb](images/jupyter-gpt-j-online-ipynb.png)

1. After pairing, we get the generated raw python:

-![jupyter-gpt-j-online-py](../images/notebook/jupyter-gpt-j-online-py.png)
+![jupyter-gpt-j-online-py](images/jupyter-gpt-j-online-py.png)

**NOTE**: This conversion can also be performed via the `jupytext` cli with the following command:

@@ -45,7 +45,7 @@ This process can also be automated utilizing Continuous Integration (CI) tools s

The following is an example output:

-![jupyter-generate-requirements](../images/notebook/jupyter-generate-requirements.png)
+![jupyter-generate-requirements](images/jupyter-generate-requirements.png)
**NOTE**: (the `!cat requirements.txt` line is an example of the generated `requirements.txt`)

1. Create the Dockerfile
@@ -91,4 +91,5 @@ The nbconvert tool is available inside your Jupyter notebook environment in Goog
```

Below is an example of the commands
-![jupyter-nbconvert](../images/notebook/jupyter-nbconvert.png)
+
+![jupyter-nbconvert](images/jupyter-nbconvert.png)
File renamed without changes
File renamed without changes
File renamed without changes
Binary file removed docs/images/use-case/TensorBoard.png
46 changes: 17 additions & 29 deletions docs/gke-ml-platform.md → docs/platforms/gke-aiml/README.md
@@ -4,13 +4,13 @@

This reference architecture demonstrates how to build a GKE platform that facilitates Machine Learning. The reference architecture is based on the following principles:

-- The platform admin will create the GKE platform using IaC tool like [Terraform][terraform]. The IaC will come with re-usable modules that can be referred to create more resources as the demand grows.
-- The platform will be based on [GitOps][gitops].
-- After the GKE platform has been created, cluster scoped resources on it will be created through [Config Sync][config-sync] by the admins.
+- The platform admin will create the GKE platform using IaC tool like [Terraform](https://www.terraform.io/). The IaC will come with re-usable modules that can be referred to create more resources as the demand grows.
+- The platform will be based on [GitOps](https://about.gitlab.com/topics/gitops/).
+- After the GKE platform has been created, cluster scoped resources on it will be created through [Config Sync](https://cloud.google.com/anthos-config-management/docs/config-sync-overview) by the admins.
- Platform admins will create a namespace per application and provide the application team member full access to it.
-- The namespace scoped resources will be created by the Application/ML teams either via [Config Sync][config-sync] or through a deployment tool like [Cloud Deploy][cloud-deploy]
+- The namespace scoped resources will be created by the Application/ML teams either via Config Sync or through a deployment tool like [Cloud Deploy](https://cloud.google.com/deploy)

-For an outline of products and features used in the platform, see the [Platform Products and Features](/docs/gke-ml-platform/products-and-features.md) document.
+For an outline of products and features used in the platform, see the [Platform Products and Features](products-and-features.md) document.

## Critical User Journeys (CUJs)

@@ -47,36 +47,24 @@ For an outline of products and features used in the platform, see the [Platform

- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com) which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and other tools that are required to complete this tutorial.
- Familiarity with following
-- [Google Kubernetes Engine][gke]
-- [Terraform][terraform]
-- [git][git]
-- [Google Configuration Management root-sync][root-sync]
-- [Google Configuration Management repo-sync][repo-sync]
-- [GitHub][github]
+- [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine)
+- [Terraform](https://www.terraform.io/)
+- [git](https://git-scm.com/)
+- [Google Configuration Management root-sync](https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields)
+- [Google Configuration Management repo-sync](https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields)
+- [GitHub](https://github.com/)

## Deploy the platform

-[Playground Reference Architecture](/examples/platform/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.
+[Playground Reference Architecture](/platforms/gke-aiml/playground/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.

## Use cases

-- [Distributed Data Processing with Ray](/examples/use-case/data-processing/ray/README.md): Run a distributed data processing job using Ray.
-- [Dataset Preparation for Fine Tuning Gemma IT With Gemini Flash](/examples/use-case/data-preparation/gemma-it/README.md): Generate prompts for fine tuning Gemma Instruction Tuned model with Vertex AI Gemini Flash
-- [Fine Tuning Gemma2 9B IT model With FSDP](/examples/use-case/fine-tuning/pytorch/README.md): Fine tune Gemma2 9B IT model with PyTorch FSDP
+- [Model Fine Tuning Pipeline](/docs/use-cases/model-fine-tuning-pipeline/README.md)
+- [Distributed Data Processing with Ray](/use-cases/model-fine-tuning-pipeline/data-processing/ray/README.md): Run a distributed data processing job using Ray.
+- [Dataset Preparation for Fine Tuning Gemma IT With Gemini Flash](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md): Generate prompts for fine tuning Gemma Instruction Tuned model with Vertex AI Gemini Flash
+- [Fine Tuning Gemma2 9B IT model With FSDP](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md): Fine tune Gemma2 9B IT model with PyTorch FSDP

## Resources

-- [Packaging Jupyter notebooks](/docs/notebook/packaging.md): Patterns and tools to get your ipynb's ready for deployment in a container runtime.

-[gitops]: https://about.gitlab.com/topics/gitops/
-[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
-[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
-[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview
-[cloud-deploy]: https://cloud.google.com/deploy?hl=en
-[terraform]: https://www.terraform.io/
-[gke]: https://cloud.google.com/kubernetes-engine?hl=en
-[git]: https://git-scm.com/
-[github]: https://github.com/
-[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects
-[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
-[machine-user-account]: https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts
+- [Packaging Jupyter notebooks](/docs/guides/packaging-jupyter-notebooks/README.md): Patterns and tools to get your ipynb's ready for deployment in a container runtime.
@@ -1,6 +1,6 @@
# Playground Machine learning platform (MLP) on GKE: Architecture

-![Playground Architecture](/docs/images/platform/playground/mlp_playground_architecture.svg)
+![Playground Architecture](/docs/platforms/gke-aiml/playground/images/architecture.svg)

## Platform

@@ -20,7 +20,7 @@
- CPU system node pool
- GPU on-demand node pool
- GPU spot node pool
-- Google Kubernetes Engine (GKE) Enterprise ([docs])(https://cloud.google.com/kubernetes-engine/enterprise/docs)
+- [Google Kubernetes Engine (GKE) Enterprise](https://cloud.google.com/kubernetes-engine/enterprise/docs)
- Configuration Management
- Config Sync
- Policy Controller
@@ -38,7 +38,7 @@
- [Classic SSL Certificate](https://console.cloud.google.com/security/ccm/list/lbCertificates)
- Gateway SSL Certificate
- Ray dashboard
-- Identity-Aware Proxy (IAP) ([docs])(https://cloud.google.com/iap/docs/concepts-overview)
+- [Identity-Aware Proxy (IAP)](https://cloud.google.com/iap/docs/concepts-overview)
- Ray head Backend Service
- [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccount)
- Default
File renamed without changes
@@ -2,7 +2,7 @@

This document outlines the products and features that are used in the platform.

-![Playground Architecture](/docs/images/platform/playground/mlp_playground_architecture_full.svg)
+![Playground Architecture](/docs/platforms/gke-aiml/playground/images/architecture_full.svg)

## Cloud Logging

@@ -278,4 +278,4 @@ For more information see the [Fleet management documentation](https://cloud.goog

Policy Controller enables the application and enforcement of programmable policies for your Kubernetes clusters. These policies act as guardrails and can help with best practices, security, and compliance management of your clusters and fleet. Based on the open source Open Policy Agent Gatekeeper project, Policy Controller is fully integrated with Google Cloud, includes a built-in dashboard, for observability, and comes with a full library of pre-built policies for common security and compliance controls.

-For more information see the [Policy Controller documentation](https://cloud.google.com/kubernetes-engine/enterprise/policy-controller/docs/overview
+For more information see the [Policy Controller documentation](https://cloud.google.com/kubernetes-engine/enterprise/policy-controller/docs/overview)
@@ -17,7 +17,7 @@ Within each topic, we will share methodologies, frameworks, tools, and lessons l

## Process Flow for Implementing the ML Use Case End to End

-![MLOps workflow](/docs/images/use-case/MLOps_e2e.png)
+![MLOps workflow](images/MLOps_e2e.png)

## Data Preprocessing

@@ -27,7 +27,7 @@ The data preprocessing phase in MLOps is foundational. It directly impacts the q

We are leveraging a [pre-crawled public dataset](https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products), taken as a subset (20000) of a bigger dataset (more than 5.8 million products) that was created by extracting data from [Flipkart](https://www.flipkart.com/), a leading Indian eCommerce store.

-![Dataset Snapshot](/docs/images/use-case/dataset_info.png)
+![Dataset Snapshot](images/dataset_info.png)

### Data Preprocessing Steps

@@ -46,7 +46,7 @@ Now, consider the scenario where a preprocessing task involves extracting multip

To tackle this scalability issue, we turn to parallelism. By breaking the dataset into smaller chunks and distributing the processing across multiple threads or processes, we can drastically reduce the overall execution time.

-For implementation steps, please check this document [Distributed Data Preprocessing with Ray](/examples/use-case/data-processing/ray/README.md)
+For implementation steps, please check this document [Distributed Data Preprocessing with Ray](/use-cases/model-fine-tuning-pipeline/data-processing/ray/README.md)
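The chunk-and-distribute pattern described above can be sketched with Python's standard library; this is a stand-in for the Ray implementation, and the per-row function, dataset, and chunk size are all illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_attributes(row: dict) -> dict:
    # Placeholder for the real per-row work (e.g. extracting product attributes).
    return {"id": row["id"], "name_length": len(row["name"])}

def process_chunk(chunk: list) -> list:
    return [extract_attributes(row) for row in chunk]

def chunked(rows: list, size: int) -> list:
    # Break the dataset into smaller chunks.
    return [rows[i:i + size] for i in range(0, len(rows), size)]

rows = [{"id": i, "name": f"product-{i}"} for i in range(10_000)]

# Distribute the chunks across workers; Ray applies the same idea
# across the nodes of a cluster instead of local threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = [row for part in pool.map(process_chunk, chunked(rows, 1_000)) for row in part]

print(len(results))  # → 10000
```

In the actual use case, `process_chunk` would roughly correspond to a Ray task (`@ray.remote`) scheduled across the cluster rather than a local thread pool.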

## Data Preparation

@@ -87,7 +87,7 @@ The 'End Of Sequence' token was appended to each prompt.

EOS_TOKEN = '<eos>'

-For implementation steps, please check this document [Data preparation for fine tuning Gemma IT model](/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/README.md)
+For implementation steps, please check this document [Data preparation for fine tuning Gemma IT model](/use-cases/model-fine-tuning-pipeline/data-preparation/gemma-it/README.md)
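The append step itself can be sketched as follows; the example prompt string is illustrative, not taken from the dataset:

```python
EOS_TOKEN = "<eos>"

def finalize_prompt(prompt: str) -> str:
    # Append the End Of Sequence token so the model learns where a completion ends.
    return prompt + EOS_TOKEN

print(finalize_prompt("Generate a product description for: red running shoes"))
# → Generate a product description for: red running shoes<eos>
```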

## Fine-tuning

@@ -160,9 +160,9 @@ mlflow.autolog()

This will allow MLflow to track fine-tuning parameters, results, and system-level metrics like CPU, memory, and GPU utilization. This added layer of monitoring provides a comprehensive view of batch fine-tuning jobs, making it easier to compare different configurations and results.

-![epoch_vs_loss](/docs/images/use-case/mlflow_epoch_loss.png)
+![epoch_vs_loss](images/mlflow_epoch_loss.png)

-![ml_flow_](/docs/images/use-case/MLFlow_experiment_tracking.png)
+![ml_flow_](images/MLFlow_experiment_tracking.png)

Alternative solutions, such as MLflow and Weights & Biases, offer additional capabilities. While MLflow provides comprehensive pipeline features, our immediate requirements are satisfied by its core tracking functionality.

@@ -214,4 +214,4 @@ Above evaluation provides a granular understanding of the model's performance. B
- Ensure the test dataset accurately reflects the real-world data distribution the model will encounter in production.
- Consider the size of the test dataset to assess the statistical significance of the results.

-For implementation steps, please check this document [Fine tuning Gemma IT model](/examples/use-case/fine-tuning/pytorch/README.md)
+For implementation steps, please check this document [Fine tuning Gemma IT model](/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/README.md)
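A granular evaluation of the kind described above can be sketched as follows; the predictions, labels, and metric names are dummy values standing in for real model outputs scored against the test dataset:

```python
from collections import Counter

def evaluate(predictions: list, labels: list) -> dict:
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    # Per-class error counts give the granular view, beyond a single accuracy number.
    errors = Counter(y for p, y in zip(predictions, labels) if p != y)
    return {"accuracy": correct / len(labels), "errors_by_class": dict(errors)}

preds  = ["shoes", "watch", "shoes", "bag"]
labels = ["shoes", "watch", "bag",   "bag"]
print(evaluate(preds, labels))
# → {'accuracy': 0.75, 'errors_by_class': {'bag': 1}}
```

Breaking errors down by class in this way helps check whether the test dataset reflects the real-world distribution and whether any single category dominates the failures.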
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes.
@@ -6,9 +6,9 @@ This quick-start deployment guide can be used to set up an environment to famili

## Architecture

-For more information about the architecture, see the [Playground Machine learning platform (MLP) on GKE: Architecture](/docs/gke-ml-platform/playground/architecture.md) document.
+For more information about the architecture, see the [Playground Machine learning platform (MLP) on GKE: Architecture](/docs/platforms/gke-aiml/playground/architecture.md) document.

-For an outline of products and features used in the platform, see the [Platform Products and Features](/docs/gke-ml-platform/products-and-features.md) document.
+For an outline of products and features used in the platform, see the [Platform Products and Features](/docs/platforms/gke-aiml/products-and-features.md) document.

## Requirements

@@ -53,7 +53,7 @@ The default quota given to a project should be sufficient for this guide.
```

```
-cd examples/platform/playground && \
+cd platforms/gke-aiml/playground && \
export MLP_TYPE_BASE_DIR=$(pwd) && \
sed -n -i -e '/^export MLP_TYPE_BASE_DIR=/!p' -i -e '$aexport MLP_TYPE_BASE_DIR="'"${MLP_TYPE_BASE_DIR}"'"' ${HOME}/.bashrc
```
@@ -249,7 +249,7 @@ See the [Configuring the OAuth consent screen documentation](https://developers.
- Click **SAVE AND CONTINUE**
- On the **Summary** page, click **BACK TO DASHBOARD**
- The **OAuth consent screen** should now look like this:
-![oauth consent screen](/docs/images/platform/oauth-consent-screen.png)
+![oauth consent screen](/docs/platforms/gke-aiml/playground/images/oauth-consent-screen.png)

### Default IAP access

@@ -320,7 +320,7 @@ Before running Terraform, make sure that the Service Usage API is enable.

- Go to Google Cloud Console, click on the navigation menu and click on [Kubernetes Engine](https://console.cloud.google.com/kubernetes) > [Config](https://console.cloud.google.com/kubernetes/config_management/dashboard).
If you haven't enabled GKE Enterprise in the project earlier, Click `LEARN AND ENABLE` button and then `ENABLE GKE ENTERPRISE`. You should see a RootSync and RepoSync object.
-![configsync](/docs/images/platform/configsync.png)
+![configsync](/docs/platforms/gke-aiml/playground/images/configsync.png)

### Software installed via RepoSync and RootSync

@@ -495,8 +495,8 @@ You only need to complete the section for the option that you have selected.
```
cd ${MLP_BASE_DIR} && \
git restore \
-examples/platform/playground/backend.tf \
-examples/platform/playground/mlp.auto.tfvars \
+platforms/gke-aiml/playground/backend.tf \
+platforms/gke-aiml/playground/mlp.auto.tfvars \
terraform/features/initialize/backend.tf \
terraform/features/initialize/backend.tf.bucket \
terraform/features/initialize/initialize.auto.tfvars
@@ -507,8 +507,8 @@
```
cd ${MLP_BASE_DIR} && \
rm -rf \
-examples/platform/playground/.terraform \
-examples/platform/playground/.terraform.lock.hcl \
+platforms/gke-aiml/playground/.terraform \
+platforms/gke-aiml/playground/.terraform.lock.hcl \
terraform/features/initialize/.terraform \
terraform/features/initialize/.terraform.lock.hcl \
terraform/features/initialize/backend.tf.local \
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
8 changes: 4 additions & 4 deletions test/scripts/helpers/byop_playground_cleanup.sh
@@ -24,10 +24,10 @@ gcloud storage buckets delete gs://${MLP_STATE_BUCKET} --project ${MLP_PROJECT_I
echo_title "Cleaning up local repository changes"
cd ${MLP_BASE_DIR} &&
git restore \
-examples/platform/playground/backend.tf \
-examples/platform/playground/mlp.auto.tfvars
+platforms/gke-aiml/playground/backend.tf \
+platforms/gke-aiml/playground/mlp.auto.tfvars

cd ${MLP_BASE_DIR} &&
rm -rf \
-examples/platform/playground/${TF_DATA_DIR} \
-examples/platform/playground/.terraform.lock.hcl
+platforms/gke-aiml/playground/${TF_DATA_DIR} \
+platforms/gke-aiml/playground/.terraform.lock.hcl
8 changes: 4 additions & 4 deletions test/scripts/helpers/new_gh_playground_cleanup.sh
@@ -18,16 +18,16 @@ echo_title "Cleaning up local repository changes"

print_and_execute_no_check "cd ${MLP_BASE_DIR} &&
git restore \
-examples/platform/playground/backend.tf \
-examples/platform/playground/mlp.auto.tfvars \
+platforms/gke-aiml/playground/backend.tf \
+platforms/gke-aiml/playground/mlp.auto.tfvars \
terraform/features/initialize/backend.tf \
terraform/features/initialize/backend.tf.bucket \
terraform/features/initialize/initialize.auto.tfvars"

print_and_execute_no_check "cd ${MLP_BASE_DIR} &&
rm -rf \
-examples/platform/playground/.terraform \
-examples/platform/playground/.terraform.lock.hcl \
+platforms/gke-aiml/playground/.terraform \
+platforms/gke-aiml/playground/.terraform.lock.hcl \
terraform/features/initialize/.terraform \
terraform/features/initialize/.terraform.lock.hcl \
terraform/features/initialize/backend.tf.local \