Commit

publish cn-ml-on-eks
chrisnegus authored and geoffcline committed Nov 7, 2024
1 parent 34ea7c6 commit d2b73c6
Showing 19 changed files with 344 additions and 172 deletions.
2 changes: 2 additions & 0 deletions latest/ug/book.adoc
@@ -74,6 +74,8 @@ include::connector/eks-connector.adoc[leveloffset=+1]

include::outposts/eks-outposts.adoc[leveloffset=+1]

include::ml/machine-learning-on-eks.adoc[leveloffset=+1]

include::related-projects.adoc[leveloffset=+1]

include::roadmap.adoc[leveloffset=+1]
12 changes: 0 additions & 12 deletions latest/ug/integrations/deep-learning-containers.adoc

This file was deleted.

3 changes: 0 additions & 3 deletions latest/ug/integrations/eks-integrations.adoc
@@ -22,9 +22,6 @@ In addition to the services covered in other sections, Amazon EKS works with mor
include::creating-resources-with-cloudformation.adoc[leveloffset=+1]


include::deep-learning-containers.adoc[leveloffset=+1]


include::integration-detective.adoc[leveloffset=+1]


latest/ug/ml/capacity-blocks.adoc
@@ -3,7 +3,7 @@ include::../attributes.txt[]
[.topic]
[[capacity-blocks,capacity-blocks.title]]
= Create self-managed nodes with Capacity Blocks for ML
:info_titleabbrev: Capacity Blocks for ML
:info_titleabbrev: Reserve GPUs

[abstract]
--
@@ -46,7 +46,6 @@ Make sure the `LaunchTemplateData` includes the following:

+
The following is an excerpt of a CloudFormation template that creates a launch template targeting a Capacity Block.
+
[source,yaml,subs="verbatim,attributes,quotes"]
----
NodeLaunchTemplate:
@@ -67,7 +66,6 @@
- sg-05b1d815d1EXAMPLE
UserData: user-data
----
+
You must pass the subnet in the Availability Zone in which the reservation is made because Capacity Blocks are zonal.
. Use the launch template to create a self-managed node group. If you're doing this prior to the capacity reservation becoming active, then set the desired capacity to `0`. When creating the node group, make sure that you are only specifying the respective subnet for the Availability Zone in which the capacity is reserved.
+
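As a minimal sketch of this step (the names and IDs are hypothetical), you could create the Auto Scaling group for the self-managed node group from the launch template with the {aws} CLI, starting with a desired capacity of `0`:
+
[source,bash,subs="verbatim,attributes,quotes"]
----
# Create the node Auto Scaling group from the launch template,
# starting at 0 until the Capacity Block reservation is active.
# Pass only the subnet in the Availability Zone of your reservation.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name capacity-block-nodes \
  --launch-template LaunchTemplateId=lt-0123456789EXAMPLE,Version=1 \
  --min-size 0 \
  --max-size 2 \
  --desired-capacity 0 \
  --vpc-zone-identifier subnet-0123456789EXAMPLE
----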
latest/ug/ml/inferentia-support.adoc
@@ -1,15 +1,14 @@
//!!NODE_ROOT <section>
include::../attributes.txt[]

[.topic]
[[inferentia-support,inferentia-support.title]]
= Deploy [.noloc]`ML` inference workloads with {aws}[.noloc]`Inferentia` on Amazon EKS
= Use {aws} [.noloc]`Inferentia` workloads with Amazon EKS for Machine Learning
:info_doctype: section
:info_title: Deploy ML inference workloads with AWSInferentia on Amazon EKS
:info_titleabbrev: Machine learning inference
:info_title: Use {aws} Inferentia workloads with your EKS cluster for Machine Learning
:info_titleabbrev: Create {aws} Inferentia cluster
:info_abstract: Learn how to create an Amazon EKS cluster with nodes running Amazon EC2 Inf1 instances for machine learning inference using {aws} Inferentia chips and deploy a TensorFlow Serving application.

include::../attributes.txt[]

[abstract]
--
Learn how to create an Amazon EKS cluster with nodes running Amazon EC2 Inf1 instances for machine learning inference using {aws} Inferentia chips and deploy a TensorFlow Serving application.
68 changes: 68 additions & 0 deletions latest/ug/ml/machine-learning-on-eks.adoc
@@ -0,0 +1,68 @@
//!!NODE_ROOT <chapter>
include::../attributes.txt[]
[.topic]
[[machine-learning-on-eks,machine-learning-on-eks.title]]
= Overview of Machine Learning on Amazon EKS
:doctype: book
:sectnums:
:toc: left
:icons: font
:experimental:
:idprefix:
:idseparator: -
:sourcedir: .
:info_doctype: chapter
:info_title: Machine Learning on Amazon EKS Overview
:info_titleabbrev: Machine Learning on EKS
:keywords: Machine Learning, Amazon EKS, Artificial Intelligence
:info_abstract: Learn to manage containerized applications with Amazon EKS

[abstract]
--
Complete guide for running Machine Learning applications on Amazon EKS. This includes everything from provisioning infrastructure to choosing and deploying Machine Learning workloads on Amazon EKS.
--

[[ml-features,ml-features.title]]

Machine Learning (ML) is an area of Artificial Intelligence (AI) in which machines process large amounts of data to find patterns and make connections in the data. This can expose new relationships and help predict outcomes that might not otherwise have been apparent.

For large-scale ML projects, data centers must be able to store large amounts of data, process data quickly, and integrate data from many sources. The platforms running ML applications must be reliable and secure, but also offer resiliency to recover from data center outages and application failures. Amazon Elastic Kubernetes Service (Amazon EKS), running in the {aws} cloud, is particularly suited for ML workloads.

The primary goal of this section of the EKS User Guide is to help you put together the hardware and software components to build platforms for running Machine Learning workloads in an EKS cluster.
We start by explaining the features and services available to you in EKS and the {aws} cloud, then provide you with tutorials to help you work with ML platforms, frameworks, and models.

=== Advantages of Machine Learning on EKS and the {aws} cloud

Amazon Elastic Kubernetes Service (EKS) is a powerful, managed Kubernetes platform that has become a cornerstone for deploying and managing AI/ML workloads in the cloud. With its ability to handle complex, resource-intensive tasks, Amazon EKS provides a scalable and flexible foundation for running AI/ML models, making it an ideal choice for organizations aiming to harness the full potential of machine learning.

Key advantages of AI/ML platforms on Amazon EKS include:

* *Scalability and Flexibility*
Amazon EKS enables organizations to scale AI/ML workloads seamlessly. Whether you're training large language models that require vast amounts of compute power or deploying inference pipelines that need to handle unpredictable traffic patterns, EKS scales up and down efficiently, optimizing resource use and cost.

* *High Performance with GPUs and Neuron Instances*
Amazon EKS supports a wide range of compute options, including GPUs and {aws} Neuron instances, which are essential for accelerating AI/ML workloads. This support allows for high-performance training and low-latency inference, ensuring that models run efficiently in production environments.

* *Integration with AI/ML Tools*
Amazon EKS integrates seamlessly with popular AI/ML tools and frameworks like TensorFlow, PyTorch, and Ray, providing a familiar and robust ecosystem for data scientists and engineers. These integrations enable users to leverage existing tools while benefiting from the scalability and management capabilities of Kubernetes.

* *Automation and Management*
Kubernetes on Amazon EKS automates many of the operational tasks associated with managing AI/ML workloads. Features like automatic scaling, rolling updates, and self-healing ensure that your applications remain highly available and resilient, reducing the overhead of manual intervention.

* *Security and Compliance*
Running AI/ML workloads on Amazon EKS provides robust security features, including fine-grained IAM roles, encryption, and network policies, ensuring that sensitive data and models are protected. EKS also adheres to various compliance standards, making it suitable for enterprises with strict regulatory requirements.

=== Why Choose Amazon EKS for AI/ML?
Amazon EKS offers a comprehensive, managed environment that simplifies the deployment of AI/ML models while providing the performance, scalability, and security needed for production workloads. With its ability to integrate with a variety of AI/ML tools and its support for advanced compute resources, EKS empowers organizations to accelerate their AI/ML initiatives and deliver innovative solutions at scale.

By choosing Amazon EKS, you gain access to a robust infrastructure that can handle the complexities of modern AI/ML workloads, allowing you to focus on innovation and value creation rather than managing underlying systems. Whether you are deploying simple models or complex AI systems, Amazon EKS provides the tools and capabilities needed to succeed in a competitive and rapidly evolving field.

=== Start using Machine Learning on EKS

To begin planning for and using Machine Learning platforms and workloads on EKS on the {aws} cloud, proceed to the <<ml-get-started>> section.

include::ml-get-started.adoc[leveloffset=+1]

include::ml-prepare-for-cluster.adoc[leveloffset=+1]

include::ml-tutorials.adoc[leveloffset=+1]
87 changes: 87 additions & 0 deletions latest/ug/ml/ml-eks-optimized-ami.adoc
@@ -0,0 +1,87 @@
//!!NODE_ROOT <section>
[.topic]
[[ml-eks-optimized-ami,ml-eks-optimized-ami.title]]
= Create nodes with EKS optimized accelerated Amazon Linux AMIs
:info_titleabbrev: Run GPU AMIs

include::../attributes.txt[]

The Amazon EKS optimized accelerated Amazon Linux AMI is built on top of the standard Amazon EKS optimized Amazon Linux AMI. For details on these AMIs, see <<gpu-ami>>.
The following text describes how to enable {aws} Neuron-based workloads.

.To enable {aws} Neuron (ML accelerator) based workloads
For details on training and inference workloads using [.noloc]`Neuron` in Amazon EKS, see the following references:

* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Containers - Kubernetes - Getting Started] in the _{aws} [.noloc]`Neuron` Documentation_
* https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/README.md#training[Training] in {aws} [.noloc]`Neuron` EKS Samples on GitHub
* <<inferentia-support,Deploy ML inference workloads with {aws} Inferentia on Amazon EKS>>

The following procedure describes how to run a workload on a GPU-based instance with the Amazon EKS optimized accelerated AMI.

. After your GPU nodes join your cluster, you must apply the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA device plugin for Kubernetes] as a [.noloc]`DaemonSet` on your cluster. Replace [.replaceable]`vX.X.X` with your desired https://github.com/NVIDIA/k8s-device-plugin/releases[NVIDIA/k8s-device-plugin] version before running the following command.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml
----
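+
Optionally, confirm that the device plugin [.noloc]`DaemonSet` is running before continuing. The `nvidia-device-plugin-daemonset` name below reflects the upstream manifest and is shown as an assumption:
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system
----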
. You can verify that your nodes have allocatable GPUs with the following command.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
----
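+
For example, on a cluster with two GPU nodes, the output might look like the following (the node names here are hypothetical).
+
[source,bash,subs="verbatim,attributes,quotes"]
----
NAME                                           GPU
ip-192-168-52-100.us-west-2.compute.internal   1
ip-192-168-72-145.us-west-2.compute.internal   1
----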
. Create a file named `nvidia-smi.yaml` with the following contents. Replace [.replaceable]`tag` with your desired tag for https://hub.docker.com/r/nvidia/cuda/tags[nvidia/cuda]. This manifest launches an https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] container that runs `nvidia-smi` on a node.
+
[source,yaml,subs="verbatim,attributes,quotes"]
----
apiVersion: v1
kind: Pod
metadata:
name: nvidia-smi
spec:
restartPolicy: OnFailure
containers:
- name: nvidia-smi
image: nvidia/cuda:tag
args:
- "nvidia-smi"
resources:
limits:
nvidia.com/gpu: 1
----
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl apply -f nvidia-smi.yaml
----
. After the [.noloc]`Pod` has finished running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl logs nvidia-smi
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
Mon Aug 6 20:23:31 20XX
+-----------------------------------------------------------------------------+
| NVIDIA-SMI XXX.XX Driver Version: XXX.XX |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 46C P0 47W / 300W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
----
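+
When you're finished, you can delete the test [.noloc]`Pod` (a suggested cleanup step):
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl delete -f nvidia-smi.yaml
----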


51 changes: 51 additions & 0 deletions latest/ug/ml/ml-get-started.adoc
@@ -0,0 +1,51 @@
//!!NODE_ROOT <section>

[.topic]
[[ml-get-started,ml-get-started.title]]
= Get started with ML
:info_doctype: section
:info_title: Get started deploying Machine Learning tools on EKS
:info_titleabbrev: Get started with ML
:info_abstract: Choose the Machine Learning on EKS tools and platforms that best suit your needs, then use quick start procedures to deploy them to the {aws} cloud.

include::../attributes.txt[]


[abstract]
--
Choose the Machine Learning on EKS tools and platforms that best suit your needs, then use quick start procedures to deploy ML workloads and EKS clusters to the {aws} cloud.
--

To jump into Machine Learning on EKS, start by choosing from these prescriptive patterns to quickly get an EKS cluster, and its ML software and hardware, ready to run ML workloads. Most of these patterns are based on Terraform blueprints that are available from the https://awslabs.github.io/data-on-eks/docs/introduction/intro[Data on Amazon EKS] site. Before you begin, here are a few things to keep in mind:

* GPUs or Neuron instances are required to run these procedures. If those resources are unavailable, these procedures can fail during cluster creation or node autoscaling, so it's worth checking that your target instance type is offered in your Region first, as sketched after this list.
* The Neuron SDK (used with Trainium- and Inferentia-based instances) can save money, and those instances are often more readily available than NVIDIA GPUs. So, when your workloads permit it, we recommend that you consider using Neuron for your Machine Learning workloads (see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/[Welcome to {aws} Neuron]).
* Some of the getting started experiences here require that you get data via your own https://huggingface.co/[Hugging Face] account.
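
For example, the following {aws} CLI command (a minimal sketch; the instance type, Region, and query are illustrative) lists the Availability Zones in a Region where a given accelerated instance type is offered:

[source,bash,subs="verbatim,attributes,quotes"]
----
# List Availability Zones in us-west-2 that offer inf2.xlarge instances.
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=inf2.xlarge \
  --region us-west-2 \
  --query "InstanceTypeOfferings[].Location" \
  --output text
----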

To get started, choose from the following patterns, each designed to help you set up infrastructure to run your Machine Learning workloads:

* *https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/jupyterhub[JupyterHub on EKS]*: Explore the https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/jupyterhub[JupyterHub blueprint], which showcases Time Slicing and MIG features, as well as multi-tenant configurations with profiles. This is ideal for deploying large-scale JupyterHub platforms on EKS.
* *https://aws.amazon.com/ai/machine-learning/neuron/[Large Language Models on {aws} Neuron and RayServe]*: Use https://aws.amazon.com/ai/machine-learning/neuron/[{aws} Neuron] to run large language models (LLMs) on Amazon EKS and {aws} Trainium and {aws} Inferentia accelerators. See https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/Neuron/vllm-ray-inf2[Serving LLMs with RayServe and vLLM on {aws} Neuron] for instructions on setting up a platform for making inference requests, with components that include:
+
** {aws} Neuron SDK toolkit for deep learning
** {aws} Inferentia and Trainium accelerators
** vLLM, a high-throughput inference and serving engine for LLMs (see the https://docs.vllm.ai/en/latest/[vLLM] documentation site)
** RayServe scalable model serving library (see the https://docs.ray.io/en/latest/serve/index.html[Ray Serve: Scalable and Programmable Serving] site)
** Llama-3 language model, using your own https://huggingface.co/[Hugging Face] account.
** Observability with {aws} CloudWatch and Neuron Monitor
** Open WebUI
* *https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer[Large Language Models on NVIDIA and Triton]*: Deploy multiple large language models (LLMs) on Amazon EKS and NVIDIA GPUs. See https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer[Deploying Multiple Large Language Models with NVIDIA Triton Server and vLLM] for instructions for setting up a platform for making inference requests, with components that include:
+
** NVIDIA Triton Inference Server (see the https://github.com/triton-inference-server/server[Triton Inference Server] GitHub site)
** vLLM, a high-throughput inference and serving engine for LLMs (see the https://docs.vllm.ai/en/latest/[vLLM] documentation site)
** Two language models: mistralai/Mistral-7B-Instruct-v0.2 and meta-llama/Llama-2-7b-chat-hf, using your own https://huggingface.co/[Hugging Face] account.

=== Continuing with ML on EKS

Along with choosing from the blueprints described on this page, there are other ways you can proceed through the ML on EKS documentation if you prefer. For example, you can:

* *Try tutorials for ML on EKS* – Run other end-to-end tutorials for building and running your own Machine Learning models on EKS. See <<ml-tutorials>>.

To improve your work with ML on EKS, refer to the following:

* *Prepare for ML* – Learn how to prepare for ML on EKS with features like custom AMIs and GPU reservations. See <<ml-prepare-for-cluster>>.
44 changes: 44 additions & 0 deletions latest/ug/ml/ml-prepare-for-cluster.adoc
@@ -0,0 +1,44 @@
//!!NODE_ROOT <section>

[.topic]
[[ml-prepare-for-cluster,ml-prepare-for-cluster.title]]
= Prepare for ML clusters
:info_doctype: section
:info_title: Prepare to create an EKS cluster for Machine Learning
:info_titleabbrev: Prepare for ML
:info_abstract: Learn how to make decisions about CPU, AMIs, and tooling before creating an EKS cluster for ML.

include::../attributes.txt[]


[abstract]
--
Learn how to make decisions about CPU, AMIs, and tooling before creating an EKS cluster for ML.
--

You can enhance your Machine Learning on EKS experience in several ways.
The pages in this section will help you:

* Understand your choices for using ML on EKS, and
* Prepare your EKS and ML environment.

In particular, they will help you:

* *Choose AMIs*: {aws} offers multiple customized AMIs for running ML workloads on EKS, and you can further modify them to add other software and drivers needed for your particular use cases. See <<ml-eks-optimized-ami>>.
* *Reserve GPUs*: Because demand for GPUs is high, you can reserve the GPUs you need in advance to ensure they are available when you need them. See <<capacity-blocks>>.
* *Taint GPU nodes*: Use taints and tolerations to steer only accelerator workloads onto your GPU nodes, as shown in the sketch after this list. See <<node-taints-managed-node-groups>>.
* *Add EFA*: Add the Elastic Fabric Adapter to improve network performance for inter-node cluster communications. See <<node-efa>>.
* *Use {aws} Inferentia workloads*: Create an EKS cluster with Amazon EC2 Inf1 instances. See <<inferentia-support>>.
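
As a minimal sketch of the tainting approach (the cluster and node group names are hypothetical), you can add a GPU taint to an existing managed node group with the {aws} CLI:

[source,bash,subs="verbatim,attributes,quotes"]
----
# Keep workloads off GPU nodes unless they tolerate this taint.
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-gpu-nodes \
  --taints 'addOrUpdateTaints=[{key=nvidia.com/gpu,value=true,effect=NO_SCHEDULE}]'
----

Pods that need an accelerator then declare a matching toleration along with an `nvidia.com/gpu` resource request, while other workloads are kept off the GPU nodes.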
[.topiclist]
[[Topic List]]

include::ml-eks-optimized-ami.adoc[leveloffset=+1]

include::capacity-blocks.adoc[leveloffset=+1]

include::node-taints-managed-node-groups.adoc[leveloffset=+1]

include::node-efa.adoc[leveloffset=+1]

include::inferentia-support.adoc[leveloffset=+1]