diff --git a/latest/ug/book.adoc b/latest/ug/book.adoc index 6315faee..41a51748 100644 --- a/latest/ug/book.adoc +++ b/latest/ug/book.adoc @@ -74,6 +74,8 @@ include::connector/eks-connector.adoc[leveloffset=+1] include::outposts/eks-outposts.adoc[leveloffset=+1] +include::ml/machine-learning-on-eks.adoc[leveloffset=+1] + include::related-projects.adoc[leveloffset=+1] include::roadmap.adoc[leveloffset=+1] diff --git a/latest/ug/integrations/deep-learning-containers.adoc b/latest/ug/integrations/deep-learning-containers.adoc deleted file mode 100644 index dded2f96..00000000 --- a/latest/ug/integrations/deep-learning-containers.adoc +++ /dev/null @@ -1,12 +0,0 @@ -//!!NODE_ROOT
-include::../attributes.txt[] - -[.topic] -[[deep-learning-containers,deep-learning-containers.title]] -= Train and serve TensorFlow models on EKS with Deep Learning Containers -:info_doctype: section -:info_title: Train and serve TensorFlow models on EKS with Deep Learning Containers - -{aws} Deep Learning Containers are a set of [.noloc]`Docker` images for training and serving models in TensorFlow on Amazon EKS and Amazon Elastic Container Service (Amazon ECS). Deep Learning Containers provide optimized environments with https://www.tensorflow.org/[TensorFlow], https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] (for GPU instances), and https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html[Intel MKL] (for CPU instances) libraries and are available in Amazon ECR. - -To get started using {aws} Deep Learning Containers on Amazon EKS, see link:dlami/latest/devguide/deep-learning-containers-eks.html[Amazon EKS Setup,type="documentation"] in the _{aws} Deep Learning Containers Developer Guide_. \ No newline at end of file diff --git a/latest/ug/integrations/eks-integrations.adoc b/latest/ug/integrations/eks-integrations.adoc index 5257fbe3..50107196 100644 --- a/latest/ug/integrations/eks-integrations.adoc +++ b/latest/ug/integrations/eks-integrations.adoc @@ -22,9 +22,6 @@ In addition to the services covered in other sections, Amazon EKS works with mor include::creating-resources-with-cloudformation.adoc[leveloffset=+1] -include::deep-learning-containers.adoc[leveloffset=+1] - - include::integration-detective.adoc[leveloffset=+1] diff --git a/latest/ug/nodes/capacity-blocks.adoc b/latest/ug/ml/capacity-blocks.adoc similarity index 99% rename from latest/ug/nodes/capacity-blocks.adoc rename to latest/ug/ml/capacity-blocks.adoc index 3dfce3e7..578c02cc 100644 --- a/latest/ug/nodes/capacity-blocks.adoc +++ b/latest/ug/ml/capacity-blocks.adoc @@ -3,7 +3,7 @@ include::../attributes.txt[] [.topic] [[capacity-blocks,capacity-blocks.title]] = Create self-managed nodes with Capacity Blocks for ML -:info_titleabbrev: Capacity Blocks for ML +:info_titleabbrev: Reserve GPUs [abstract] -- @@ -46,7 +46,6 @@ Make sure the `LaunchTemplateData` includes the following: + The following is an excerpt of a CloudFormation template that creates a launch template targeting a Capacity Block. -+ [source,yaml,subs="verbatim,attributes,quotes"] ---- NodeLaunchTemplate: @@ -67,7 +66,6 @@ NodeLaunchTemplate: - sg-05b1d815d1EXAMPLE UserData: user-data ---- -+ You must pass the subnet in the Availability Zone in which the reservation is made because Capacity Blocks are zonal. . Use the launch template to create a self-managed node group. If you're doing this prior to the capacity reservation becoming active, then set the desired capacity to `0`. When creating the node group, make sure that you are only specifying the respective subnet for the Availability Zone in which the capacity is reserved. + diff --git a/latest/ug/workloads/inferentia-support.adoc b/latest/ug/ml/inferentia-support.adoc similarity index 98% rename from latest/ug/workloads/inferentia-support.adoc rename to latest/ug/ml/inferentia-support.adoc index 81c05cfe..2db07629 100644 --- a/latest/ug/workloads/inferentia-support.adoc +++ b/latest/ug/ml/inferentia-support.adoc @@ -1,15 +1,14 @@ //!!NODE_ROOT
+include::../attributes.txt[]
[.topic]
[[inferentia-support,inferentia-support.title]]
-= Deploy [.noloc]`ML` inference workloads with {aws}[.noloc]`Inferentia` on Amazon EKS
+= Use {aws} [.noloc]`Inferentia` workloads with Amazon EKS for Machine Learning
:info_doctype: section
-:info_title: Deploy ML inference workloads with AWSInferentia on Amazon EKS
-:info_titleabbrev: Machine learning inference
+:info_title: Use {aws} Inferentia workloads with your EKS cluster for Machine Learning
+:info_titleabbrev: Create {aws} Inferentia cluster
:info_abstract: Learn how to create an Amazon EKS cluster with nodes running Amazon EC2 Inf1 instances for machine learning inference using {aws} Inferentia chips and deploy a TensorFlow Serving application.
-include::../attributes.txt[]
-
[abstract]
--
Learn how to create an Amazon EKS cluster with nodes running Amazon EC2 Inf1 instances for machine learning inference using {aws} Inferentia chips and deploy a TensorFlow Serving application.
diff --git a/latest/ug/ml/machine-learning-on-eks.adoc b/latest/ug/ml/machine-learning-on-eks.adoc
new file mode 100644
index 00000000..6c1916ec
--- /dev/null
+++ b/latest/ug/ml/machine-learning-on-eks.adoc
@@ -0,0 +1,68 @@
+//!!NODE_ROOT
+include::../attributes.txt[]
+[.topic]
+[[machine-learning-on-eks,machine-learning-on-eks.title]]
+= Overview of Machine Learning on Amazon EKS
+:doctype: book
+:sectnums:
+:toc: left
+:icons: font
+:experimental:
+:idprefix:
+:idseparator: -
+:sourcedir: .
+:info_doctype: chapter
+:info_title: Machine Learning on Amazon EKS Overview
+:info_titleabbrev: Machine Learning on EKS
+:keywords: Machine Learning, Amazon EKS, Artificial Intelligence
+:info_abstract: Learn to run Machine Learning applications on Amazon EKS
+
+[abstract]
+--
+Complete guide for running Machine Learning applications on Amazon EKS. This includes everything from provisioning infrastructure to choosing and deploying Machine Learning workloads on Amazon EKS.
+--
+
+[[ml-features,ml-features.title]]
+
+Machine Learning (ML) is an area of Artificial Intelligence (AI) where machines process large amounts of data to look for patterns and make connections between the data. This can expose new relationships and help predict outcomes that might not have been apparent otherwise.
+
+For large-scale ML projects, data centers must be able to store large amounts of data, process data quickly, and integrate data from many sources. The platforms running ML applications must be reliable and secure, but also offer resiliency to recover from data center outages and application failures. Amazon Elastic Kubernetes Service (Amazon EKS), running in the {aws} cloud, is particularly well suited for ML workloads.
+
+The primary goal of this section of the EKS User Guide is to help you put together the hardware and software components to build platforms for running Machine Learning workloads in an EKS cluster.
+We start by explaining the features and services available to you in EKS and the {aws} cloud, then provide you with tutorials to help you work with ML platforms, frameworks, and models.
+
+=== Advantages of Machine Learning on EKS and the {aws} cloud
+
+Amazon Elastic Kubernetes Service (EKS) is a powerful, managed Kubernetes platform that has become a cornerstone for deploying and managing AI/ML workloads in the cloud.
With its ability to handle complex, resource-intensive tasks, Amazon EKS provides a scalable and flexible foundation for running AI/ML models, making it an ideal choice for organizations aiming to harness the full potential of machine learning.
+
+Key advantages of AI/ML platforms on Amazon EKS include:
+
+* *Scalability and Flexibility*
+Amazon EKS enables organizations to scale AI/ML workloads seamlessly. Whether you're training large language models that require vast amounts of compute power or deploying inference pipelines that need to handle unpredictable traffic patterns, EKS scales up and down efficiently, optimizing resource use and cost.
+
+* *High Performance with GPUs and Neuron Instances*
+Amazon EKS supports a wide range of compute options, including GPUs and {aws} Neuron instances, which are essential for accelerating AI/ML workloads. This support allows for high-performance training and low-latency inference, ensuring that models run efficiently in production environments.
+
+* *Integration with AI/ML Tools*
+Amazon EKS integrates seamlessly with popular AI/ML tools and frameworks like TensorFlow, PyTorch, and Ray, providing a familiar and robust ecosystem for data scientists and engineers. These integrations enable users to leverage existing tools while benefiting from the scalability and management capabilities of Kubernetes.
+
+* *Automation and Management*
+Kubernetes on Amazon EKS automates many of the operational tasks associated with managing AI/ML workloads. Features like automatic scaling, rolling updates, and self-healing ensure that your applications remain highly available and resilient, reducing the overhead of manual intervention.
+
+* *Security and Compliance*
+Running AI/ML workloads on Amazon EKS provides robust security features, including fine-grained IAM roles, encryption, and network policies, ensuring that sensitive data and models are protected. EKS also adheres to various compliance standards, making it suitable for enterprises with strict regulatory requirements.
+
+=== Why Choose Amazon EKS for AI/ML?
+Amazon EKS offers a comprehensive, managed environment that simplifies the deployment of AI/ML models while providing the performance, scalability, and security needed for production workloads. With its ability to integrate with a variety of AI/ML tools and its support for advanced compute resources, EKS empowers organizations to accelerate their AI/ML initiatives and deliver innovative solutions at scale.
+
+By choosing Amazon EKS, you gain access to a robust infrastructure that can handle the complexities of modern AI/ML workloads, allowing you to focus on innovation and value creation rather than managing underlying systems. Whether you are deploying simple models or complex AI systems, Amazon EKS provides the tools and capabilities needed to succeed in a competitive and rapidly evolving field.
+
+=== Start using Machine Learning on EKS
+
+To begin planning for and using Machine Learning platforms and workloads on EKS on the {aws} cloud, proceed to the <<ml-get-started>> section.
+
+include::ml-get-started.adoc[leveloffset=+1]
+
+include::ml-prepare-for-cluster.adoc[leveloffset=+1]
+
+include::ml-tutorials.adoc[leveloffset=+1]
diff --git a/latest/ug/ml/ml-eks-optimized-ami.adoc b/latest/ug/ml/ml-eks-optimized-ami.adoc
new file mode 100644
index 00000000..b3784727
--- /dev/null
+++ b/latest/ug/ml/ml-eks-optimized-ami.adoc
@@ -0,0 +1,87 @@
+//!!NODE_ROOT
+[.topic] +[[ml-eks-optimized-ami,ml-eks-optimized-ami.title]] += Create nodes with EKS optimized accelerated Amazon Linux AMIs +:info_titleabbrev: Run GPU AMIs + +include::../attributes.txt[] + +The Amazon EKS optimized accelerated Amazon Linux AMI is built on top of the standard Amazon EKS optimized Amazon Linux AMI. For details on these AMIs, see <>. +The following text describes how to enable {aws} Neuron-based workloads. + +.To enable {aws} Neuron (ML accelerator) based workloads +For details on training and inference workloads using [.noloc]`Neuron` in Amazon EKS, see the following references: + +* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Containers - Kubernetes - Getting Started] in the _{aws} [.noloc]`Neuron` Documentation_ +* https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/README.md#training[Training] in {aws} [.noloc]`Neuron` EKS Samples on GitHub +* <> + +The following procedure describes how to run a workload on a GPU based instance with the Amazon EKS optimized accelerated AMI. + +. After your GPU nodes join your cluster, you must apply the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA device plugin for Kubernetes] as a [.noloc]`DaemonSet` on your cluster. Replace [.replaceable]`vX.X.X` with your desired https://github.com/NVIDIA/k8s-device-plugin/releases[NVIDIA/k8s-device-plugin] version before running the following command. ++ +[source,bash,subs="verbatim,attributes,quotes"] +---- +kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml +---- +. You can verify that your nodes have allocatable GPUs with the following command. ++ +[source,bash,subs="verbatim,attributes,quotes"] +---- +kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" +---- +. Create a file named `nvidia-smi.yaml` with the following contents. Replace [.replaceable]`tag` with your desired tag for https://hub.docker.com/r/nvidia/cuda/tags[nvidia/cuda]. This manifest launches an https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] container that runs `nvidia-smi` on a node. ++ +[source,yaml,subs="verbatim,attributes,quotes"] +---- +apiVersion: v1 +kind: Pod +metadata: + name: nvidia-smi +spec: + restartPolicy: OnFailure + containers: + - name: nvidia-smi + image: nvidia/cuda:tag + args: + - "nvidia-smi" + resources: + limits: + nvidia.com/gpu: 1 +---- +. Apply the manifest with the following command. ++ +[source,bash,subs="verbatim,attributes,quotes"] +---- +kubectl apply -f nvidia-smi.yaml +---- +. After the [.noloc]`Pod` has finished running, view its logs with the following command. ++ +[source,bash,subs="verbatim,attributes,quotes"] +---- +kubectl logs nvidia-smi +---- ++ +An example output is as follows. ++ +[source,bash,subs="verbatim,attributes,quotes"] +---- +Mon Aug 6 20:23:31 20XX ++-----------------------------------------------------------------------------+ +| NVIDIA-SMI XXX.XX Driver Version: XXX.XX | +|-------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | +|===============================+======================+======================| +| 0 Tesla V100-SXM2... 
On | 00000000:00:1C.0 Off | 0 | +| N/A 46C P0 47W / 300W | 0MiB / 16160MiB | 0% Default | ++-------------------------------+----------------------+----------------------+ ++-----------------------------------------------------------------------------+ +| Processes: GPU Memory | +| GPU PID Type Process name Usage | +|=============================================================================| +| No running processes found | ++-----------------------------------------------------------------------------+ +---- + + diff --git a/latest/ug/ml/ml-get-started.adoc b/latest/ug/ml/ml-get-started.adoc new file mode 100644 index 00000000..03afe33d --- /dev/null +++ b/latest/ug/ml/ml-get-started.adoc @@ -0,0 +1,51 @@ +//!!NODE_ROOT
+
+[.topic]
+[[ml-get-started,ml-get-started.title]]
+= Get started with ML
+:info_doctype: section
+:info_title: Get started deploying Machine Learning tools on EKS
+:info_titleabbrev: Get started with ML
+:info_abstract: Choose the Machine Learning on EKS tools and platforms that best suit your needs, then use quick start procedures to deploy them to the {aws} cloud.
+
+include::../attributes.txt[]
+
+
+[abstract]
+--
+Choose the Machine Learning on EKS tools and platforms that best suit your needs, then use quick start procedures to deploy ML workloads and EKS clusters to the {aws} cloud.
+--
+
+To jump into Machine Learning on EKS, start by choosing from these prescriptive patterns to quickly get an EKS cluster and ML software and hardware ready to begin running ML workloads. Most of these patterns are based on Terraform blueprints that are available from the https://awslabs.github.io/data-on-eks/docs/introduction/intro[Data on Amazon EKS] site. Before you begin, here are a few things to keep in mind:
+
+* GPUs or Neuron instances are required to run these procedures. Lack of availability of these resources can cause these procedures to fail during cluster creation or node autoscaling.
+* {aws} Neuron-based instances (Trainium and Inferentia) can save money and are often more readily available than NVIDIA GPUs. So, when your workloads permit it, we recommend that you consider using Neuron for your Machine Learning workloads (see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/[Welcome to {aws} Neuron]).
+* Some of the getting started experiences here require that you get data via your own https://huggingface.co/[Hugging Face] account.
+
+To get started, choose from the following patterns, which are designed to help you start setting up infrastructure to run your Machine Learning workloads (a sketch of the general Terraform workflow follows the list):
+
+* *https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/jupyterhub[JupyterHub on EKS]*: Explore the https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/jupyterhub[JupyterHub blueprint], which showcases Time Slicing and MIG features, as well as multi-tenant configurations with profiles. This is ideal for deploying large-scale JupyterHub platforms on EKS.
+* *https://aws.amazon.com/ai/machine-learning/neuron/[Large Language Models on {aws} Neuron and RayServe]*: Use https://aws.amazon.com/ai/machine-learning/neuron/[{aws} Neuron] to run large language models (LLMs) on Amazon EKS and {aws} Trainium and {aws} Inferentia accelerators. See https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/Neuron/vllm-ray-inf2[Serving LLMs with RayServe and vLLM on {aws} Neuron] for instructions on setting up a platform for making inference requests, with components that include:
++
+** {aws} Neuron SDK toolkit for deep learning
+** {aws} Inferentia and Trainium accelerators
+** vLLM inference and serving engine for LLMs (see the https://docs.vllm.ai/en/latest/[vLLM] documentation site)
+** RayServe scalable model serving library (see the https://docs.ray.io/en/latest/serve/index.html[Ray Serve: Scalable and Programmable Serving] site)
+** Llama-3 language model, using your own https://huggingface.co/[Hugging Face] account.
+** Observability with {aws} CloudWatch and Neuron Monitor
+** Open WebUI
+* *https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer[Large Language Models on NVIDIA and Triton]*: Deploy multiple large language models (LLMs) on Amazon EKS and NVIDIA GPUs. See https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer[Deploying Multiple Large Language Models with NVIDIA Triton Server and vLLM] for instructions on setting up a platform for making inference requests, with components that include:
++
+** NVIDIA Triton Inference Server (see the https://github.com/triton-inference-server/server[Triton Inference Server] GitHub site)
+** vLLM inference and serving engine for LLMs (see the https://docs.vllm.ai/en/latest/[vLLM] documentation site)
+** Two language models: mistralai/Mistral-7B-Instruct-v0.2 and meta-llama/Llama-2-7b-chat-hf, using your own https://huggingface.co/[Hugging Face] account.
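+
+Most of these patterns follow a similar Terraform-based workflow. The following is a minimal sketch of that workflow, shown here for the JupyterHub blueprint. The repository directory, Region, and cluster name are illustrative assumptions, so check the blueprint page linked above for the exact steps, variables, and prerequisites.
+
+[source,bash,subs="verbatim,attributes,quotes"]
+----
+# Clone the Data on EKS blueprints repository (illustrative path; confirm it on the blueprint page).
+git clone https://github.com/awslabs/data-on-eks.git
+cd data-on-eks/ai-ml/jupyterhub
+
+# Set the Region to deploy into (example value).
+export TF_VAR_region=us-west-2
+
+# Provision the EKS cluster and the blueprint's add-ons with Terraform.
+terraform init
+terraform apply
+
+# After the cluster is up, point kubectl at it (the cluster name comes from the blueprint's outputs).
+aws eks update-kubeconfig --region us-west-2 --name jupyterhub-on-eks
+----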
+
+=== Continuing with ML on EKS
+
+Along with choosing from the blueprints described on this page, there are other ways you can proceed through the ML on EKS documentation if you prefer. For example, you can:
+
+* *Try tutorials for ML on EKS* – Run other end-to-end tutorials for building and running your own Machine Learning models on EKS. See <<ml-tutorials>>.
+
+To improve your work with ML on EKS, refer to the following:
+
+* *Prepare for ML* – Learn how to prepare for ML on EKS with features like custom AMIs and GPU reservations. See <<ml-prepare-for-cluster>>.
diff --git a/latest/ug/ml/ml-prepare-for-cluster.adoc b/latest/ug/ml/ml-prepare-for-cluster.adoc
new file mode 100644
index 00000000..d5775371
--- /dev/null
+++ b/latest/ug/ml/ml-prepare-for-cluster.adoc
@@ -0,0 +1,44 @@
+//!!NODE_ROOT
+
+[.topic]
+[[ml-prepare-for-cluster,ml-prepare-for-cluster.title]]
+= Prepare for ML clusters
+:info_doctype: section
+:info_title: Prepare to create an EKS cluster for Machine Learning
+:info_titleabbrev: Prepare for ML
+:info_abstract: Learn how to make decisions about GPUs, AMIs, and tooling before creating an EKS cluster for ML.
+
+include::../attributes.txt[]
+
+
+[abstract]
+--
+Learn how to make decisions about GPUs, AMIs, and tooling before creating an EKS cluster for ML.
+--
+
+There are several ways that you can enhance your Machine Learning on EKS experience.
+The pages in this section help you understand your choices for using ML on EKS and prepare your EKS and ML environment.
+
+In particular, they help you:
+
+* *Choose AMIs*: {aws} offers multiple customized AMIs for running ML workloads on EKS. See <>.
+* *Customize AMIs*: You can further modify {aws} custom AMIs to add other software and drivers needed for your particular use cases. See <>.
+* *Reserve GPUs*: Because GPUs are in high demand, you can reserve the GPUs you need in advance to make sure they are available when you need them. See <<capacity-blocks>>.
+* *Add EFA*: Add the Elastic Fabric Adapter to improve network performance for inter-node cluster communications. See <<node-efa>>.
+* *Use {aws} Inferentia workloads*: Create an EKS cluster with Amazon EC2 Inf1 instances. See <<inferentia-support>>.
+
+[.topiclist]
+[[Topic List]]
+
+include::ml-eks-optimized-ami.adoc[leveloffset=+1]
+
+include::capacity-blocks.adoc[leveloffset=+1]
+
+include::node-taints-managed-node-groups.adoc[leveloffset=+1]
+
+include::node-efa.adoc[leveloffset=+1]
+
+include::inferentia-support.adoc[leveloffset=+1]
diff --git a/latest/ug/ml/ml-tutorials.adoc b/latest/ug/ml/ml-tutorials.adoc
new file mode 100644
index 00000000..302099f8
--- /dev/null
+++ b/latest/ug/ml/ml-tutorials.adoc
@@ -0,0 +1,74 @@
+//!!NODE_ROOT
+
+[.topic]
+[[ml-tutorials,ml-tutorials.title]]
+= Try tutorials for deploying Machine Learning workloads on EKS
+:info_doctype: section
+:info_title: Try tutorials for deploying Machine Learning workloads and platforms on EKS
+:info_titleabbrev: Try tutorials for ML on EKS
+:info_abstract: Learn how to deploy Machine Learning workloads on EKS
+
+include::../attributes.txt[]
+
+If you are interested in setting up Machine Learning platforms and frameworks in EKS, explore the tutorials described on this page.
+These tutorials cover everything from patterns for making the best use of GPU processors to choosing modeling tools to building frameworks for specialized industries.
+
+== Build generative AI platforms on EKS
+
+* https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/[Deploy Generative AI Models on Amazon EKS]
+* https://aws.amazon.com/blogs/containers/building-multi-tenant-jupyterhub-platforms-on-amazon-eks/[Building multi-tenant JupyterHub Platforms on Amazon EKS]
+* https://aws.amazon.com/blogs/containers/run-spark-rapids-ml-workloads-with-gpus-on-amazon-emr-on-eks/[Run Spark-RAPIDS ML workloads with GPUs on Amazon EMR on EKS]
+
+== Run specialized generative AI frameworks on EKS
+
+* https://aws.amazon.com/blogs/hpc/accelerate-drug-discovery-with-nvidia-bionemo-framework-on-amazon-eks/[Accelerate drug discovery with NVIDIA BioNeMo Framework on Amazon EKS]
+* https://aws.amazon.com/blogs/containers/host-the-whisper-model-with-streaming-mode-on-amazon-eks-and-ray-serve/[Host the Whisper Model with Streaming Mode on Amazon EKS and Ray Serve]
+* https://aws.amazon.com/blogs/machine-learning/accelerate-your-generative-ai-distributed-training-workloads-with-the-nvidia-nemo-framework-on-amazon-eks/[Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS]
+* https://aws.amazon.com/blogs/publicsector/virtualizing-satcom-operations-aws/[Virtualizing satellite communication operations with {aws}]
+* https://aws.amazon.com/blogs/opensource/running-torchserve-on-amazon-elastic-kubernetes-service/[Running TorchServe on Amazon Elastic Kubernetes Service]
+
+== Maximize NVIDIA GPU performance for ML on EKS
+
+* Implement GPU sharing to efficiently use NVIDIA GPUs for your EKS clusters:
++
+https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/[GPU sharing on Amazon EKS with NVIDIA time-slicing and accelerated EC2 instances]
+
+* Use Multi-Instance GPUs (MIGs) and NIM microservices to run more pods per GPU on your EKS clusters:
++
+https://aws.amazon.com/blogs/containers/maximizing-gpu-utilization-with-nvidias-multi-instance-gpu-mig-on-amazon-eks-running-more-pods-per-gpu-for-enhanced-performance/[Maximizing GPU utilization with NVIDIA’s Multi-Instance GPU (MIG) on Amazon EKS: Running more pods per GPU for enhanced performance]
+
+* Leverage NVIDIA NIM microservices to deploy optimized AI models at scale for your inference workloads:
++
+https://aws.amazon.com/blogs/hpc/deploying-generative-ai-applications-with-nvidia-nims-on-amazon-eks/[Part 1: Deploying generative AI applications with NVIDIA NIMs on Amazon EKS]
++
+https://aws.amazon.com/blogs/hpc/deploying-generative-ai-applications-with-nvidia-nim-microservices-on-amazon-elastic-kubernetes-service-amazon-eks-part-2/[Part 2: Deploying Generative AI Applications with NVIDIA NIM Microservices on Amazon Elastic Kubernetes Service (Amazon EKS)]
+
+* 
https://aws.amazon.com/blogs/containers/scaling-a-large-language-model-with-nvidia-nim-on-amazon-eks-with-karpenter/[Scaling a Large Language Model with NVIDIA NIM on Amazon EKS with Karpenter] + + +* https://aws.amazon.com/blogs/machine-learning/build-and-deploy-a-scalable-machine-learning-system-on-kubernetes-with-kubeflow-on-aws/[Build and deploy a scalable machine learning system on Kubernetes with Kubeflow on {aws}] + +== Run video encoding workloads on EKS + +* https://aws.amazon.com/blogs/containers/delivering-video-content-with-fractional-gpus-in-containers-on-amazon-eks/[Delivering video content with fractional GPUs in containers on Amazon EKS] + +== Testimonials for ML on EKS + +* https://aws.amazon.com/blogs/containers/how-h2o-ai-optimized-and-secured-their-ai-ml-infrastructure-with-karpenter-and-bottlerocket/[How H2O.ai optimized and secured their AI/ML infrastructure with Karpenter and Bottlerocket] +* https://aws.amazon.com/blogs/containers/quora-3x-faster-machine-learning-25-lower-costs-with-nvidia-triton-on-amazon-eks/[Quora achieved 3x lower latency and 25% lower Costs by modernizing model serving with Nvidia Triton on Amazon EKS] + +== Monitoring ML workloads + +* https://aws.amazon.com/blogs/mt/monitoring-gpu-workloads-on-amazon-eks-using-aws-managed-open-source-services/[Monitoring GPU workloads on Amazon EKS using {aws} managed open-source services] +* https://aws.amazon.com/blogs/machine-learning/enable-pod-based-gpu-metrics-in-amazon-cloudwatch/[Enable pod-based GPU metrics in Amazon CloudWatch] + +== Announcements for ML on EKS + +* https://aws.amazon.com/blogs/containers/announcing-nvidia-gpu-support-for-bottlerocket-on-amazon-ecs/[Announcing NVIDIA GPU support for Bottlerocket on Amazon ECS] +* https://aws.amazon.com/blogs/containers/bottlerocket-support-for-nvidia-gpus/[Bottlerocket support for NVIDIA GPUs] +* https://aws.amazon.com/blogs/aws/new-ec2-instances-g5-with-nvidia-a10g-tensor-core-gpus/[New – EC2 Instances (G5) with NVIDIA A10G Tensor Core GPUs] +* https://aws.amazon.com/blogs/containers/utilizing-nvidia-multi-instance-gpu-mig-in-amazon-ec2-p4d-instances-on-amazon-elastic-kubernetes-service-eks/[Utilizing NVIDIA Multi-Instance GPU (MIG) in Amazon EC2 P4d Instances on Amazon Elastic Kubernetes Service] +* https://aws.amazon.com/blogs/aws/new-gpu-equipped-ec2-p4-instances-for-machine-learning-hpc/[New – GPU-Equipped EC2 P4 Instances for Machine Learning & HPC] +* https://aws.amazon.com/blogs/machine-learning/amazon-ec2-p5e-instances-are-generally-available/[Amazon EC2 P5e instances are generally available] +* https://aws.amazon.com/blogs/containers/deploying-managed-p4d-instances-in-amazon-elastic-kubernetes-service/[Deploying managed P4d Instances in Amazon Elastic Kubernetes Service with NVIDIA GPUDirectRDMA] +* https://aws.amazon.com/blogs/machine-learning/establishing-an-ai-ml-center-of-excellence/[Establishing an AI/ML center of excellence] diff --git a/latest/ug/workloads/node-efa.adoc b/latest/ug/ml/node-efa.adoc similarity index 98% rename from latest/ug/workloads/node-efa.adoc rename to latest/ug/ml/node-efa.adoc index 933662ee..a77eaeb5 100644 --- a/latest/ug/workloads/node-efa.adoc +++ b/latest/ug/ml/node-efa.adoc @@ -5,9 +5,9 @@ [[node-efa,node-efa.title]] = Run machine learning training on Amazon EKS with [.noloc]`Elastic Fabric Adapter` :info_doctype: section -:info_title: Run machine learning training on Amazon EKS with Elastic Fabric \ - Adapter -:info_titleabbrev: Machine learning training +:info_title: Add Elastic Fabric \ + Adapter to 
EKS clusters for ML training
+:info_titleabbrev: Add EFA to ML clusters
:info_abstract: Learn how to integrate Elastic Fabric Adapter (EFA) with Amazon EKS to run machine \
learning training workloads requiring high inter-node communications at scale using \
p4d instances with GPUDirect RDMA and NVIDIA Collective Communications Library \
@@ -61,7 +61,7 @@ An important consideration required for adopting EFA with [.noloc]`Kubernetes` i
The following procedure helps you create a node group with a `p4d.24xlarge` backed node group with EFA interfaces and GPUDirect RDMA, and run an example NVIDIA Collective Communications Library (NCCL) test for multi-node NCCL Performance using EFAs. The example can be used as a template for distributed deep learning training on Amazon EKS using EFAs.
-. Determine which Amazon EC2 instance types that support EFA are available in the {aws} Region that you want to deploy nodes in.Replace [.replaceable]`region-code` with the {aws} Region that you want to deploy your node group in.
+. Determine which Amazon EC2 instance types that support EFA are available in the {aws} Region that you want to deploy nodes in. Replace [.replaceable]`region-code` with the {aws} Region that you want to deploy your node group in.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
diff --git a/latest/ug/nodes/node-taints-managed-node-groups.adoc b/latest/ug/ml/node-taints-managed-node-groups.adoc
similarity index 88%
rename from latest/ug/nodes/node-taints-managed-node-groups.adoc
rename to latest/ug/ml/node-taints-managed-node-groups.adoc
index 9c2f96c0..9af3a11d 100644
--- a/latest/ug/nodes/node-taints-managed-node-groups.adoc
+++ b/latest/ug/ml/node-taints-managed-node-groups.adoc
@@ -3,13 +3,17 @@ include::../attributes.txt[]
[.topic]
[[node-taints-managed-node-groups,node-taints-managed-node-groups.title]]
= Prevent [.noloc]`Pods` from being scheduled on specific nodes
-:info_titleabbrev: Node taints
+:info_titleabbrev: Taint GPU nodes
[abstract]
--
-Taints and tolerations work together to ensure that [.noloc]`Pods` aren't scheduled onto inappropriate nodes.
+Taints and tolerations work together to ensure that [.noloc]`Pods` aren't scheduled onto inappropriate nodes. This can be particularly useful for nodes running on GPU hardware.
--
+Nodes with specialized processors, such as GPUs, can be more expensive to run than nodes running on more standard machines.
+For that reason, you may want to prevent workloads that don't require the specialized hardware from being deployed to those nodes.
+One way to do that is with taints.
+
Amazon EKS supports configuring [.noloc]`Kubernetes` taints through managed node groups. Taints and tolerations work together to ensure that [.noloc]`Pods` aren't scheduled onto inappropriate nodes. One or more taints can be applied to a node. This marks that the node shouldn't accept any [.noloc]`Pods` that don't tolerate the taints. Tolerations are applied to [.noloc]`Pods` and allow, but don't require, the [.noloc]`Pods` to schedule onto nodes with matching taints. For more information, see https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/[Taints and Tolerations] in the [.noloc]`Kubernetes` documentation. [.noloc]`Kubernetes` node taints can be applied to new and existing managed node groups using the {aws-management-console} or through the Amazon EKS API.
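For example, the following is a minimal sketch of adding a taint to an existing GPU-backed managed node group with the {aws} CLI so that only Pods with a matching toleration are scheduled onto those nodes. The cluster name, node group name, and taint key are illustrative assumptions; use values that match your environment and the tolerations your workloads carry.

[source,bash,subs="verbatim,attributes,quotes"]
----
# Add or update a taint on an existing managed node group (names are examples).
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name gpu-nodes \
  --taints 'addOrUpdateTaints=[{key=nvidia.com/gpu,value=true,effect=NO_SCHEDULE}]'

# Confirm that the taint now appears on the nodes in that node group.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
----

Pods that should land on the tainted nodes then need a matching toleration in their Pod spec; all other Pods are kept off the GPU nodes.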
diff --git a/latest/ug/nodes/dl1.adoc b/latest/ug/nodes/dl1.adoc deleted file mode 100644 index ba262210..00000000 --- a/latest/ug/nodes/dl1.adoc +++ /dev/null @@ -1,25 +0,0 @@ -//!!NODE_ROOT
-include::../attributes.txt[] -[.topic] -[[dl1,dl1.title]] -= Use Habana Deep Learning ([.noloc]`DL1`) workloads -:info_titleabbrev: Deep learning - -[abstract] --- -Custom Amazon Linux 2 (AL2) AMIs in Amazon EKS can support deep learning workloads at scale through additional configuration and [.noloc]`Kubernetes` add-ons. --- - -Custom Amazon Linux 2 (AL2) AMIs in Amazon EKS can support deep learning workloads at scale through additional configuration and [.noloc]`Kubernetes` add-ons. This document describes the components required to set up a generic [.noloc]`Kubernetes` solution for an on-premise setup or as a baseline in a larger cloud configuration. To support this function, you will have to perform the following steps in your custom environment: - - - -* [.noloc]`SynapaseAI(R) Software` drivers loaded on the system – These are included in the https://github.com/aws-samples/aws-habana-baseami-pipeline[AMIs available on Github]. -* The [.noloc]`Habana` device plugin – A [.noloc]`DaemonSet` that allows you to automatically enable the registration of [.noloc]`Habana` devices in your [.noloc]`Kubernetes` cluster and track device health. -* Helm 3.x -* https://docs.habana.ai/en/latest/Gaudi_Kubernetes/Gaudi_Kubernetes.html#habana-mpi-operator-and-helm-chart-for-kubernetes[Helm chart to install MPI Operator]. -* MPI Operator -. Create and launch a base AMI from AL2, [.noloc]`Ubuntu` 18, or [.noloc]`Ubuntu` 20. -. Follow https://docs.habana.ai/en/latest/Gaudi_Kubernetes/Gaudi_Kubernetes.html[these instructions] to set up the environment for [.noloc]`DL1`. - - diff --git a/latest/ug/nodes/eks-ami-build-scripts.adoc b/latest/ug/nodes/eks-ami-build-scripts.adoc index de1c5195..12e16c0d 100644 --- a/latest/ug/nodes/eks-ami-build-scripts.adoc +++ b/latest/ug/nodes/eks-ami-build-scripts.adoc @@ -23,7 +23,3 @@ Additionally, the [.noloc]`GitHub` repository contains our Amazon EKS node {aws} For more information, see the repositories on [.noloc]`GitHub` at https://github.com/awslabs/amazon-eks-ami. Amazon EKS optimized AL2 contains an optional bootstrap flag to enable the `containerd` runtime. - -include::vt1.adoc[leveloffset=+1] - -include::dl1.adoc[leveloffset=+1] \ No newline at end of file diff --git a/latest/ug/nodes/eks-optimized-ami.adoc b/latest/ug/nodes/eks-optimized-ami.adoc index 5b4251ac..4e9577f9 100644 --- a/latest/ug/nodes/eks-optimized-ami.adoc +++ b/latest/ug/nodes/eks-optimized-ami.adoc @@ -49,8 +49,6 @@ The Amazon EKS optimized accelerated Amazon Linux AMI is built on top of the sta In addition to the standard Amazon EKS optimized AMI configuration, the accelerated AMI includes the following: - - * [.noloc]`NVIDIA` drivers * `nvidia-container-toolkit` * {aws} [.noloc]`Neuron` driver @@ -60,87 +58,13 @@ For a list of the latest components included in the accelerated AMI, see the `am [NOTE] ==== - * The Amazon EKS optimized accelerated AMI only supports GPU and [.noloc]`Inferentia` based instance types. Make sure to specify these instance types in your node {aws} CloudFormation template. By using the Amazon EKS optimized accelerated AMI, you agree to https://s3.amazonaws.com/EULA/NVidiaEULAforAWS.pdf[NVIDIA's Cloud End User License Agreement (EULA)]. -* The Amazon EKS optimized accelerated AMI was previously referred to as the _Amazon EKS optimized AMI with GPU support_. -* Previous versions of the Amazon EKS optimized accelerated AMI installed the `nvidia-docker` repository. The repository is no longer included in Amazon EKS AMI version `v20200529` and later. 
+* The Amazon EKS optimized accelerated AMI was previously referred to as the _Amazon EKS optimized AMI with GPU support_. +* Previous versions of the Amazon EKS optimized accelerated AMI installed the `nvidia-docker` repository. The repository is no longer included in Amazon EKS AMI version `v20200529` and later. ==== -.To enable {aws} Neuron (ML accelerator) based workloads -For details on training and inference workloads using [.noloc]`Neuron` in Amazon EKS, see the following references: - -* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Containers - Kubernetes - Getting Started] in the _{aws} [.noloc]`Neuron` Documentation_ -* https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/README.md#training[Training] in {aws} [.noloc]`Neuron` EKS Samples on GitHub -* <> - -The following procedure describes how to run a workload on a GPU based instance with the Amazon EKS optimized accelerated AMI. - -. After your GPU nodes join your cluster, you must apply the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA device plugin for Kubernetes] as a [.noloc]`DaemonSet` on your cluster. Replace [.replaceable]`vX.X.X` with your desired https://github.com/NVIDIA/k8s-device-plugin/releases[NVIDIA/k8s-device-plugin] version before running the following command. -+ -[source,bash,subs="verbatim,attributes,quotes"] ----- -kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml ----- -. You can verify that your nodes have allocatable GPUs with the following command. -+ -[source,bash,subs="verbatim,attributes,quotes"] ----- -kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" ----- -. Create a file named `nvidia-smi.yaml` with the following contents. Replace [.replaceable]`tag` with your desired tag for https://hub.docker.com/r/nvidia/cuda/tags[nvidia/cuda]. This manifest launches an https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] container that runs `nvidia-smi` on a node. -+ -[source,yaml,subs="verbatim,attributes,quotes"] ----- -apiVersion: v1 -kind: Pod -metadata: - name: nvidia-smi -spec: - restartPolicy: OnFailure - containers: - - name: nvidia-smi - image: nvidia/cuda:tag - args: - - "nvidia-smi" - resources: - limits: - nvidia.com/gpu: 1 ----- -. Apply the manifest with the following command. -+ -[source,bash,subs="verbatim,attributes,quotes"] ----- -kubectl apply -f nvidia-smi.yaml ----- -. After the [.noloc]`Pod` has finished running, view its logs with the following command. -+ -[source,bash,subs="verbatim,attributes,quotes"] ----- -kubectl logs nvidia-smi ----- -+ -An example output is as follows. -+ -[source,bash,subs="verbatim,attributes,quotes"] ----- -Mon Aug 6 20:23:31 20XX -+-----------------------------------------------------------------------------+ -| NVIDIA-SMI XXX.XX Driver Version: XXX.XX | -|-------------------------------+----------------------+----------------------+ -| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | -| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | -|===============================+======================+======================| -| 0 Tesla V100-SXM2... 
On | 00000000:00:1C.0 Off | 0 | -| N/A 46C P0 47W / 300W | 0MiB / 16160MiB | 0% Default | -+-------------------------------+----------------------+----------------------+ -+-----------------------------------------------------------------------------+ -| Processes: GPU Memory | -| GPU PID Type Process name Usage | -|=============================================================================| -| No running processes found | -+-----------------------------------------------------------------------------+ ----- +For details on running workloads on EKS optimized accelerated Amazon Linux AMIs, see <>. [[arm-ami,arm-ami.title]] diff --git a/latest/ug/nodes/managed-node-groups.adoc b/latest/ug/nodes/managed-node-groups.adoc index b6f0c64d..6b378b0f 100644 --- a/latest/ug/nodes/managed-node-groups.adoc +++ b/latest/ug/nodes/managed-node-groups.adoc @@ -19,8 +19,6 @@ include::update-managed-node-group.adoc[leveloffset=+1] include::managed-node-update-behavior.adoc[leveloffset=+1] -include::node-taints-managed-node-groups.adoc[leveloffset=+1] - include::launch-templates.adoc[leveloffset=+1] include::delete-managed-node-group.adoc[leveloffset=+1] diff --git a/latest/ug/nodes/vt1.adoc b/latest/ug/nodes/vt1.adoc deleted file mode 100644 index 97040700..00000000 --- a/latest/ug/nodes/vt1.adoc +++ /dev/null @@ -1,26 +0,0 @@ -//!!NODE_ROOT
-include::../attributes.txt[] -[.topic] -[[vt1,vt1.title]] -= Use hardware-accelerated [.noloc]`VT1` video transcoding -:info_titleabbrev: Video transcoding - -[abstract] --- -Custom Amazon Linux AMIs in Amazon EKS can support the [.noloc]`VT1` video transcoding instance family for Amazon Linux 2 (AL2), --- - -Custom Amazon Linux AMIs in Amazon EKS can support the VT1 video transcoding instance family for Amazon Linux 2 (AL2), [.noloc]`Ubuntu` 18, and [.noloc]`Ubuntu` 20. [.noloc]`VT1` supports the [.noloc]`Xilinx` U30 media transcoding cards with accelerated H.264/AVC and H.265/HEVC codecs. To get the benefit of these accelerated instances, you must follow these steps: - -. Create and launch a base AMI from AL2, [.noloc]`Ubuntu` 18, or [.noloc]`Ubuntu` 20. -. After the based AMI is launched, Install the https://xilinx.github.io/video-sdk/[XRT driver] and runtime on the node. -. <>. -. Install the [.noloc]`Kubernetes` https://github.com/Xilinx/FPGA_as_a_Service/tree/master/k8s-device-plugin[FPGA plugin] on your cluster. -+ -[source,bash,subs="verbatim,attributes,quotes"] ----- -kubectl apply -f fpga-device-plugin.yml ----- - -The plugin will now advertise [.noloc]`Xilinx` U30 devices per node on your Amazon EKS cluster. You can use the [.noloc]`FFMPEG` docker image to run example video transcoding workloads on your Amazon EKS cluster. - diff --git a/latest/ug/nodes/worker.adoc b/latest/ug/nodes/worker.adoc index 40305f41..f9f12331 100644 --- a/latest/ug/nodes/worker.adoc +++ b/latest/ug/nodes/worker.adoc @@ -47,8 +47,6 @@ For more information about nodes from a general [.noloc]`Kubernetes` perspective [.topic] include::launch-workers.adoc[leveloffset=+1] -include::capacity-blocks.adoc[leveloffset=+1] - include::launch-node-bottlerocket.adoc[leveloffset=+1] include::launch-windows-workers.adoc[leveloffset=+1] diff --git a/latest/ug/workloads/eks-workloads.adoc b/latest/ug/workloads/eks-workloads.adoc index 4e45fb20..e6265cb3 100644 --- a/latest/ug/workloads/eks-workloads.adoc +++ b/latest/ug/workloads/eks-workloads.adoc @@ -59,8 +59,3 @@ include::eks-add-ons.adoc[leveloffset=+1] include::image-verification.adoc[leveloffset=+1] - -include::node-efa.adoc[leveloffset=+1] - - -include::inferentia-support.adoc[leveloffset=+1]