This repository provides instructions for deploying LLMs across multiple GPUs on distributed OpenShift / Kubernetes worker nodes.
- Overview
- Important Disclaimer
- Checking the Memory Footprint of the Model
- Using Multiple GPUs for Serving an LLM
- vLLM Tensor Parallelism (TP)
- Optimizing Memory Utilization on a Single GPU
- Demos
- Demo Steps
- Testing the Multi-GPU Demos
- Links of Interest
Large LLMs like Llama-3-70b or Falcon 180B may not fit on a single GPU.
If training/serving a model on a single GPU is too slow or if the model’s weights do not fit in a single GPU’s memory, transitioning to a multi-GPU setup may be a viable option.
However, serving large language models (LLMs) with multiple GPUs in a distributed environment can be challenging.
IMPORTANT DISCLAIMER: Read before proceeding!
- This repository and its demos are not supported by OpenShift AI/RHOAI; they rely on upstream projects.
- This is prototyping/testing work intended to confirm functionality and determine the necessary requirements.
- These features are not available in the RHOAI dashboard. If you want to implement them, you will need to adapt YAML files to fit your use case.
Before deploying a model in a distributed environment, it is important to check the memory footprint of the model.
To begin estimating how much vRAM is required to serve your LLM, we can use these tools:
- HF Model Memory Usage
- GPU Poor vRAM Calculator
- LLM Model VRAM Calculator (only for quantized models)
- LLM Explorer to check raw model vRAM size consumption
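As a quick rule of thumb before using the calculators above, the weights alone take roughly `number of parameters × bytes per parameter`, and you still need headroom for the KV cache and activations. A minimal sketch of that arithmetic (the 7B figure is just an example):

```bash
# Rough estimate of vRAM needed for the model weights only (KV cache and
# activations come on top of this, so leave extra headroom).
PARAMS=7000000000    # e.g. a 7B-parameter model
BYTES_PER_PARAM=2    # float16/bfloat16 = 2 bytes, int8 = 1 byte
echo "~$((PARAMS * BYTES_PER_PARAM / 1000000000)) GB for weights alone"
```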
When a model is too big to fit on a single GPU, we can use various techniques to optimize the memory utilization.
Among the different strategies, we can use Tensor Parallelism to distribute the model across multiple GPUs.
Tensor parallelism is a technique used to fit large models across multiple GPUs. In Tensor Parallelism, each GPU processes a slice of a tensor and only aggregates the full tensor for operations requiring it.
For example, when multiplying the input tensors with the first weight tensor, the matrix multiplication can be achieved by splitting the weight tensor column-wise, multiplying each column with the input separately, and then concatenating the separate outputs.
These outputs are then transferred from the GPUs and concatenated to obtain the final result.
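In matrix terms, for `Y = X · W` the weight matrix is split column-wise as `W = [W1 | W2]`, each GPU computes its own `X · Wi`, and the partial results are concatenated to give `Y = [X·W1 | X·W2]`.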
vLLM supports distributed tensor-parallel inference and serving. Currently, vLLM supports Megatron-LM's tensor parallel algorithm, and it manages the distributed runtime with either Ray or Python multiprocessing.
IMPORTANT: Check with the AI BU PMs or your account team to ensure that the Serving Runtime you are using supports tensor parallelism.
There are two ways to use Tensor Parallelism:
- On a single worker node with multiple GPUs
- Across multiple worker nodes, with GPUs allocated to each node
Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
To run multi-GPU serving on a single worker node (with multiple GPUs), pass the --tensor-parallel-size argument when starting the server. This argument specifies the number of GPUs to use for tensor parallelism.
- For example, to run Mistral 7B on 2 GPUs, start the server with the following arguments:
- "--model"
- mistralai/Mistral-7B-Instruct-v0.2
- "--download-dir"
- /models-cache
- "--dtype"
- float16
- "--tensor-parallel-size=2"
WIP
To scale vLLM beyond a single worker node, start a Ray runtime via the CLI before running vLLM. After that, you can run inference and serving across multiple machines by launching the vLLM process on the head node and setting tensor_parallel_size to the total number of GPUs across all machines.
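A rough sketch of that flow (assuming Ray is available in the image and the nodes can reach each other; `HEAD_NODE_IP` is a placeholder):

```bash
# On the head node: start the Ray head process.
ray start --head --port=6379

# On every other worker node: join the Ray cluster started above.
ray start --address=HEAD_NODE_IP:6379

# Back on the head node: launch vLLM with the total GPU count across all
# nodes, e.g. 4 for 2 nodes with 2 GPUs each.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --tensor-parallel-size 4
```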
Quantization techniques can reduce the memory footprint of the model and may allow the LLM to fit on a single GPU; there are also other techniques (such as FlashAttention-2) that reduce memory usage. Keep in mind that quantization lowers the precision, and potentially the accuracy, of the model and can add some overhead to inference time, so be aware of these trade-offs before applying it.
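For reference, serving a pre-quantized checkpoint on a single GPU with vLLM might look like the sketch below (the AWQ model name is just an example placeholder; pick a quantized checkpoint that fits your use case):

```bash
# Sketch: serving an AWQ-quantized checkpoint on a single GPU.
# The model name is an example placeholder, not a recommendation.
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --tensor-parallel-size 1
```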
Once you have employed these strategies and found them insufficient for your case on a single GPU, consider moving to multiple GPUs.
- Running Granite 8B on 2xT4 GPUs
- Running Mistral 7B on 2xT4 GPUs
- Running Llama3 7B on 2xT4 GPUs
- Running Falcon 40B on 8xA10G GPUs
- Running Mixtral 8x7B on 8xA10G GPUs
- Running Llama2 13B on 2xA10G GPUs
TBD
If you already have GPUs installed in your OpenShift cluster, you can skip this step.
- Provision the GPU nodes in OpenShift / Kubernetes using a MachineSet
bash bootstrap/gpu-machineset.sh
- Follow the instructions in the script to provision the GPU nodes.
### Select the GPU instance type:
1) Tesla T4 Single GPU 4) A10G Multi GPU 7) DL1
2) Tesla T4 Multi GPU 5) A100 8) L4 Single GPU
3) A10G Single GPU 6) H100 9) L4 Multi GPU
Please enter your choice: 3
### Enter the AWS region (default: us-west-2):
### Select the availability zone (az1, az2, az3):
1) az1
2) az2
3) az3
Please enter your choice: 3
### Creating new machineset worker-gpu-g5.2xlarge-us-west-2c.
machineset.machine.openshift.io/worker-gpu-g5.2xlarge-us-west-2c created
--- New machineset worker-gpu-g5.2xlarge-us-west-2c created.
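Once the machineset scales up, you can check that the new nodes are ready and advertise GPUs (this assumes the NVIDIA GPU Operator and Node Feature Discovery are already installed and have labeled the nodes):

```bash
# Confirm the new GPU machines are running.
oc get machines -n openshift-machine-api | grep gpu

# List the GPU nodes and confirm they expose the nvidia.com/gpu resource.
oc get nodes -l nvidia.com/gpu.present=true
oc describe nodes -l nvidia.com/gpu.present=true | grep 'nvidia.com/gpu:'
```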
- Create the Namespace for the demo:
kubectl create ns multi-gpu-poc
- For example, if you want to deploy the Granite 7B model on 2xT4 GPUs, run the following command:
kubectl apply -k llm-servers/overlays/granite-7B/
Check the README.md file in each overlay folder for more details on how to deploy the model.
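After applying the overlay, you can watch the model server come up and inspect its logs; a sketch (`POD_NAME` is a placeholder for the pod created by the overlay):

```bash
# Watch the model server pods start in the demo namespace.
kubectl get pods -n multi-gpu-poc -w

# Follow the vLLM logs to see the model download and the tensor-parallel workers initialize.
kubectl logs -f POD_NAME -n multi-gpu-poc
```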
TBD
- Check the Testing Multi-GPU Demos section for more details on how to test the deployed models.
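As a quick smoke test (assuming the model is served through the vLLM OpenAI-compatible API; `LLM_ENDPOINT` is a placeholder for your route or service URL), you can send a completion request:

```bash
# Replace with the route/service URL of your deployed model server.
LLM_ENDPOINT=https://your-model-route.example.com

curl -s "$LLM_ENDPOINT/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "What is tensor parallelism?",
        "max_tokens": 100
      }'
```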