This repository provides instructions for deploying LLMs across multiple GPUs on distributed OpenShift / Kubernetes worker nodes.
- Overview
- Important Disclaimer
- Checking the Memory Footprint of the Model
- Using Multiple GPUs for Serving an LLM
- vLLM Tensor Parallelism (TP)
- Optimizing Memory Utilization on a Single GPU
- Demos
- Demo Steps
- Testing the Multi-GPU Demos
- Links of Interest
Large LLMs like Llama-3-70b or Falcon 180B may not fit on a single GPU.
If training/serving a model on a single GPU is too slow or if the model’s weights do not fit in a single GPU’s memory, transitioning to a multi-GPU setup may be a viable option.
However, serving large language models (LLMs) with multiple GPUs in a distributed environment can be challenging.
IMPORTANT DISCLAIMER: Read before proceeding!
- This repository and its demos are not supported by OpenShift AI/RHOAI; they rely on upstream projects.
- This is prototyping/testing work intended to confirm functionality and determine the necessary requirements.
- These features are not available in the RHOAI dashboard. If you want to implement them, you will need to adapt YAML files to fit your use case.
Before deploying a model in a distributed environment, it is important to check the memory footprint of the model.
To begin estimating how much vRAM is required to serve your LLM, we can use these tools:
- HF Model Memory Usage
- GPU Poor vRAM Calculator
- LLM Model VRAM Calculator (only for quantized models)
- LLM Explorer to check raw model vRAM size consumption
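As a quick rule of thumb before using the calculators above, the weights alone take roughly `number of parameters × bytes per parameter`, and you still need headroom for the KV cache and activations. A minimal sketch of that arithmetic (the 7B figure is just an example):

```bash
# Rough estimate of vRAM needed for the model weights only (KV cache and
# activations come on top of this, so leave extra headroom).
PARAMS=7000000000    # e.g. a 7B-parameter model
BYTES_PER_PARAM=2    # float16/bfloat16 = 2 bytes, int8 = 1 byte
echo "~$((PARAMS * BYTES_PER_PARAM / 1000000000)) GB for weights alone"
```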
When a model is too big to fit on a single GPU, we can use various techniques to optimize the memory utilization.
Among the different strategies, we can use Tensor Parallelism to distribute the model across multiple GPUs.
Tensor parallelism is a technique used to fit large models across multiple GPUs. In Tensor Parallelism, each GPU processes a slice of a tensor and only aggregates the full tensor for operations requiring it.
For example, when multiplying the input tensors with the first weight tensor, the matrix multiplication can be achieved by splitting the weight tensor column-wise, multiplying each column with the input separately, and then concatenating the separate outputs.
These outputs are then transferred from the GPUs and concatenated to obtain the final result.
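In matrix terms, for `Y = X · W` the weight matrix is split column-wise as `W = [W1 | W2]`, each GPU computes its own `X · Wi`, and the partial results are concatenated to give `Y = [X·W1 | X·W2]`.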
vLLM supports distributed tensor-parallel inference and serving. Currently, vLLM supports Megatron-LM's tensor parallel algorithm, and it manages the distributed runtime with either Ray or Python multiprocessing.
IMPORTANT: Check with the AI BU PMs or your account team to ensure that the Serving Runtime you are using supports tensor parallelism.
There are two ways to use Tensor Parallelism:
- On a single worker node with multiple GPUs
- Across multiple worker nodes, with GPUs allocated to each node
Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
To run multi-GPU serving on a single worker node (with multiple GPUs), pass the --tensor-parallel-size argument when starting the server. This argument specifies the number of GPUs to use for tensor parallelism.
- For example, to run Mistral 7B on 2 GPUs, start the server with the following arguments:
- "--model"
- mistralai/Mistral-7B-Instruct-v0.2
- "--download-dir"
- /models-cache
- "--dtype"
- float16
- "--tensor-parallel-size=2"
WIP
To scale vLLM beyond a single worker node, start a Ray runtime via the CLI before running vLLM. After that, you can run inference and serving across multiple machines by launching the vLLM process on the head node and setting tensor_parallel_size to the total number of GPUs across all machines.
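A rough sketch of that flow (assuming Ray is available in the image and the nodes can reach each other; `HEAD_NODE_IP` is a placeholder):

```bash
# On the head node: start the Ray head process.
ray start --head --port=6379

# On every other worker node: join the Ray cluster started above.
ray start --address=HEAD_NODE_IP:6379

# Back on the head node: launch vLLM with the total GPU count across all
# nodes, e.g. 4 for 2 nodes with 2 GPUs each.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --tensor-parallel-size 4
```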
Quantization techniques can reduce the memory footprint of the model and may allow the LLM to fit on a single GPU; there are also other techniques (such as FlashAttention-2) that reduce memory usage. Keep in mind that quantization lowers the precision, and potentially the accuracy, of the model and can add some overhead to inference time, so be aware of these trade-offs before applying it.
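For reference, serving a pre-quantized checkpoint on a single GPU with vLLM might look like the sketch below (the AWQ model name is just an example placeholder; pick a quantized checkpoint that fits your use case):

```bash
# Sketch: serving an AWQ-quantized checkpoint on a single GPU.
# The model name is an example placeholder, not a recommendation.
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --tensor-parallel-size 1
```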
Once you have employed these strategies and found them insufficient for your case on a single GPU, consider moving to multiple GPUs.
- Running Granite 8B on 2xT4 GPUs
- Running Mistral 7B on 2xT4 GPUs
- Running Llama3 7B on 2xT4 GPUs
- Running Falcon 40B on 8xA10G GPUs
- Running Mixtral 8x7B on 8xA10G GPUs
- Running Llama2 13B on 2xA10G GPUs
TBD
If you already have GPUs installed in your OpenShift cluster, you can skip this step.
- Provision the GPU nodes in OpenShift / Kubernetes using a MachineSet
bash bootstrap/gpu-machineset.sh
- Follow the instructions in the script to provision the GPU nodes.
### Select the GPU instance type:
1) Tesla T4 Single GPU 4) A10G Multi GPU 7) DL1
2) Tesla T4 Multi GPU 5) A100 8) L4 Single GPU
3) A10G Single GPU 6) H100 9) L4 Multi GPU
Please enter your choice: 3
### Enter the AWS region (default: us-west-2):
### Select the availability zone (az1, az2, az3):
1) az1
2) az2
3) az3
Please enter your choice: 3
### Creating new machineset worker-gpu-g5.2xlarge-us-west-2c.
machineset.machine.openshift.io/worker-gpu-g5.2xlarge-us-west-2c created
--- New machineset worker-gpu-g5.2xlarge-us-west-2c created.
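Once the machineset scales up, you can check that the new nodes are ready and advertise GPUs (this assumes the NVIDIA GPU Operator and Node Feature Discovery are already installed and have labeled the nodes):

```bash
# Confirm the new GPU machines are running.
oc get machines -n openshift-machine-api | grep gpu

# List the GPU nodes and confirm they expose the nvidia.com/gpu resource.
oc get nodes -l nvidia.com/gpu.present=true
oc describe nodes -l nvidia.com/gpu.present=true | grep 'nvidia.com/gpu:'
```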
- Create the Namespace for the demo:
kubectl create ns multi-gpu-poc
- For example, if you want to deploy the Granite 7B model on 2xT4 GPUs, run the following command:
kubectl apply -k llm-servers/overlays/granite-7B/
Check the README.md file in each overlay folder for more details on how to deploy the model.
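After applying the overlay, you can watch the model server come up and inspect its logs; a sketch (`POD_NAME` is a placeholder for the pod created by the overlay):

```bash
# Watch the model server pods start in the demo namespace.
kubectl get pods -n multi-gpu-poc -w

# Follow the vLLM logs to see the model download and the tensor-parallel workers initialize.
kubectl logs -f POD_NAME -n multi-gpu-poc
```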
TBD
- Check the Testing Multi-GPU Demos section for more details on how to test the deployed models.
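As a quick smoke test (assuming the model is served through the vLLM OpenAI-compatible API; `LLM_ENDPOINT` is a placeholder for your route or service URL), you can send a completion request:

```bash
# Replace with the route/service URL of your deployed model server.
LLM_ENDPOINT=https://your-model-route.example.com

curl -s "$LLM_ENDPOINT/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "What is tensor parallelism?",
        "max_tokens": 100
      }'
```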