Deploying models with KServe simplifies model serving, but the rapid growth of Large Language Models (LLMs) makes deploying these massive models on a single GPU increasingly challenging. To address this, leveraging multiple GPUs across multiple nodes has become essential. Fortunately, vLLM supports multi-node/multi-GPU deployment using Ray, and with the `vllm-multinode-runtime` serving runtime, OpenShift AI provides a solution for multi-node/multi-GPU setups.
This guide details the steps to enable multi-node/multi-GPU deployment with OpenShift AI model serving.
- Deploy Big LLMs with Multi-Worker and Multi-GPUs
- Important Disclaimer
- Tested Scenarios
- Considerations
- Demo Guide
- Notes for Multi-Node Setup
IMPORTANT DISCLAIMER: Read before proceeding!
- This repository and its demos are not supported by OpenShift AI/RHOAI; they rely on upstream projects.
- This is prototyping/testing work intended to confirm functionality and determine the necessary requirements.
- These features are not available in the RHOAI dashboard. If you want to implement them, you will need to adapt YAML files to fit your use case.
Tested Scenarios:

- OpenShift Cluster 4.15 (AWS)
- AWS g5.4xlarge instances (NVIDIA A10G - 24GiB vRAM)
- RHOAI 2.14
Considerations:

- Deployment Mode: Multi-node functionality is supported only in `RawDeployment` mode.
- Auto-scaling: Not available for multi-node setups. The autoscaler will automatically be set to `external`.
- Persistent Volume Claim (PVC): Required for multi-node configurations, and it must support the `ReadWriteMany` (RWX) access mode (see the sketch after this list).
- Required Operators:
  - Node Feature Discovery Operator: Required to detect node features.
  - NVIDIA GPU Operator: Required to use GPUs for inference.
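For reference, this is a minimal sketch of a PVC that satisfies the `ReadWriteMany` requirement. The name, size, and storage class are assumptions (the demo's kustomize overlays already create the storage the PoC needs); any RWX-capable storage class, such as one backed by the NFS provisioner installed below, will do.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc              # placeholder name
spec:
  accessModes:
    - ReadWriteMany            # RWX is required for multi-node serving
  resources:
    requests:
      storage: 50Gi            # size the claim for your model weights
  storageClassName: nfs-csi    # assumption: replace with your RWX-capable storage class
```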
Demo Guide:

- Export Variables
export DEMO_NAMESPACE="demo-multi-node-multi-gpu"
export MODEL_NAME="vllm-llama3-8b"
export MODEL_TYPE="llama3"
- Install NFS Operator
bash utils/nfs-operator.sh
- Install RHOAI and other operators
kubectl apply -k 1-rhoai-operators/overlays/
- Install RHOAI, NFD, NFS and NVIDIA GPU Instances
kubectl apply -k 2-rhoai-instances/overlays/
- Deploy the prerequisites for the PoC, including the model
kubectl apply -k 3-demo-prep/overlays/$MODEL_TYPE
- Deploy Custom CRD and vLLM Multi Node Serving Runtime Template
kubectl apply -k 4-demo-deploy-is-sr/overlays
oc process vllm-multinode-runtime-template -n $DEMO_NAMESPACE | kubectl apply -n $DEMO_NAMESPACE -f -
- Check the GPU resource status
podName=$(oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(kubectl get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor-worker --no-headers|cut -d' ' -f1)
oc -n $DEMO_NAMESPACE wait --for=condition=ready pod/${podName} --timeout=300s
- You can check the logs for both the head and worker pods (see the commands below).
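A minimal sketch for viewing the logs, assuming the `podName` and `workerPodName` variables captured in the previous step (add `-c <container>` if your pods run more than one container):

```bash
# Head node (predictor) logs
kubectl logs $podName -n $DEMO_NAMESPACE

# Worker node logs
kubectl logs $workerPodName -n $DEMO_NAMESPACE
```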
- Check the GPU memory size for both the head and worker pods:
echo "### HEAD NODE GPU Memory Size"
kubectl exec $podName -n $DEMO_NAMESPACE -- nvidia-smi
echo "### WORKER NODE GPU Memory Size"
kubectl exec $workerPodName -n $DEMO_NAMESPACE -- nvidia-smi
- To verify the status of your InferenceService, run the following commands:
oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
export isvc_url=$(oc get route -n $DEMO_NAMESPACE |grep $MODEL_NAME| awk '{print $2}')
- Send a RESTful request to the LLM deployed in Multi-Node Multi-GPU:
curl https://$isvc_url/v1/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$MODEL_NAME\",
\"prompt\": \"What is the biggest clothes retail company in the world?\",
\"max_tokens\": 100,
\"temperature\": 0
}"
- The LLM's response will look similar to this:
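An illustrative vLLM OpenAI-compatible completion response (not captured from a real run); the IDs, generated text, and token counts will differ:

```json
{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1715000000,
  "model": "vllm-llama3-8b",
  "choices": [
    {
      "index": 0,
      "text": " <generated answer about the largest clothing retailer>",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 112,
    "completion_tokens": 100
  }
}
```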
- You can also check the Ray cluster status with `ray status`:
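A minimal sketch, assuming the head pod captured in `podName` and that the `ray` CLI is available inside its serving container:

```bash
# List the Ray nodes and GPU usage from inside the head pod
kubectl exec $podName -n $DEMO_NAMESPACE -- ray status
```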
Notes for Multi-Node Setup:

- Parallelism Settings:
  - `TENSOR_PARALLEL_SIZE` and `PIPELINE_PARALLEL_SIZE` cannot be set via environment variables. They must be configured through `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` (see the sketch after this list).
  - In a multi-node ServingRuntime, both `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` must be specified.
  - Minimum values: `workerSpec.tensorParallelSize`: 1, `workerSpec.pipelineParallelSize`: 2.
- Supported GPU Types:
  - Allowed GPU types: `nvidia.com/gpu` (default), `intel.com/gpu`, `amd.com/gpu`, and `habana.ai/gaudi`.
  - The GPU type can be specified in the `InferenceService`. However, if it differs from what is set in the `ServingRuntime`, both GPU types will be assigned, potentially causing issues.
- Autoscaler Configuration: The autoscaler must be configured as `external`.
- Storage Protocol: The only supported storage protocol for `storageUri` is `PVC`.
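To tie these notes together, here is a minimal sketch of an `InferenceService` that sets the parallelism through `workerSpec`, uses a PVC-backed `storageUri`, and runs in `RawDeployment` mode with an external autoscaler. The model name, PVC name, and model path are placeholders, and the exact schema comes from the custom CRD installed in step 4, so adapt it to your deployment:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vllm-llama3-8b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment   # multi-node only works in RawDeployment mode
    serving.kserve.io/autoscalerClass: external       # autoscaler must be external
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: pvc://model-pvc/llama3-8b           # only PVC is supported; name/path are placeholders
    workerSpec:
      tensorParallelSize: 1      # minimum 1; cannot be set via TENSOR_PARALLEL_SIZE
      pipelineParallelSize: 2    # minimum 2; cannot be set via PIPELINE_PARALLEL_SIZE
```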