Deploying models with KServe simplifies model serving, but the rapid growth of Large Language Models (LLMs) makes deploying these massive models on a single GPU increasingly challenging. To address this, leveraging multiple GPUs across multiple nodes has become essential. Fortunately, vLLM supports multi-node/multi-GPU deployment using Ray, and with the `vllm-multinode-runtime` serving runtime, OpenShift AI provides a solution for multi-node/multi-GPU setups.
This guide details the steps to enable multi-node/multi-GPU deployment with OpenShift AI model serving.
- Deploy Big LLMs with Multi-Worker and Multi-GPUs
- Important Disclaimer
- Tested Scenarios
- Considerations
- Demo Guide
- Notes for Multi-Node Setup
IMPORTANT DISCLAIMER: Read before proceeding!
- This repository and its demos are not supported by OpenShift AI/RHOAI; they rely on upstream projects.
- This is prototyping/testing work intended to confirm functionality and determine the necessary requirements.
- These features are not available in the RHOAI dashboard. If you want to implement them, you will need to adapt YAML files to fit your use case.
Tested Scenarios:

- OpenShift Cluster 4.15 (AWS)
- AWS g5.4xlarge instances (NVIDIA A10G - 24GiB vRAM)
- RHOAI 2.14
Considerations:

- Deployment Mode: Multi-node functionality is supported only in `RawDeployment` mode.
- Auto-scaling: Not available for multi-node setups. The autoscaler will automatically be set to `external`.
- Persistent Volume Claim (PVC): Required for multi-node configurations, and it must support the `ReadWriteMany` (RWX) access mode (see the sketch after this list).
- Required Operators:
  - Node Feature Discovery Operator: Required to detect node features.
  - NVIDIA GPU Operator: Required to use GPUs for inference.
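For reference, this is a minimal sketch of a PVC that satisfies the `ReadWriteMany` requirement. The name, size, and storage class are assumptions (the demo's kustomize overlays already create the storage the PoC needs); any RWX-capable storage class, such as one backed by the NFS provisioner installed below, will do.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc              # placeholder name
spec:
  accessModes:
    - ReadWriteMany            # RWX is required for multi-node serving
  resources:
    requests:
      storage: 50Gi            # size the claim for your model weights
  storageClassName: nfs-csi    # assumption: replace with your RWX-capable storage class
```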
Demo Guide:

- Export Variables
export DEMO_NAMESPACE="demo-multi-node-multi-gpu"
export MODEL_NAME="vllm-llama3-8b"
export MODEL_TYPE="llama3"
- Install NFS Operator
bash utils/nfs-operator.sh
- Install RHOAI and other operators
kubectl apply -k 1-rhoai-operators/overlays/
- Install RHOAI, NFD, NFS and NVIDIA GPU Instances
kubectl apply -k 2-rhoai-instances/overlays/
- Deploy the prerequisites for the PoC, including the model
kubectl apply -k 3-demo-prep/overlays/$MODEL_TYPE
- Deploy Custom CRD and vLLM Multi Node Serving Runtime Template
kubectl apply -k 4-demo-deploy-is-sr/overlays
oc process vllm-multinode-runtime-template -n $DEMO_NAMESPACE | kubectl apply -n $DEMO_NAMESPACE -f -
- Check the GPU resource status
podName=$(oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(kubectl get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor-worker --no-headers|cut -d' ' -f1)
oc -n $DEMO_NAMESPACE wait --for=condition=ready pod/${podName} --timeout=300s
- You can check the logs for both the head and worker pods (see the commands below).
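A minimal sketch for viewing the logs, assuming the `podName` and `workerPodName` variables captured in the previous step (add `-c <container>` if your pods run more than one container):

```bash
# Head node (predictor) logs
kubectl logs $podName -n $DEMO_NAMESPACE

# Worker node logs
kubectl logs $workerPodName -n $DEMO_NAMESPACE
```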
- Check the GPU memory size for both the head and worker pods:
echo "### HEAD NODE GPU Memory Size"
kubectl exec $podName -n $DEMO_NAMESPACE -- nvidia-smi
echo "### WORKER NODE GPU Memory Size"
kubectl exec $workerPodName -n $DEMO_NAMESPACE -- nvidia-smi
- To verify the status of your InferenceService, run the following commands:
oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
export isvc_url=$(oc get route -n $DEMO_NAMESPACE |grep $MODEL_NAME| awk '{print $2}')
- Send a RESTful request to the LLM deployed in Multi-Node Multi-GPU:
curl https://$isvc_url/v1/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$MODEL_NAME\",
\"prompt\": \"What is the biggest clothes retail company in the world?\",
\"max_tokens\": 100,
\"temperature\": 0
}"
- The LLM's response will look similar to this:
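An illustrative vLLM OpenAI-compatible completion response (not captured from a real run); the IDs, generated text, and token counts will differ:

```json
{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1715000000,
  "model": "vllm-llama3-8b",
  "choices": [
    {
      "index": 0,
      "text": " <generated answer about the largest clothing retailer>",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 112,
    "completion_tokens": 100
  }
}
```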
- You can also check the Ray cluster status with `ray status`:
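A minimal sketch, assuming the head pod captured in `podName` and that the `ray` CLI is available inside its serving container:

```bash
# List the Ray nodes and GPU usage from inside the head pod
kubectl exec $podName -n $DEMO_NAMESPACE -- ray status
```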
Notes for Multi-Node Setup:

- Parallelism Settings:
  - `TENSOR_PARALLEL_SIZE` and `PIPELINE_PARALLEL_SIZE` cannot be set via environment variables. They must be configured through `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` (see the sketch after this list).
  - In a multi-node ServingRuntime, both `workerSpec.tensorParallelSize` and `workerSpec.pipelineParallelSize` must be specified.
  - Minimum values: `workerSpec.tensorParallelSize`: 1, `workerSpec.pipelineParallelSize`: 2.
- Supported GPU Types:
  - Allowed GPU types: `nvidia.com/gpu` (default), `intel.com/gpu`, `amd.com/gpu`, and `habana.ai/gaudi`.
  - The GPU type can be specified in the `InferenceService`. However, if it differs from what is set in the `ServingRuntime`, both GPU types will be assigned, potentially causing issues.
- Autoscaler Configuration: The autoscaler must be configured as `external`.
- Storage Protocol: The only supported storage protocol for `storageUri` is `PVC`.
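To tie these notes together, here is a minimal sketch of an `InferenceService` that sets the parallelism through `workerSpec`, uses a PVC-backed `storageUri`, and runs in `RawDeployment` mode with an external autoscaler. The model name, PVC name, and model path are placeholders, and the exact schema comes from the custom CRD installed in step 4, so adapt it to your deployment:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: vllm-llama3-8b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment   # multi-node only works in RawDeployment mode
    serving.kserve.io/autoscalerClass: external       # autoscaler must be external
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-multinode-runtime
      storageUri: pvc://model-pvc/llama3-8b           # only PVC is supported; name/path are placeholders
    workerSpec:
      tensorParallelSize: 1      # minimum 1; cannot be set via TENSOR_PARALLEL_SIZE
      pipelineParallelSize: 2    # minimum 2; cannot be set via PIPELINE_PARALLEL_SIZE
```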