
Deploy Big LLMs with Multi-Worker and Multi-GPUs

Deploying models with KServe simplifies model serving, but the rapid growth of Large Language Models (LLMs) makes deploying these massive models on a single GPU increasingly challenging. To address this, leveraging multiple GPUs across multiple nodes has become essential. Fortunately, vLLM supports multi-node/multi-GPU deployment using Ray, and with the vllm-multinode-runtime serving runtime, OpenShift AI provides a solution for multi-node/multi-GPU setups.

This guide details the steps to enable multi-node/multi-GPU deployment with OpenShift AI model serving.

Table of Contents

  2. Important Disclaimer
  3. Tested Scenarios
  4. Considerations
  5. Demo Guide
  6. Notes for Multi-Node Setup

2. Important Disclaimer

IMPORTANT DISCLAIMER: Read before proceeding!

  • This repository and its demos are not supported by OpenShift AI/RHOAI; they rely on upstream projects.
  • This is prototyping/testing work intended to confirm functionality and determine the necessary requirements.
  • These features are not available in the RHOAI dashboard. If you want to implement them, you will need to adapt YAML files to fit your use case.

3. Tested Scenarios

4. Considerations

  1. Deployment Mode: Multi-node functionality is supported only in RawDeployment mode.
  2. Auto-scaling: Not available for multi-node setups. The autoscaler will automatically be set to external.
  3. Persistent Volume Claim (PVC): Required for multi-node configurations, and it must support the ReadWriteMany (RWX) access mode (see the PVC sketch after this list).
  4. Required Operators:
    • Node Feature Discovery Operator: Required to detect node features.
    • NVIDIA GPU Operator: Required to use GPUs for inference.
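
A minimal PVC sketch that satisfies the RWX requirement; the PVC name, size, and storage class below are illustrative (any RWX-capable storage class will do, for example one backed by the NFS operator installed later in this demo):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models-pvc            # illustrative name; the demo overlays define their own PVC
spec:
  accessModes:
    - ReadWriteMany               # RWX so the head and worker pods can share the model files
  resources:
    requests:
      storage: 50Gi               # illustrative size; large LLM weights may need more
  storageClassName: nfs-csi       # assumption: any RWX-capable storage class
EOF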

5. Demo Guide

5.1 Deploy RHOAI and Prereqs

  • Export Variables
export DEMO_NAMESPACE="demo-multi-node-multi-gpu"
export MODEL_NAME="vllm-llama3-8b"
export MODEL_TYPE="llama3"
  • Install NFS Operator
bash utils/nfs-operator.sh
  • Install RHOAI and other operators
kubectl apply -k 1-rhoai-operators/overlays/
  • Install RHOAI, NFD, NFS and NVIDIA GPU Instances
kubectl apply -k 2-rhoai-instances/overlays/
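  • (Optional) Roughly confirm that the operators and their instances are running; the namespaces below are the usual defaults for these operators and may differ in your cluster
oc get csv -n redhat-ods-operator
oc get pods -n redhat-ods-applications
oc get pods -n nvidia-gpu-operator
oc get pods -n openshift-nfd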

5.2 Deploy vLLM Multi-Node prerequisites

  • Deploy the prerequisites for the PoC including the Model
kubectl apply -k 3-demo-prep/overlays/$MODEL_TYPE
  • Deploy Custom CRD and vLLM Multi Node Serving Runtime Template
kubectl apply -k 4-demo-deploy-is-sr/overlays
oc process vllm-multinode-runtime-template -n $DEMO_NAMESPACE | kubectl apply -n $DEMO_NAMESPACE -f -  
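  • (Optional) Sanity-check that the ServingRuntime and InferenceService were created; the exact resource names depend on the overlays, so treat this as a quick check rather than expected output
kubectl get servingruntime -n $DEMO_NAMESPACE
kubectl get inferenceservice -n $DEMO_NAMESPACE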

5.3 Check and Validate the Model deployed in Multi-Node with Multi-GPUs

  • Check the GPU resource status
podName=$(oc get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor --no-headers|cut -d' ' -f1)
workerPodName=$(kubectl get pod -n $DEMO_NAMESPACE -l app=isvc.$MODEL_NAME-predictor-worker --no-headers|cut -d' ' -f1)

oc -n $DEMO_NAMESPACE wait --for=condition=ready pod/${podName} --timeout=300s
  • You can check the logs for both the head and worker pods (see the commands after these screenshots):
  • Head Node

(screenshot: head pod logs)

  • Worker Node

(screenshot: worker pod logs)
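
  • A minimal way to pull those logs yourself, using the pod name variables from the previous step (kubectl picks the default container in each pod)
kubectl logs $podName -n $DEMO_NAMESPACE
kubectl logs $workerPodName -n $DEMO_NAMESPACE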

  • Check the GPU memory size for both the head and worker pods
echo "### HEAD NODE GPU Memory Size"
kubectl exec $podName -- nvidia-smi
echo "### Worker NODE GPU Memory Size"
kubectl exec $workerPodName -- nvidia-smi
  • To verify the status of your InferenceService and get its route, run the following commands:
oc wait --for=condition=ready pod/${podName} -n $DEMO_NAMESPACE --timeout=300s
export isvc_url=$(oc get route -n $DEMO_NAMESPACE |grep $MODEL_NAME| awk '{print $2}')
  • Send a RESTful request to the LLM deployed in Multi-Node Multi-GPU:
curl https://$isvc_url/v1/completions \
   -H "Content-Type: application/json" \
   -d "{
        \"model\": \"$MODEL_NAME\",
        \"prompt\": \"What is the biggest clothes retail company in the world?\",
        \"max_tokens\": 100,
        \"temperature\": 0
    }"
  • The LLM's answer will look like this:

(screenshot: example LLM answer)

  • You can also check the Ray cluster status with ray status:

(screenshot: ray status output)
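
  • One way to run that check from outside the cluster, assuming the ray CLI is available in the head pod's container (the multi-node runtime is built on Ray)
kubectl exec $podName -n $DEMO_NAMESPACE -- ray status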

6. Notes for Multi-Node Setup

  1. Parallelism Settings:

    • TENSOR_PARALLEL_SIZE and PIPELINE_PARALLEL_SIZE cannot be set via environment variables. They must be configured through workerSpec.tensorParallelSize and workerSpec.pipelineParallelSize (see the sketch after these notes).
    • In a multi-node ServingRuntime, both workerSpec.tensorParallelSize and workerSpec.pipelineParallelSize must be specified.
    • The minimum values are:
      • workerSpec.tensorParallelSize: 1
      • workerSpec.pipelineParallelSize: 2
  2. Supported GPU Types:

    • Allowed GPU types: nvidia.com/gpu (default), intel.com/gpu, amd.com/gpu, and habana.ai/gaudi.
    • The GPU type can be specified in InferenceService. However, if the GPU type differs from what is set in the ServingRuntime, both GPU types will be assigned, potentially causing issues.
  3. Autoscaler Configuration: The autoscaler must be configured as external.

  4. Storage Protocol:

    • The only supported storage protocol for storageUri is PVC.
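
A trimmed InferenceService sketch showing where these settings live; treat it as an illustration of the field layout rather than a drop-in manifest. The runtime name comes from this demo, while the model format, storage path, GPU counts, and annotations are assumptions based on common KServe conventions and should be checked against the overlays in this repository:

cat <<EOF | kubectl apply -n $DEMO_NAMESPACE -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: $MODEL_NAME
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment   # multi-node works only in RawDeployment mode
    serving.kserve.io/autoscalerClass: external       # autoscaler must be external
spec:
  predictor:
    model:
      runtime: vllm-multinode-runtime
      modelFormat:
        name: vLLM                                    # assumption: format name used by the runtime
      storageUri: pvc://llm-models-pvc/llama3-8b      # illustrative path; PVC is the only supported protocol
      resources:
        limits:
          nvidia.com/gpu: "1"                         # GPUs on the head pod; illustrative
    workerSpec:
      tensorParallelSize: 1                           # minimum 1
      pipelineParallelSize: 2                         # minimum 2; number of nodes involved (head + workers)
      resources:
        limits:
          nvidia.com/gpu: "1"                         # GPUs per worker pod; illustrative
EOF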
