This demo shows how to deploy the Llama2 13B model on 2xA10G GPUs.
- Deploy the Llama2 13B model on 2xA10G GPUs:
kubectl apply -k llm-servers/overlays/llama2-13B
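If you want to review the manifests before applying them (for example, to confirm the Deployment requests two GPUs), you can render the overlay first:

kubectl kustomize llm-servers/overlays/llama2-13B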
- Remember to add your HUGGING_FACE_HUB_TOKEN to the environment variables so the model can be downloaded from the Hugging Face Hub. One way to do this is shown below.
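A minimal sketch, assuming the overlay creates a Deployment named llm1 (inferred from the Pod labels used later in this demo): store the token in a Secret and inject it into the Deployment. The Secret name hf-token is only an example.

kubectl create secret generic hf-token -n multi-gpu-poc --from-literal=HUGGING_FACE_HUB_TOKEN=<your-token>
kubectl set env deployment/llm1 -n multi-gpu-poc --from=secret/hf-token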
- Check that the LLM is running properly:
kubectl get pod -n multi-gpu-poc
NAME READY STATUS RESTARTS AGE
llm1-f687846b9-68bvq 1/1 Running 0 2m1s
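Downloading the 13B weights and starting the server can take several minutes, so the Pod may not be Ready immediately. To block until it is (the app=llm1 label matches the logs command below; adjust the timeout to your network speed):

kubectl wait pod -n multi-gpu-poc -l app=llm1 --for=condition=Ready --timeout=15m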
- Check the logs of the LLM Pod:
kubectl logs -n multi-gpu-poc -l app=llm1
The output should be similar to:
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
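These counters stay at zero until the model receives traffic. To watch them move while you send requests, you can stream the logs instead:

kubectl logs -f -n multi-gpu-poc -l app=llm1 --tail=20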
- Check the NVIDIA GPU consumption:
POD_NAME=$(kubectl get pod -n nvidia-gpu-operator -l app=nvidia-device-plugin-daemonset -o jsonpath="{.items[0].metadata.name}")
kubectl exec -n nvidia-gpu-operator $POD_NAME -- nvidia-smi
The output should be similar to:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 20C P8 23W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 20C P8 21W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 27C P0 67W / 300W | 20596MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 25C P0 66W / 300W | 20594MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
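GPUs 2 and 3 each hold roughly 20 GiB of model weights, consistent with the 13B model being sharded across two GPUs. For a compact, script-friendly view of the same data, nvidia-smi can emit CSV (reusing the $POD_NAME variable set above):

kubectl exec -n nvidia-gpu-operator $POD_NAME -- nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv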
- Create a Workbench, clone the repo, and execute the llm_rest_requests.ipynb notebook to query the LLM.
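If you prefer to test from a terminal instead of the notebook, and assuming the server exposes a vLLM-style OpenAI-compatible API (suggested by the log format above), a request might look like the following. The Service name llm1, port 8000, and model name are assumptions; adjust them to your deployment.

curl -s http://llm1.multi-gpu-poc.svc.cluster.local:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Llama-2-13b-chat-hf", "prompt": "What is Kubernetes?", "max_tokens": 100}'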