This demo shows how to deploy the Llama2 13B model on 2xA10G GPUs.
- Deploy the Llama2 13B model on 2xA10G GPUs:
kubectl apply -k llm-servers/overlays/llama2-13B
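If you want to review the manifests before applying them (for example, to confirm the Deployment requests two GPUs), you can render the overlay first:

kubectl kustomize llm-servers/overlays/llama2-13B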
- Remember to add your HUGGING_FACE_HUB_TOKEN to the environment variables so the model can be downloaded from the Hugging Face Hub. One way to do this is shown below.
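A minimal sketch, assuming the overlay creates a Deployment named llm1 (inferred from the Pod labels used later in this demo): store the token in a Secret and inject it into the Deployment. The Secret name hf-token is only an example.

kubectl create secret generic hf-token -n multi-gpu-poc --from-literal=HUGGING_FACE_HUB_TOKEN=<your-token>
kubectl set env deployment/llm1 -n multi-gpu-poc --from=secret/hf-token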
- Check that the LLM is running properly:
kubectl get pod -n multi-gpu-poc
NAME READY STATUS RESTARTS AGE
llm1-f687846b9-68bvq 1/1 Running 0 2m1s
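Downloading the 13B weights and starting the server can take several minutes, so the Pod may not be Ready immediately. To block until it is (the app=llm1 label matches the logs command below; adjust the timeout to your network speed):

kubectl wait pod -n multi-gpu-poc -l app=llm1 --for=condition=Ready --timeout=15m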
- Check the logs of the LLM Pod:
kubectl logs -n multi-gpu-poc -l app=llm1
The output should be similar to:
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
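These counters stay at zero until the model receives traffic. To watch them move while you send requests, you can stream the logs instead:

kubectl logs -f -n multi-gpu-poc -l app=llm1 --tail=20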
- Check the NVIDIA GPU consumption:
POD_NAME=$(kubectl get pod -n nvidia-gpu-operator -l app=nvidia-device-plugin-daemonset -o jsonpath="{.items[0].metadata.name}")
kubectl exec -n nvidia-gpu-operator $POD_NAME -- nvidia-smi
The output should be similar to:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 20C P8 23W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 20C P8 21W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 27C P0 67W / 300W | 20596MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 25C P0 66W / 300W | 20594MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
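GPUs 2 and 3 each hold roughly 20 GiB of model weights, consistent with the 13B model being sharded across two GPUs. For a compact, script-friendly view of the same data, nvidia-smi can emit CSV (reusing the $POD_NAME variable set above):

kubectl exec -n nvidia-gpu-operator $POD_NAME -- nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv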
- Create a Workbench, clone the repo, and execute the llm_rest_requests.ipynb notebook to query the LLM.
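If you prefer to test from a terminal instead of the notebook, and assuming the server exposes a vLLM-style OpenAI-compatible API (suggested by the log format above), a request might look like the following. The Service name llm1, port 8000, and model name are assumptions; adjust them to your deployment.

curl -s http://llm1.multi-gpu-poc.svc.cluster.local:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Llama-2-13b-chat-hf", "prompt": "What is Kubernetes?", "max_tokens": 100}'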