Issue getting Llama3 8b running on GKE #43

Open
francescov1 opened this issue May 24, 2024 · 22 comments

@francescov1

francescov1 commented May 24, 2024

I'm trying to deploy Llama3 8b on GKE using optimum-tpu but running into some trouble.

I'm following the instructions here: https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference. I built the Docker image using the make command mentioned there.
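
For reference, my build-and-push steps were roughly the following. The `make tpu-tgi` target is the one mentioned in the repo docs; the local image tag is a placeholder, and the Artifact Registry path is the one used in my manifest below:

```bash
# Build the TGI image from the optimum-tpu repo, then tag and push it to
# Artifact Registry so the GKE nodes can pull it.
git clone https://github.com/huggingface/optimum-tpu.git
cd optimum-tpu
make tpu-tgi

# Placeholder local tag -- use whatever tag the make target actually produces.
docker tag huggingface/optimum-tpu:latest \
    us-central1-docker.pkg.dev/project-lighthouse-403916/tpus/optimum-tpu:latest
docker push us-central1-docker.pkg.dev/project-lighthouse-403916/tpus/optimum-tpu:latest
```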

The server starts booting up, but gets stuck at "Warming up model". See logs below:

2024-05-24T17:26:26.309789Z  INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3-8B", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_c
2024-05-24T17:26:26.309895Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-24T17:26:26.400493Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-05-24T17:26:26.400639Z  INFO download: text_generation_launcher: Starting download process.
2024-05-24T17:26:26.475982Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-05-24T17:26:51.727997Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-24T17:26:51.728345Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-24T17:26:54.273164Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-05-24T17:26:54.332635Z  INFO shard-manager: text_generation_launcher: Shard ready in 2.603384915s rank=0
2024-05-24T17:26:54.431655Z  INFO text_generation_launcher: Starting Webserver
2024-05-24T17:26:54.453486Z  INFO text_generation_router: router/src/main.rs:185: Using the Hugging Face API
2024-05-24T17:26:54.453528Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-05-24T17:26:54.739323Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved_special_token_151|>' w

... (lots more tokenizer warnings, same as the ones above and below)

2024-05-24T17:26:54.739610Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved_special_token_250|>' w
2024-05-24T17:26:54.866449Z  INFO text_generation_router: router/src/main.rs:471: Serving revision 62bd457b6fe961a42a631306577e622c83876cb6 of model meta-llama/Meta-Llama-3-8B
2024-05-24T17:26:54.866479Z  INFO text_generation_router: router/src/main.rs:253: Using config Some(Llama)
2024-05-24T17:26:54.866493Z  INFO text_generation_router: router/src/main.rs:265: Using the Hugging Face API to retrieve tokenizer config
2024-05-24T17:28:23.784610Z  INFO text_generation_router: router/src/main.rs:314: Warming up model

Here's my config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimum-tpu-llama3-8b-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: optimum-tpu-llama3-8b-server
  template:
    metadata:
      labels:
        app: optimum-tpu-llama3-8b-server
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      hostNetwork: true
      hostIPC: true
      containers:
        - name: optimum-tpu-llama3-8b-server
          image: us-central1-docker.pkg.dev/project-lighthouse-403916/tpus/optimum-tpu:latest
          securityContext:
            privileged: true
          args:
            - "--model-id=meta-llama/Meta-Llama-3-8B"
            - "--max-concurrent-requests=1"
            - "--max-input-length=512"
            - "--max-total-tokens=1024"
            - "--max-batch-prefill-tokens=512"
            - "--max-batch-total-tokens=1024"
          env:
            - name: HF_TOKEN
              value: <token>
            - name: HUGGING_FACE_HUB_TOKEN
              value: <token>
            - name: HF_BATCH_SIZE
              value: "1"
            - name: HF_SEQUENCE_LENGTH
              value: "1024"
          ports:
            - containerPort: 80
          volumeMounts:
            - name: data-volume
              mountPath: /data
          resources:
            requests:
              google.com/tpu: 8
            limits:
              google.com/tpu: 8
      volumes:
        - name: data-volume
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: optimum-tpu-llama3-8b-svc
spec:
  selector:
    app: optimum-tpu-llama3-8b-server
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80

Any ideas?

@tengomucho
Collaborator

Hi Francesco,
Sorry we didn't have the chance to answer earlier... we'll look into this and get back to you soon!

@carlesoctav

any updates?

@tengomucho
Collaborator

I just re-tried this with llama3-8b and it worked fine, but I tested with a lower input length and fewer total tokens. With these settings the server takes ~15s to warm up. Can you retry with --max-input-length 32 --max-total-tokens 64?
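
In your Deployment that would just mean shrinking the corresponding args, something like this (other fields unchanged; I left out the batch-token flags here for a quick test):

```yaml
args:
  - "--model-id=meta-llama/Meta-Llama-3-8B"
  - "--max-concurrent-requests=1"
  - "--max-input-length=32"
  - "--max-total-tokens=64"
```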

@francescov1
Author

@tengomucho Unfortunately that didn't work. I used the same manifests as above with the changes you mentioned. I also rebuilt the docker image with the latest changes from main.

What TPU are you running on? Is it possible that the v5e node is not big enough and it's unable to use multiple nodes? I can try on a v5p if that's better.

@tengomucho
Collaborator

I tried on a v5e-litepod8. The only difference I would say is that I did not use GKE, I used the docker container generated by make tpu-tgi as explained here.

@francescov1
Author

Hmm, I don't see why my K8s config would be any different from that.

Is there a prebuilt public Docker image I can test out?

@tengomucho
Collaborator

Let me cook one for you, I'll do it on Monday and I'll get back to you.

@rick-c-goog

Any update on this? I had the same issue with GKE; none of the Hugging Face models work (gemma-2b, mistral, llama, etc.). No error in the logs either, it just hangs at the "Warming up model" INFO line for Gemma.

For Mistral it's a little bit different:
2024-06-23T00:48:10.071293Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-06-23T00:48:10.199181Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-23T00:48:10.199294Z INFO download: text_generation_launcher: Starting download process.
2024-06-23T00:48:10.272564Z WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-06-23T00:48:56.746082Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-23T00:48:56.791824Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-23T00:48:59.480818Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-23T00:48:59.495306Z INFO shard-manager: text_generation_launcher: Shard ready in 2.702693453s rank=0
2024-06-23T00:48:59.548993Z INFO text_generation_launcher: Starting Webserver
2024-06-23T00:48:59.554356Z INFO text_generation_router: router/src/main.rs:195: Using the Hugging Face API
2024-06-23T00:48:59.554399Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-23T00:48:59.727654Z WARN text_generation_router: router/src/main.rs:233: Could not retrieve model info from the Hugging Face hub.
2024-06-23T00:48:59.770889Z INFO text_generation_router: router/src/main.rs:289: Using config Some(Mistral)
2024-06-23T00:48:59.770904Z WARN text_generation_router: router/src/main.rs:298: no pipeline tag found for model mistralai/Mistral-7B-v0.3

@rick-c-goog

At the same time, I was able to run the following example test inside the GKE pod that was created:
https://github.com/huggingface/optimum-tpu/blob/main/examples/text-generation/generation.py
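
For reference, something along these lines (the pod name and the script location inside the pod are placeholders for my setup):

```bash
# Run the standalone generation example from inside the TPU pod.
# <tpu-pod-name> and the script path are placeholders -- adjust for your setup.
kubectl exec -it <tpu-pod-name> -- \
    python /path/to/optimum-tpu/examples/text-generation/generation.py
```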

@rick-c-goog

@tengomucho, any comment on the optimum-tpu GKE issues, or on a potential public image?

@tengomucho
Collaborator

Hey, sorry it took me longer to get this done, but you should be able to test this TGI image huggingface/optimum-tpu:v0.1.1-tgi.
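
To sanity-check it outside of GKE first, it can also be pulled directly, e.g. via GHCR (same image, fully-qualified path):

```bash
docker pull ghcr.io/huggingface/optimum-tpu:v0.1.1-tgi
```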

@rick-c-goog

Thank you @tengomucho, it got stuck/hung at the same "Warming up model" step:
2024-06-25 11:12:01.541 EDT
{fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.541489Z}
2024-06-25 11:12:01.541 EDT
{fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.541603Z}
2024-06-25 11:12:01.628 EDT
{fields: {…}, level: WARN, target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.628394Z}
2024-06-25 11:12:12.752 EDT
{fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:12.752135Z}
2024-06-25 11:12:12.752 EDT
{fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:12.752408Z}
2024-06-25 11:12:15.687 EDT
{fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.687254Z}
2024-06-25 11:12:15.756 EDT
{fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.756244Z}
2024-06-25 11:12:15.855 EDT
{fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.855187Z}
2024-06-25 11:12:15.861 EDT
Using the Hugging Face API
2024-06-25 11:12:15.862 EDT
Token file not found "/root/.cache/huggingface/token"
2024-06-25 11:12:16.568 EDT
Could not retrieve model info from the Hugging Face hub.
2024-06-25 11:12:16.585 EDT
Using config Some(Gemma)
2024-06-25 11:12:16.585 EDT
Using the Hugging Face API to retrieve tokenizer config
2024-06-25 11:12:16.587 EDT
no pipeline tag found for model google/gemma-2b-it
2024-06-25 11:13:03.877 EDT
Warming up model

@tengomucho
Collaborator

tengomucho commented Jun 25, 2024

Umh, strange. I just tested it and it worked fine, with this command line BTW:

HF_TOKEN=<your_hf_token_here>
MODEL_ID=google/gemma-2b

sudo docker run --net=host \
                --privileged \
                -v $(pwd)/data:/data \
                -e HF_TOKEN=${HF_TOKEN} \
                ghcr.io/huggingface/optimum-tpu:v0.1.1-tgi \
                --model-id ${MODEL_ID} \
                --max-concurrent-requests 4 \
                --max-input-length 32 \
                --max-total-tokens 64 \
                --max-batch-size 1

And it took ~12s to warm up:

2024-06-25T15:56:14.798018Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-06-25T15:57:47.220655Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-06-25T15:57:54.872585Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64

@rick-c-goog

I believe it is GKE-specific.

@francescov1
Author

@tengomucho I'm seeing the same thing. I retried the deployment manifest I pasted above, but with the image huggingface/optimum-tpu:v0.1.1-tgi, and I'm still getting the same behavior.

@liurupeng

liurupeng commented Jun 26, 2024

this one works for me:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      hostNetwork: true
      volumes:
        - name: data-volume
          emptyDir: {}
      containers:
      - name: tgi-tpu
        image: {optimum-tpu-image}
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=32
        - --max-total-tokens=64
        - --max-batch-size=1
        securityContext:
            privileged: true
        env:
          - name: HF_TOKEN
            value: {your_token}
          - name: HUGGING_FACE_HUB_TOKEN
            value: {your_token}
        ports:
        - containerPort: 80
        volumeMounts:
            - name: data-volume
              mountPath: /data
        resources:
          requests:
            google.com/tpu: 8
          limits:
            google.com/tpu: 8
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080  
      targetPort: 80
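
Applied and checked with the usual steps (the manifest file name is a placeholder; the resource names match the manifest above):

```bash
# Apply the manifest, wait for the rollout, then follow the server logs.
kubectl apply -f tgi-tpu.yaml
kubectl rollout status deploy/tgi-tpu
kubectl logs -f deploy/tgi-tpu
```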

@rick-c-goog

Thanks @liurupeng,
I got the following logs:

2024-06-27T02:43:50.500866Z  INFO shard-manager: text_generation_launcher: Shard ready in 2.703506822s rank=0
2024-06-27T02:43:50.599561Z  INFO text_generation_launcher: Starting Webserver
2024-06-27T02:43:50.611767Z  INFO text_generation_router: router/src/main.rs:185: Using the Hugging Face API
2024-06-27T02:43:50.611800Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-27T02:43:51.329230Z  INFO text_generation_router: router/src/main.rs:471: Serving revision 2ac59a5d7bf4e1425010f0d457dde7d146658953 of model google/gemma-2b
2024-06-27T02:43:51.329250Z  INFO text_generation_router: router/src/main.rs:253: Using config Some(Gemma)
2024-06-27T02:43:51.329254Z  INFO text_generation_router: router/src/main.rs:265: Using the Hugging Face API to retrieve tokenizer config
2024-06-27T02:44:48.962935Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-06-27T02:44:55.038381Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-06-27T02:44:55.038396Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-06-27T02:44:55.038401Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0

So I assume the TGI model should be up and running, but the curl validation command throws a connection refused error (I tried both container port 80 and 8000):
kubectl run -it busybox --image radial/busyboxplus:curl
If you don't see a command prompt, try pressing enter.
[ root@busybox:/ ]$ curl 34.118.229.124:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
curl: (7) Failed to connect to 34.118.229.124 port 8080: Connection refused
[ root@busybox:/ ]$

Did you try the curl connection to validate?

@liurupeng

@rick-c-goog I ran the commands below:

kubectl port-forward svc/service 8080:8080

curl 127.0.0.1:8080/generate     -X POST     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}'     -H 'Content-Type: application/json'

@rick-c-goog

Thanks @liurupeng, the port-forward curl to 127.0.0.1 is working, and the busybox curl to the Service cluster IP worked afterwards as well.
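
For anyone else hitting this, the in-cluster check looks roughly like this (using the Service name `service` and port 8080 from the manifest above; curling the cluster IP directly works the same way):

```bash
# From a pod inside the cluster (e.g. the busybox pod), hit the Service on port 8080.
curl http://service:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```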

@Bihan

Bihan commented Jul 1, 2024

@tengomucho I am testing optimum-tpu with v2-8 and getting similar issues as discussed above. Does optimum-tpu only support v5e-litepod?

@tengomucho
Collaborator

@Bihan For now we have only tested v5e configurations.

@Bihan

Bihan commented Jul 1, 2024

> @Bihan For now we have only tested v5e configurations.

@tengomucho Thank you for the quick reply. Do you think testing with v2-8 or v3-8 would require major modifications?
