[TPU] Implement prefix caching for TPUs #10307

WoosukKwon · 2024-11-13T21:40:13Z

No description provided.

github-actions · 2024-11-13T21:40:27Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

Signed-off-by: Woosuk Kwon <[email protected]>

robertgshaw2-neuralmagic · 2024-11-13T22:41:06Z

Nice work!

vanbasten23 · 2024-11-14T19:21:42Z

vllm/attention/backends/pallas.py

+                output = output.permute(0, 2, 1, 3)
+            else:
+                # Prefill with paged KV cache.
+                # TODO(woosuk): Tune the below knobs.


Thanks Woosuk for writing the PR.

I'm benchmarking the kernel, so I'll likely have recommended values for num_kv_pages_per_compute_block and num_queries_per_compute_block to share.

Also, the revised paged attention kernel is in the torch_xla nightly. Could you try again? I pulled your PR, and it seems additional work is needed to compute effective_q_lens and plumb it through to the kernel.

cc: @WoosukKwon

@vanbasten23 Is the fixed kernel available in today's nightly?

@vanbasten23 After the kernel fix, the model generates correct outputs with prefix caching 🎉

Awesome. Thanks for confirming!
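
As an illustration of the effective_q_lens point in this thread, here is a minimal sketch of how the query length for a prefill with a cached prefix might be derived. It assumes that effective_q_lens means the number of real (non-padding) query tokens per sequence, and that the query is padded up to a multiple of num_queries_per_compute_block so the divisibility assertion in the diff holds. The names round_up, PrefillPlan, and plan_prefill are hypothetical and are not taken from this PR or from torch_xla.

```python
from dataclasses import dataclass


def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


@dataclass
class PrefillPlan:
    num_cached_tokens: int  # tokens whose KV is assumed to already be in the paged cache
    effective_q_len: int    # real (unpadded) query tokens that still need attention
    padded_q_len: int       # query length after padding to the compute block size


def plan_prefill(
    prompt_len: int,
    num_cached_blocks: int,
    block_size: int,
    num_queries_per_compute_block: int = 16,
) -> PrefillPlan:
    """Hypothetical helper: split a prompt into cached prefix and new query tokens."""
    num_cached_tokens = num_cached_blocks * block_size
    if num_cached_tokens >= prompt_len:
        # Assumption: at least one token must be computed for a prefill step.
        raise ValueError("cached prefix must be shorter than the prompt")

    effective_q_len = prompt_len - num_cached_tokens
    padded_q_len = round_up(effective_q_len, num_queries_per_compute_block)

    # Mirrors the `seq_len % num_queries_per_compute_block == 0` check in the diff.
    assert padded_q_len % num_queries_per_compute_block == 0
    return PrefillPlan(num_cached_tokens, effective_q_len, padded_q_len)


if __name__ == "__main__":
    print(plan_prefill(prompt_len=100, num_cached_blocks=4, block_size=16))
```

The point of carrying both lengths is that the kernel sees the padded shape while the effective length tells it which query rows are real; whether the PR computes them this way is not confirmed by the conversation.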

Signed-off-by: Woosuk Kwon <[email protected]>

vanbasten23 · 2024-11-15T18:36:18Z

examples/offline_inference_tpu.py

 outputs = llm.generate(prompts, sampling_params)
-for output, answer in zip(outputs, answers):
+for output in outputs:


I wonder if you need a test for the prefix caching.

vanbasten23 · 2024-11-15T19:28:19Z

Btw, which command did you use to run examples/offline_inference_tpu.py? I used $ python vllm/examples/offline_inference_tpu.py but it fails. Do you need to use a model other than "google/gemma-2b"?

robertgshaw2-neuralmagic · 2024-11-16T04:35:58Z

vllm/attention/backends/pallas.py

+                num_kv_pages_per_compute_block = 16
+                num_queries_per_compute_block = 16
+                assert seq_len % num_queries_per_compute_block == 0
+                output = torch.ops.xla.multi_queries_paged_attention(


@vanbasten23 - does this new kernel have the same SMEM requirements as the original paged_attention where the entire block table is stored in SMEM?

E.g. for the decoding run (see below), we split the batch dimension into smaller chunks and run the kernel multiple times
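
The batch-splitting approach described in this comment can be sketched as follows. This is a rough illustration, not the code in pallas.py: it assumes the block table is the only SMEM consumer, that each entry is a 4-byte integer, and that the kernel is any callable taking a slice of queries and the matching slice of block tables. The names max_batch_for_smem and run_in_batch_chunks are hypothetical.

```python
from typing import Callable, List, Sequence


def max_batch_for_smem(
    smem_budget_bytes: int, pages_per_seq: int, bytes_per_entry: int = 4
) -> int:
    """Hypothetical estimate of how many sequences' block tables fit in SMEM.

    Assumes the block table (batch x pages_per_seq entries) is the only thing
    stored in SMEM and that each entry takes `bytes_per_entry` bytes.
    """
    return max(1, smem_budget_bytes // (pages_per_seq * bytes_per_entry))


def run_in_batch_chunks(
    kernel: Callable[[Sequence, Sequence], List],
    queries: Sequence,
    block_tables: Sequence,
    max_batch_per_call: int,
) -> List:
    """Run `kernel` on slices of the batch and concatenate the per-slice outputs."""
    if max_batch_per_call <= 0:
        raise ValueError("max_batch_per_call must be positive")
    if len(queries) != len(block_tables):
        raise ValueError("queries and block_tables must have the same batch size")

    outputs: List = []
    for start in range(0, len(queries), max_batch_per_call):
        end = start + max_batch_per_call
        # Each call only sees the block-table rows for its own slice of the batch.
        outputs.extend(kernel(queries[start:end], block_tables[start:end]))
    return outputs


if __name__ == "__main__":
    # Stand-in kernel for demonstration only: one output per sequence.
    def fake_kernel(q, bt):
        return [x + len(t) for x, t in zip(q, bt)]

    chunk = max_batch_for_smem(smem_budget_bytes=1024, pages_per_seq=64)
    print(run_in_batch_chunks(fake_kernel, list(range(10)), [[0] * 3] * 10, chunk))
```

Chunking trades extra kernel launches for a bounded block-table footprint per call, which is why the question of whether the new kernel has the same SMEM requirement matters for the prefill path as well.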

[TPU] Implement prefix caching for TPUs

5d0b92c

mergify bot added the ci/build label Nov 13, 2024

WoosukKwon added the tpu Related to Google TPUs label Nov 13, 2024

fix

1aaff0c

Signed-off-by: Woosuk Kwon <[email protected]>

vanbasten23 reviewed Nov 14, 2024

WoosukKwon added 2 commits November 14, 2024 21:50

Merge branch 'main' into tpu-prefix-caching

b1427f6

New kernel

a7ad695

Signed-off-by: Woosuk Kwon <[email protected]>

vanbasten23 reviewed Nov 15, 2024

vanbasten23 approved these changes Nov 15, 2024

robertgshaw2-neuralmagic reviewed Nov 16, 2024
