Fix KV cache error in wildguard.py #2

Open
comfzy wants to merge 1 commit into main

Conversation


@comfzy commented on Jul 1, 2024

Running on an NVIDIA GeForce RTX 3090, model initialization fails with:

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (30448). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

This change caps `max_model_len` so the model fits within the available KV cache.
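
For reference, both knobs the error message mentions are constructor arguments on vLLM's `LLM` class. A minimal sketch of the two options (the model id below is an assumption used only for illustration; wildguard.py defines its own `MODEL_NAME`):

```python
from vllm import LLM

MODEL_NAME = "allenai/wildguard"  # assumed model id, for illustration only

# Option 1: give vLLM a larger share of GPU memory for the KV cache
# (gpu_memory_utilization defaults to 0.9).
model = LLM(model=MODEL_NAME, gpu_memory_utilization=0.95)

# Option 2: cap the context length so a full-length sequence fits in the
# KV cache that is actually available (30448 tokens in the error above).
model = LLM(model=MODEL_NAME, max_model_len=30448)
```
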
@kavelrao (Collaborator) left a comment

Thanks for submitting this fix! I hadn't run it on a 3090 before, so I didn't encounter this.

But I wonder if there's a more general way to address it by detecting the memory limit of the device instead of special-casing the RTX 3090? I bet other consumer GPUs would have the same issue.

@kavelrao (Collaborator) left a comment

A couple of comments; let me know if you'd rather I just implement this fix instead of going back and forth. Thank you for your patience!

wildguard/wildguard.py

self.model = LLM(model=MODEL_NAME)
gpu_name = torch.cuda.get_device_name(0)
if gpu_name == 'NVIDIA GeForce RTX 3090':
    self.model = LLM(model=MODEL_NAME,max_model_len=30448)

Suggested change:
- self.model = LLM(model=MODEL_NAME,max_model_len=30448)
+ self.model = LLM(model=MODEL_NAME, max_model_len=30448)

Comment on lines +182 to +183
gpu_name = torch.cuda.get_device_name(0)
if gpu_name == 'NVIDIA GeForce RTX 3090':

Suggested change:
- gpu_name = torch.cuda.get_device_name(0)
- if gpu_name == 'NVIDIA GeForce RTX 3090':
+ if torch.cuda.get_device_properties(0).total_memory < 30e9:

And to make this work with `ephemeral_model=True`, you will also need to add a `max_model_len` parameter to `subprocess_inference_with_vllm` and pass it through to `create_and_inference_with_vllm`.
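
In case it's useful, here is a rough sketch of that plumbing. The signatures below are assumptions for illustration (the real `subprocess_inference_with_vllm` and `create_and_inference_with_vllm` in wildguard.py take different arguments); the point is just threading an optional `max_model_len` from the outer call down to the `LLM(...)` constructor:

```python
import multiprocessing as mp
from typing import Optional

MODEL_NAME = "allenai/wildguard"  # assumed model id, for illustration only

def create_and_inference_with_vllm(prompts, max_model_len: Optional[int], queue):
    """Build the vLLM engine inside the child process and push results back (signature assumed)."""
    from vllm import LLM, SamplingParams  # imported here so the parent process never touches CUDA
    kwargs = {"model": MODEL_NAME}
    if max_model_len is not None:
        kwargs["max_model_len"] = max_model_len  # cap the context so the KV cache fits on smaller GPUs
    llm = LLM(**kwargs)
    outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))
    queue.put([o.outputs[0].text for o in outputs])

def subprocess_inference_with_vllm(prompts, max_model_len: Optional[int] = None):
    """Run inference in a child process so GPU memory is fully released afterwards (ephemeral_model=True path)."""
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=create_and_inference_with_vllm, args=(prompts, max_model_len, queue))
    proc.start()
    results = queue.get()
    proc.join()
    return results
```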
