Fix KV cache error in wildguard.py #2
base: main
Conversation
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (30448). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
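For context, a minimal sketch of the two knobs the error message points at when constructing the engine (the model name and the concrete values here are placeholders, not the fix adopted in this PR):

```python
from vllm import LLM

# Option 1: give vLLM a larger share of GPU memory for the KV cache.
llm = LLM(model="allenai/wildguard", gpu_memory_utilization=0.95)

# Option 2: shrink the maximum context so it fits in the KV cache that is available.
llm = LLM(model="allenai/wildguard", max_model_len=30448)
```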
Thanks for submitting this fix; I hadn't run it on a 3090 before, so I didn't encounter this.
But I wonder if there's a more general way to address it by detecting the memory limit of the device instead of special-casing the RTX 3090? I bet other consumer GPUs would have the same issue.
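For example, a minimal sketch of what memory-based detection could look like (the `30e9` threshold, the `30448` cap, and the `allenai/wildguard` model name are assumptions carried over from this patch, not settled values):

```python
import torch
from vllm import LLM

MODEL_NAME = "allenai/wildguard"  # assumed; use whatever wildguard.py already defines

def build_llm(model_name: str = MODEL_NAME) -> LLM:
    """Cap max_model_len on any small-memory GPU instead of matching device names."""
    kwargs = {}
    if torch.cuda.is_available():
        total_mem = torch.cuda.get_device_properties(0).total_memory  # bytes
        if total_mem < 30e9:  # e.g. the 24 GB RTX 3090; threshold is an assumption
            kwargs["max_model_len"] = 30448
    return LLM(model=model_name, **kwargs)
```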
A couple of comments; let me know if you'd rather I just implement this fix instead of going back and forth. Thank you for your patience!
self.model = LLM(model=MODEL_NAME)
gpu_name = torch.cuda.get_device_name(0)
if gpu_name == 'NVIDIA GeForce RTX 3090':
    self.model = LLM(model=MODEL_NAME,max_model_len=30448)
Suggested change:
- self.model = LLM(model=MODEL_NAME,max_model_len=30448)
+ self.model = LLM(model=MODEL_NAME, max_model_len=30448)
gpu_name = torch.cuda.get_device_name(0)
if gpu_name == 'NVIDIA GeForce RTX 3090':
Suggested change:
- gpu_name = torch.cuda.get_device_name(0)
- if gpu_name == 'NVIDIA GeForce RTX 3090':
+ if torch.cuda.get_device_properties(0).total_memory < 30e9:
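Put together, a sketch of how this part of the initializer could read with both suggestions applied (assuming the default-initialization path stays as the fallback; the threshold and cap are carried over from the patch):

```python
if torch.cuda.get_device_properties(0).total_memory < 30e9:
    self.model = LLM(model=MODEL_NAME, max_model_len=30448)
else:
    self.model = LLM(model=MODEL_NAME)
```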
And to make this work with `ephemeral_model=True`, you will also need to add a parameter for `max_model_len` to `subprocess_inference_with_vllm` and pass it through to `create_and_inference_with_vllm`.
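A rough sketch of that plumbing, with assumed signatures (the actual helpers in this repo will differ; the point is just that `max_model_len` has to travel from the caller down to the `LLM` constructor):

```python
from vllm import LLM, SamplingParams

def create_and_inference_with_vllm(prompts, model_name, max_model_len=None):
    # Build the engine with the optional context cap, then run generation.
    llm_kwargs = {"model": model_name}
    if max_model_len is not None:
        llm_kwargs["max_model_len"] = max_model_len
    llm = LLM(**llm_kwargs)
    return llm.generate(prompts, SamplingParams(max_tokens=128))

def subprocess_inference_with_vllm(prompts, model_name, max_model_len=None):
    # Ephemeral path: forward the cap so the spawned engine is configured
    # the same way as the in-process one.
    return create_and_inference_with_vllm(prompts, model_name, max_model_len=max_model_len)
```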