Add support for H2O cache eviction with LLaMA #35381

justincharney · 2024-12-21T04:38:58Z

What does this PR do?

This PR implements the Heavy-Hitter Oracle (H2O) cache eviction strategy in Hugging Face Transformers. H2O keeps the KV cache at a fixed size by retaining two groups of KV pairs, the most recent tokens and the tokens that contribute most to the cumulative attention scores, and evicting the rest. The implementation identifies and preserves these “heavy hitter” tokens during inference, maintaining generation quality while substantially reducing memory requirements.

Key features:

Dynamic tracking of token importance through attention scores
Configurable ratio between recent and heavy-hitter sections
LLaMA support, added via a "post_processing" step applied to the KV cache in LlamaAttention
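
To make the recent/heavy-hitter split concrete, here is a minimal sketch of how an eviction step could choose which cache positions to keep. It is not the code in this PR: the function name `h2o_keep_indices` and the parameters `cache_size` and `recent_ratio` are hypothetical, and it works on a plain list of per-position scores rather than on tensors.

```python
def h2o_keep_indices(scores, cache_size, recent_ratio=0.5):
    """Pick which cached positions survive an H2O-style eviction.

    scores: cumulative attention score per cached position, oldest first.
    cache_size: total number of positions to keep (hypothetical parameter).
    recent_ratio: fraction of the budget reserved for the newest positions
                  (hypothetical parameter).
    """
    n = len(scores)
    if n <= cache_size:
        # Still under budget: nothing to evict.
        return list(range(n))

    num_recent = int(cache_size * recent_ratio)
    num_heavy = cache_size - num_recent

    # The newest `num_recent` positions are always kept.
    recent_start = n - num_recent
    recent = list(range(recent_start, n))

    # Among the older positions, keep the ones with the highest
    # cumulative attention score (the "heavy hitters").
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: scores[i], reverse=True)[:num_heavy]

    # Return positions in their original order so the cache stays ordered.
    return sorted(heavy) + recent
```

With six cached positions, a budget of four, and an even split, the two highest-scoring older positions and the two newest positions are kept; everything else would be evicted.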

Fixes #30758

Before submitting

Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Yes, discussed in issue "Implement kv cache sparsity like H2O with attention score" #30758
[N/A] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Added informative docstrings to public methods following guidelines
Did you write any new necessary tests?
Added a test that validates H2OCache with LLaMA model generation: it checks that the model produces non-empty text responses while using the cache, using TinyLlama-1.1B-Chat in 8-bit quantization on CUDA devices.

Who can review?

@gante, as this relates to generation functionality and touches core caching infrastructure.

Outline of code

Files modified:

src/transformers/cache_utils.py: Added a new class H2OCache
src/transformers/models/llama/modeling_llama.py: Added post processing function to track attention weights to identify heavy hitters
benchmark/h20: Added benchmarking scripts to compare H2OCache performance with DynamicCache
tests/h2O: Added tests to run an LLM with H2OCache
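
As a rough illustration of what tracking attention weights to identify heavy hitters can look like, the sketch below accumulates, for one attention head, the attention each cached key position has received so far. It is an assumption-laden simplification rather than the post-processing function added in modeling_llama.py: the name `accumulate_scores` is hypothetical, and nested lists stand in for tensors.

```python
def accumulate_scores(running, attn_weights):
    """Add one step's attention weights to the running per-key totals.

    running: cumulative score per key position seen before this step.
    attn_weights: rows of softmax weights for one head, shaped
                  [num_queries][num_keys], with num_keys >= len(running).
    Returns a new list of length num_keys.
    """
    num_keys = len(attn_weights[0])

    # Newly appended key positions start with a score of zero.
    updated = list(running) + [0.0] * (num_keys - len(running))

    # Each query row contributes its weight to every key it attended to.
    for row in attn_weights:
        for key_pos, weight in enumerate(row):
            updated[key_pos] += weight
    return updated
```

Positions whose running total stays high across decoding steps are the candidates to retain as heavy hitters when the cache exceeds its budget.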

Executing code

To test an LLM with the H2O cache mechanism, run the following:

python -m pytest -n auto --dist=loadfile -s -v ./tests/h2O/test_h2O.py

Results

We find that H2O reduces KV cache size by over 80% while lowering throughput by less than 5%. This compares favorably with QuantizedCache, which introduces substantially higher overhead for similar memory savings.
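
For intuition about where a saving of this size comes from, the following back-of-the-envelope sketch computes KV cache memory for a full cache and for a fixed token budget. The helper names and all numbers in the usage note are illustrative assumptions, not measurements from this PR's benchmarks.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * num_tokens * bytes_per_value)


def kv_cache_reduction(full_tokens, budget_tokens):
    """Fraction of KV cache memory saved by capping the cache at a budget."""
    return 1.0 - budget_tokens / full_tokens
```

As a purely hypothetical example, a model with 22 layers, 4 KV heads, and a head dimension of 64 would need `kv_cache_bytes(22, 4, 64, 4096)` bytes for a 4096-token sequence, and capping the cache at 768 tokens gives `kv_cache_reduction(4096, 768)`, a saving of a little over 80%. Because the cache size is fixed, the saving grows as the sequence gets longer.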

Add support for H2O cache eviction with LLaMA

13637f2
