[NeurIPS 2024] The official implementation of ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

ThisisBillhe/ZipCache

ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

This repository provides the implementation of our paper "ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification". Our approach introduces an adaptive mixed-precision quantization method for the KV cache of LLMs.
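
To make the idea of mixed-precision KV cache quantization concrete, here is a minimal, illustrative sketch in plain Python. It is not the ZipCache implementation: the function names, the 4-bit/2-bit split, and the `salient_ratio` parameter are all assumptions chosen for illustration, and the saliency scores are simply passed in rather than computed the way the paper does. It only shows the general pattern of giving more bits to tokens marked as salient.

```python
from typing import List, Sequence, Tuple


def quantize_dequantize(values: Sequence[float], bits: int) -> List[float]:
    """Uniform asymmetric quantization to `bits` bits, then dequantization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [float(lo)] * len(values)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    # Round each value to its nearest integer code, then map the code back to a float.
    return [lo + round((v - lo) / scale) * scale for v in values]


def assign_bits(saliency: Sequence[float], salient_ratio: float,
                high_bits: int = 4, low_bits: int = 2) -> List[int]:
    """Give the most salient tokens `high_bits`; all others get `low_bits`.

    The bit widths and the ratio are hypothetical defaults, not values from the paper.
    """
    num_salient = max(1, round(len(saliency) * salient_ratio))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    salient = set(ranked[:num_salient])
    return [high_bits if i in salient else low_bits for i in range(len(saliency))]


def quantize_cache(tokens: Sequence[Sequence[float]], saliency: Sequence[float],
                   salient_ratio: float = 0.5) -> Tuple[List[List[float]], List[int]]:
    """Quantize each token vector with the bit width chosen from its saliency score."""
    bits = assign_bits(saliency, salient_ratio)
    restored = [quantize_dequantize(tok, b) for tok, b in zip(tokens, bits)]
    return restored, bits


if __name__ == "__main__":
    # Four toy "token" vectors and made-up saliency scores.
    tokens = [[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7] for _ in range(4)]
    restored, bits = quantize_cache(tokens, saliency=[0.1, 0.9, 0.3, 0.7])
    print("bits per token:", bits)
```

Tokens that receive the higher bit width are reconstructed with a smaller rounding step, which is the trade-off a mixed-precision scheme exploits. For the actual saliency metric and quantization granularity, refer to the paper and the code in this repository.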

arXiv | BibTeX

Getting Started

Follow the step-by-step tutorial to set up ZipCache.

Step 1: Setup

Create a virtual environment and install the dependencies listed in requirements.txt. Then install flash_attn and zipcache as follows:

pip install packaging ninja
pip install flash-attn --no-build-isolation
pip install -e .
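
After installation, a quick sanity check can confirm that the packages are importable before running the demo. The snippet below is a small sketch using only the standard library. `missing_modules` is a hypothetical helper, and `zipcache` is assumed to be the import name installed by `pip install -e .` (`flash_attn` is the import name of the flash-attn package).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "zipcache" is an assumed import name; adjust it if the package differs.
    missing = missing_modules(["flash_attn", "zipcache"])
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If any module is reported as missing, re-run the corresponding install command inside the same virtual environment.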

Step 2: Download Pretrained Models

Download a pretrained LLaMA model from Hugging Face and set MODEL_PATH in zipcache_generation_demo.py to point to it.
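
Before editing MODEL_PATH, it can help to verify that the download is in place. The sketch below assumes MODEL_PATH refers to a local directory in the Hugging Face Transformers checkpoint layout, which includes a `config.json` file. `check_model_dir` is a hypothetical helper and is not part of this repository.

```python
from pathlib import Path


def check_model_dir(model_path: str) -> Path:
    """Return the model directory as a Path, or raise if it does not look usable."""
    path = Path(model_path).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"Model directory not found: {path}")
    # Transformers-format checkpoints ship a config.json next to the weights.
    if not (path / "config.json").is_file():
        raise FileNotFoundError(f"No config.json in {path}; is the download complete?")
    return path
```

Calling `check_model_dir` on the directory you intend to use for MODEL_PATH fails early with a readable message instead of an error partway through model loading.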

Step 3: Inference with ZipCache

python3 zipcache_generation_demo.py

BibTeX

If you find this work useful for your research, please consider citing:

@article{he2024zipcache,
  title={ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification},
  author={He, Yefei and Zhang, Luoming and Wu, Weijia and Liu, Jing and Zhou, Hong and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2405.14256},
  year={2024}
}
