This repository provides the implementation for our paper "ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification". ZipCache is an adaptive mixed-precision KV cache quantization method for LLMs: it identifies salient tokens and keeps them at higher precision while quantizing the rest more aggressively.
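To convey the core idea, here is a minimal sketch of mixed-precision KV cache quantization. This is NOT the paper's actual implementation: the function names, the 4-bit default, the per-token quantization granularity, and the externally supplied `salient_idx` set are illustrative assumptions only (ZipCache itself identifies salient tokens via attention-based saliency metrics).

```python
# Illustrative sketch only, not ZipCache's real code: salient tokens are
# stored in full precision, all other tokens are quantized to low-bit
# integers with a uniform asymmetric quantizer.

def quantize(vec, bits=4):
    """Uniform asymmetric quantization of one token's KV vector."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0.0:  # constant vector: avoid division by zero
        scale = 1.0
    q = [round((v - lo) / scale) for v in vec]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct an approximate float vector from quantized values."""
    return [x * scale + lo for x in q]

def compress_kv(cache, salient_idx, bits=4):
    """Keep salient tokens in full precision; quantize the rest."""
    packed = []
    for i, vec in enumerate(cache):
        if i in salient_idx:
            packed.append(("fp", vec))  # salient token: stored as-is
        else:
            packed.append(("int", quantize(vec, bits)))
    return packed

def decompress_kv(packed):
    """Undo compress_kv, returning float vectors for every token."""
    return [payload if tag == "fp" else dequantize(*payload)
            for tag, payload in packed]
```

Salient tokens are reconstructed exactly, while quantized tokens incur at most half a quantization step of per-element error.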
Follow the steps below to set up ZipCache.
Create a virtual environment and install the dependencies listed in `requirements.txt`. Then install `flash-attn` and `zipcache`:

```shell
pip install packaging ninja
pip install flash-attn --no-build-isolation
pip install -e .
```
Download a pretrained LLaMA model from Hugging Face and set `MODEL_PATH` in `zipcache_generation_demo.py` to the local model directory, then run the demo:

```shell
python3 zipcache_generation_demo.py
```
If you find this work useful for your research, please consider citing:
```bibtex
@article{he2024zipcache,
  title={ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification},
  author={He, Yefei and Zhang, Luoming and Wu, Weijia and Liu, Jing and Zhou, Hong and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2405.14256},
  year={2024}
}
```