Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
This is the official repository for Smart Parallel Auto-Correct Decoding (SPACE), an approach that accelerates the inference of large language models (LLMs) by combining semi-autoregressive generation with draft-then-verify decoding.
Fig. 1: A visual comparison between conventional autoregressive (AR) inference (left) and SPACE inference (right). In AR inference, tokens are generated sequentially, one per decoding step. In SPACE inference, the input token sequence (i.e., "LLMs are") is augmented with k+1 groups of mask tokens and k candidate tokens (i.e., "auto" and "model"). The candidate tokens are verified to obtain the accepted tokens (i.e., "auto" and "regressive"), and k new candidate tokens (i.e., "model" and "<s>") are produced from one of the mask groups within a single model invocation. SPACE therefore generates a variable number of tokens per step, ranging from 1 to k+1.
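To make the figure concrete, below is a minimal Python sketch of the verification part of one SPACE step under greedy decoding. It is illustrative only: the names (`verify_step`, `predict`) are placeholders, and the actual implementation additionally appends the k+1 groups of mask tokens so that the same forward pass also yields the next step's k candidates.

```python
# Minimal sketch of SPACE's draft verification (greedy case). This is NOT the
# repository's implementation: the real method also appends k+1 groups of mask
# tokens and reads the next step's candidates from one of those groups.
from typing import Callable, List


def verify_step(
    prefix: List[int],
    candidates: List[int],                      # k draft tokens from the previous step
    predict: Callable[[List[int]], List[int]],  # one model call: preds[j] is the greedy
                                                # next token given tokens[:j+1]
) -> List[int]:
    preds = predict(prefix + candidates)        # single forward pass over prefix + drafts
    p = len(prefix)
    accepted: List[int] = []
    for i, cand in enumerate(candidates):
        if cand == preds[p + i - 1]:            # draft agrees with the model's own prediction
            accepted.append(cand)
        else:
            break
    # The model's prediction right after the last accepted token is always kept,
    # so every step emits between 1 and k+1 tokens.
    accepted.append(preds[p + len(accepted) - 1])
    return accepted


if __name__ == "__main__":
    # Toy "model" that always predicts previous_token + 1, just to exercise the logic.
    toy = lambda toks: [t + 1 for t in toks]
    print(verify_step([1, 2, 3], [4, 9], toy))  # first draft accepted, second rejected -> [4, 5]
```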
Paper: Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao
- 🔥🔥🔥 News: [2024/5/16] SPACE was accepted to ACL 2024 Findings!
Install Dependencies
pip install -r requirements.txt
We use LLaMA-2-7B as the base model for SPACE training in this example.
Download the LLaMA-2-7B checkpoint and set model_name_or_path
in run_sft_multi_node.sh to its location. Then run the following command to start training on a single machine with 8 GPUs.
bash run_sft_multi_node.sh
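As an illustration, the edit in run_sft_multi_node.sh might look like the following; the checkpoint path is a placeholder, and the exact form depends on how the script passes its arguments.

```bash
# Placeholder edit in run_sft_multi_node.sh: point model_name_or_path at the
# downloaded LLaMA-2-7B checkpoint (the path below is an example, not a real location).
model_name_or_path=/path/to/Llama-2-7b-hf
```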
For evaluation, set llm_dir
in run_eval.sh to the training output directory and run:
bash run_eval.sh
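Analogously, the edit in run_eval.sh might look like this (the output path is a placeholder):

```bash
# Placeholder edit in run_eval.sh: point llm_dir at the SFT output directory.
llm_dir=/path/to/space_llama2_7b_output
```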
[2024/5/19] We have released a Vicuna-7B model trained with SPACE. Please download the checkpoint from Hugging Face and run the following for evaluation.
python tests/eval_infer.py --llm_dir=path/to/model --mask_id=32002 --dataset="human_eval" --mask_num=5 --do_sample=false --use_cache=true --model_type=llama --mask_diff=false
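For reference, here is the same command with the flags annotated; the interpretations are inferred from the flag names and the paper, so check tests/eval_infer.py for the exact semantics.

```bash
# Annotated invocation (flag meanings are our reading, not official documentation):
#   --llm_dir    directory of the SPACE-trained Vicuna-7B checkpoint
#   --mask_id    token id of the mask token added to the extended vocabulary (assumption)
#   --mask_num   number of mask/candidate tokens appended per decoding step (assumption)
#   --do_sample  false disables sampling, i.e. greedy decoding during verification
#   --use_cache  true reuses the KV cache across decoding steps
#   --mask_diff  left false as in the example above (semantics not documented here)
python tests/eval_infer.py --llm_dir=path/to/model --mask_id=32002 --dataset="human_eval" \
  --mask_num=5 --do_sample=false --use_cache=true --model_type=llama --mask_diff=false
```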
This repository is licensed under the Apache-2.0 License.
If you find this work helpful, please cite it as:
@article{yi2024generation,
title={Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding},
author={Yi, Hanling and Lin, Feng and Li, Hongbin and Ning, Peiyang and Yu, Xiaotian and Xiao, Rong},
journal={arXiv preprint arXiv:2402.11809},
year={2024}
}
This repo benefits from LLaMA Factory and FastChat. Thanks for their wonderful work.