LMDeploy TurboMind engine supports the inference of 4bit quantized models that are quantized both by AWQ and GPTQ, but its quantization module only supports the AWQ quantization algorithm.
The following NVIDIA GPUs are available for AWQ/GPTQ INT4 inference:
- V100(sm70): V100
- Turing(sm75): 20 series, T4
- Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
- Ada Lovelace(sm89): 40 series
Before proceeding with the quantization and inference, please ensure that lmdeploy is installed by following the installation guide
The remainder of this article is structured into the following sections:
A single command execution is all it takes to quantize the model. The resulting quantized weights are then stored in the $WORK_DIR directory.
export HF_MODEL=internlm/internlm2_5-7b-chat
export WORK_DIR=internlm/internlm2_5-7b-chat-4bit
lmdeploy lite auto_awq \
$HF_MODEL \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 2048 \
--w-bits 4 \
--w-group-size 128 \
--batch-size 1 \
--work-dir $WORK_DIR
Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the internlm/internlm2_5-7b-chat model, the command can be condensed as:
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir internlm2_5-7b-chat-4bit
Note:
- We recommend that you specify the --work-dir parameter, including the model name as demonstrated in the example above. This facilitates LMDeploy in fuzzy matching the --work-dir with an appropriate built-in chat template. Otherwise, you will have to designate the chat template during inference.
- If the quantized model’s accuracy is compromised, it is recommended to enable --search-scale for re-quantization and increase the --batch-size, for example, to 8. When search_scale is enabled, the quantization process will take more time. The --batch-size affects the amount of memory used, which can be adjusted according to actual conditions as needed.
Upon completing quantization, you can engage with the model efficiently using a variety of handy tools. For example, you can initiate a conversation with it via the command line:
lmdeploy chat ./internlm2_5-7b-chat-4bit --model-format awq
Alternatively, you can start the gradio server and interact with the model through the web at http://{ip_addr}:{port
lmdeploy serve gradio ./internlm2_5-7b-chat-4bit --server_name {ip_addr} --server_port {port} --model-format awq
Please refer to OpenCompass about model evaluation with LMDeploy. Here is the guide
Trying the following codes, you can perform the batched offline inference with the quantized model:
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("./internlm2_5-7b-chat-4bit", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
For more information about the pipeline parameters, please refer to here.
In addition to performing inference with the quantized model on localhost, LMDeploy can also execute inference for the 4bit quantized model derived from AWQ algorithm available on Huggingface Hub, such as models from the lmdeploy space and TheBloke space
# inference with models from lmdeploy space
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("lmdeploy/llama2-chat-70b-4bit",
backend_config=TurbomindEngineConfig(model_format='awq', tp=4))
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
# inference with models from thebloke space
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
pipe = pipeline("TheBloke/LLaMA2-13B-Tiefighter-AWQ",
backend_config=TurbomindEngineConfig(model_format='awq'),
chat_template_config=ChatTemplateConfig(model_name='llama2')
)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
LMDeploy's api_server
enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below are an example of service startup:
lmdeploy serve api_server ./internlm2_5-7b-chat-4bit --backend turbomind --model-format awq
The default port of api_server
is 23333
. After the server is launched, you can communicate with server on terminal through api_client
:
lmdeploy serve api_client http://0.0.0.0:23333
You can overview and try out api_server
APIs online by swagger UI at http://0.0.0.0:23333
, or you can also read the API specification from here.
We benchmarked the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization on NVIDIA GeForce RTX 4090 using profile_generation.py. And we measure the token generation throughput (tokens/s) by setting a single prompt token and generating 512 tokens. All the results are measured for single batch inference.
model | llm-awq | mlc-llm | turbomind |
---|---|---|---|
Llama-2-7B-chat | 112.9 | 159.4 | 206.4 |
Llama-2-13B-chat | N/A | 90.7 | 115.8 |