- Support for RWKV v5, v6, and (experimentally, WIP) v7 models.
- Inference of RWKV models using the QNN SDK, with Qualcomm CPU, GPU, or HTP (Hexagon Tensor Processor) as the backend.
- Support for whole-model float16 inference (since Qualcomm HTP cannot do float32 math).
- Support for activation INT16 and weight INT8 quantized inference (with some key operations kept in float16).
- Support for activation INT16 and mixed INT4/INT8 weight quantized inference (see the sketch below).
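For intuition, here is a minimal sketch of the affine quantization arithmetic behind these modes, in plain NumPy. It is illustrative only, not QNN's implementation: weights are mapped to an INT8 grid and activations to an INT16 grid, each with a scale and offset.

```python
# Minimal affine quantization sketch (illustrative only, not QNN's code).
import numpy as np

def quantize(x, bits):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    offset = qmin - round(float(x.min()) / scale)
    q = np.clip(np.round(x / scale) + offset, qmin, qmax)
    return q, scale, offset

w = np.random.randn(256, 256).astype(np.float32)  # a weight matrix
a = np.random.randn(256).astype(np.float32)       # an activation vector
qw, ws, wo = quantize(w, bits=8)   # weights -> INT8 grid
qa, sc, ao = quantize(a, bits=16)  # activations -> INT16 grid
w_hat = (qw - wo) * ws             # dequantize to inspect the error
print("max weight quantization error:", np.abs(w - w_hat).max())
```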
- Download and install the QNN SDK from the Qualcomm Developer Network.
- Set up the QNN SDK environment by following the instructions in Qualcomm's documentation.
- Set the `QNN_SDK_ROOT` environment variable to point to the QNN SDK installation directory (by default `/opt/qcom/aistack/qnn/{version}`).
- (Optional) Install the AIMET toolkit for AIMET quantization methods: https://quic.github.io/aimet-pages/releases/latest/install/index.html#quick-install
- This project has been verified with:
- QNN SDK 2.26.0
- python==3.10 (as recommended by the QNN SDK documentation)
- onnx==1.16.1
- protobuf==3.20.2 (required for QNN's onnx-converter and onnx==1.16.1 to work together properly)
- torch==2.2.2 (the QNN SDK is validated against torch==1.13.0, but a newer torch is fine here since torch is only used for model conversion and ONNX export; 2.2.2 is the version recommended by the AIMET toolkit)
- Hardware: Qualcomm Snapdragon SM8650 with HTP v75 (Xiaomi Mi 14)
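A quick way to confirm that an environment matches the versions above (the expected values in the comments are just the ones verified here):

```python
# Print the versions of the packages pinned above.
import onnx
import torch
import google.protobuf as protobuf

print("onnx     :", onnx.__version__)      # verified with 1.16.1
print("protobuf :", protobuf.__version__)  # verified with 3.20.2
print("torch    :", torch.__version__)     # verified with 2.2.2
```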
Converting an fp16 model:
- Usage: `convert_model.py [-h] [--chunks CHUNKS] [--use_qnn_quant] [--act_bitwidth ACT_BITWIDTH] [--weights_bitwidth WEIGHTS_BITWIDTH] [--ext_embedding] model`
- Convert the model:
```
python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 4
```
Converting an A16W8 model:
- Usage of `make_calibration_samples.py`: `make_calibration_samples.py [-h] [--ext_embedding] model output chunks`
- Make calibration samples (see the sketch after this list for what calibration computes):
```
python make_calibration_samples.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth ./samples_1b6 2
```
- Convert the model file:
```
python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 2 --use_qnn_quant --calib_data_path ./samples_1b6
```
- `act_bitwidth` and `weights_bitwidth` default to 16 and 8, respectively.
- Note: keep the `chunks` parameter the same in both scripts.
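Conceptually, calibration runs representative inputs through the model, records per-tensor activation ranges, and derives the INT16 scales and offsets from them. A sketch of that idea (not the QNN converter's actual code):

```python
# Conceptual range calibration (illustrative only).
import numpy as np

def calibrate_range(activations, bits=16):
    lo = min(float(a.min()) for a in activations)
    hi = max(float(a.max()) for a in activations)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (hi - lo) / (qmax - qmin)
    offset = qmin - round(lo / scale)
    return scale, offset

# stand-in for activations recorded while running the calibration samples
observed = [np.random.randn(1024).astype(np.float32) for _ in range(16)]
print(calibrate_range(observed))
```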
Converting an A16W4 model with AIMET quantization:
- Usage of `make_calibration_samples.py`: `make_calibration_samples.py [-h] [--ext_embedding] model output chunks`
- Make calibration samples:
```
python make_calibration_samples.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth ./samples_1b6 2
```
- Convert the model file, passing the quantization encodings via `--linear_param_encodings`:
```
python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 2 --use_qnn_quant --calib_data_path ./samples_1b6 --linear_param_encodings quant_encodings/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_mse_rwkv_gptq_exceptions_asym_torch_w4.encodings
```
  (The quantization encodings are either the pre-calculated ones (GDrive) or generated using AIMET; refer to AIMET_quant.md.)
- Some large Linear modules are quantized to 4-bit weights, while others are kept at 8-bit for better accuracy (see the sketch after this list).
- Note: keep the `chunks` parameter the same in both scripts.
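As an illustration of that mixed-precision split, here is a hypothetical size-based policy. It is not the project's actual per-layer selection, which comes from the encodings file; the `threshold` value is an arbitrary assumption:

```python
# Hypothetical policy: big Linear layers get 4-bit weights, small ones 8-bit.
import torch.nn as nn

def pick_weight_bitwidth(linear: nn.Linear, threshold: int = 1 << 23) -> int:
    return 4 if linear.in_features * linear.out_features >= threshold else 8

print(pick_weight_bitwidth(nn.Linear(2048, 8192)))  # large FFN matrix -> 4
print(pick_weight_bitwidth(nn.Linear(2048, 2048)))  # smaller matrix  -> 8
```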
The outputs will be in the `lib/` directory. The model library contains the weights, as well as the functions to prepare the graph. It can either be called on device using the libraries in `lib/aarch64-android/`, or be prepared on an x86 host machine using `lib/x86_64-linux-clang/` to generate an HTP context cache. Qualcomm HTP limits the size of a model library file, so the model is split into multiple chunks.
Generating the HTP context cache:
- Usage: `make_context_cache_binary.py [-h] model_lib output_path {SM8650,SM8550,SC8380}`
- Example:
```
$ python make_context_cache_binary.py ./lib/x86_64-linux-clang/libRWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.so output/ SM8650
```
- The script automatically processes all of the chunks together; see the naming sketch after this list.
- The outputs will be `output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin` and `output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin`.
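The `chunkXofY` suffix is what ties the pieces together. A small sketch of how the sibling chunk names can be derived from the first one (illustrative; the demo's own loading logic may differ):

```python
# Derive all chunk file names from the first chunk's "chunkXofY" suffix.
import re

def sibling_chunks(path: str) -> list[str]:
    m = re.search(r"chunk(\d+)of(\d+)", path)
    if m is None:
        return [path]  # unchunked model
    total = int(m.group(2))
    return [re.sub(r"chunk\d+of", f"chunk{i}of", path) for i in range(1, total + 1)]

print(sibling_chunks("output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin"))
```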
- Build the demo code:
```
make -C librwkv-qualcomm
```
- Push the binary and the HTP context cache to the device:
```
adb push librwkv-qualcomm/obj/local/arm64-v8a/rwkv-qualcomm-demo /data/local/tmp/ && adb push output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin /data/local/tmp/ && adb push output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin /data/local/tmp/
```
- Push the tokenizer model to the device:
```
adb push assets/brwkv_vocab_v20230424.txt /data/local/tmp/
```
- Push these QNN libs to `/data/local/tmp/` (replace the HTP v75 files with the version matching your device; a scripted alternative follows this list):
```
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnHtpNetRunExtensions.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnSystem.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnHtpV75Stub.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so
```
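If you prefer scripting this step, a small helper along these lines works (assumes `adb` is on your PATH; adjust the SDK path, version, and HTP architecture to your setup):

```python
# Push the QNN runtime libraries listed above to the device.
import subprocess

SDK = "/opt/qcom/aistack/qairt/2.22.6.240515"
LIBS = [
    f"{SDK}/lib/aarch64-android/libQnnHtpNetRunExtensions.so",
    f"{SDK}/lib/aarch64-android/libQnnSystem.so",
    f"{SDK}/lib/aarch64-android/libQnnHtpV75Stub.so",
    f"{SDK}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so",
]
for lib in LIBS:
    subprocess.run(["adb", "push", lib, "/data/local/tmp/"], check=True)
```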
- If using external embedding, also push `onnx/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.emb` to `/data/local/tmp/rwkv/`.
- Finally, run the demo code:
```
adb shell
$ cd /data/local/tmp
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/local/tmp
$ # Specify the path to the first model chunk. The second chunk will be loaded automatically.
$ ./rwkv-qualcomm-demo brwkv_vocab_v20230424.txt RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
```
Example output (RWKV v6 1B6, A16W4):
```
130|houji:/data/local/tmp/rwkv $ ./rwkv-qualcomm-demo b_rwkv_vocab_v20230424.txt RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Loading model context binary from RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Reading chunk: RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Buffer size: 719802320
Reading chunk: RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin
Buffer size: 586727640
User: 请为我写一首诗。
Assistant: 当然,请告诉我你喜欢什么类型的诗歌。
User: 请写一首描写秋天景色的诗。
Assistant: 秋意渐浓,寒意渐深,
大地已是金黄如火,
落英纷飞,树影绰约,
人心也随之变得清静。
夜空中的繁星在闪闪,
思念似要被所有握住,
但又像是永不消散的孤注,
在这个秋天里如此特别。
请问这首诗符合您需求吗?
Average time per token: 0.0235644s
Average tokens per second: 42.4368
```
(In the transcript, the user asks for a poem describing autumn scenery; the model replies in Chinese with a short autumn poem, then asks whether it meets the request.)
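The two averages in the transcript are consistent with each other, being reciprocals:

```python
print(1 / 0.0235644)  # ~42.4369 tokens per second, matching the reported 42.4368
```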
Running on the Qualcomm Snapdragon SM8650 with HTP v75 (Xiaomi Mi 14):

| Model | Precision | Generation Tokens per second | LAMBADA ppl, acc |
| --- | --- | --- | --- |
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 42.4368 | 5.09183, 65.4182% |
| RWKV v6 1.6B | a16w8 | 31.6564 | 4.75009, 66.3497% |
| RWKV v6 1.6B | fp16 | 15.0434 | 4.63598, 67.2618% |
| RWKV v6 3B | att-a16w8 + ffn-a16w4 | 21.3172 | 4.46606, 68.8725% |
| RWKV v6 3B | a16w8 | 16.2146 | 3.9039, 71.3647% |
(QNN's INT4 quantization is currently naive per-channel linear quantization; combined with the INT16 activation quantization, perplexity is somewhat worse than for the INT8 models. LAMBADA accuracy is lower too, but still acceptable.)
(Experimental) Running with a custom WKV kernel:

| Model | Precision | Generation Tokens per second | LAMBADA ppl, acc |
| --- | --- | --- | --- |
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 47.6698 | 5.09183, 65.4182% |
| RWKV v6 7B | a16w4 | 12.9782 | TODO |
| Model | Precision | Generation Tokens per second | LAMBADA ppl, acc |
| --- | --- | --- | --- |
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 32.6703 | 4.65837, 66.7378% |
| RWKV v6 1.6B | a16w8 | 26.0707 | 4.59243, 67.3006% |
| RWKV v6 1.6B | fp16 | 15.0434 | 4.63598, 67.2618% |
| RWKV v6 3B | att-a16w8 + ffn-a16w4 | 17.3968 | 4.46606, 68.8725% |
- RWKV-5-World-0.4B-v2-20231113-ctx4096, fp16: average 50.7313 tokens per second
- RWKV-5-ABC-82M-v1-20230901-ctx1024, fp16: average 142.286 tokens per second
- Add demo code for running inference on the device.
- Add support for A16W8 quantized inference.
- Add support for A16W4 quantized inference with AIMET quantization.
- Add document for running on Snapdragon X Elite laptops.
- Sequential prefilling on device.
- Package a library for easy use and integration.