
# llama2qt.c

A clean C implementation for quantizing a Llama 2 model and running the quantized model.

The code builds on llama2.c (Inference Llama 2 in one file of pure C) by Andrej Karpathy, with modifications mainly for quantizing the model and running the quantized model.

Simple instructions:

8-bit quantization, grouped per layer (no block structure):

```
gcc -O3 -o quantize quantize_8bit.c -lm
./quantize {model_name}.bin
```
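For context, per-layer (per-tensor) symmetric 8-bit quantization stores one float scale per weight tensor and maps each weight into the int8 range [-127, 127]. A minimal sketch of the idea (function and variable names here are illustrative, not the repo's actual API):

```c
#include <math.h>
#include <stdint.h>

/* Quantize one weight tensor to int8 with a single per-tensor scale,
 * so that w[i] ~= q[i] * scale. Symmetric scheme: zero-point is 0. */
void quantize_tensor(const float *w, int n, int8_t *q, float *scale) {
    float maxabs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > maxabs) maxabs = a;
    }
    *scale = maxabs / 127.0f;
    float inv = (*scale > 0.0f) ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < n; i++) {
        /* round-to-nearest; |w[i] * inv| <= 127 by construction */
        q[i] = (int8_t) roundf(w[i] * inv);
    }
}
```

With a single scale per tensor, one outlier weight inflates the scale for the whole layer; that is the motivation for the block-wise variant below.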

Run inference on the 8-bit quantized model:

```
gcc -O3 -march=native runq.c -o runq -lm
./runq llama2_7b_8bit.bin -t {temperature} -p {top_p} -n {max_token} -i "{prompt}"
```
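At inference time, a quantized matmul can dequantize weights on the fly: multiply the int8 weights against the activations and fold the per-tensor scale in once per output row. A simplified sketch, assuming float activations and the illustrative layout from the previous snippet (not the repo's actual kernel):

```c
#include <stdint.h>

/* out = W @ x, where W is a (d x n) int8 matrix with one float scale
 * and x is a float vector of length n. */
void matmul_q8(float *out, const float *x, const int8_t *qw,
               float scale, int n, int d) {
    for (int i = 0; i < d; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++) {
            /* dequantize each weight on the fly; scale is factored out */
            acc += (float) qw[i * n + j] * x[j];
        }
        out[i] = acc * scale;  /* apply the scale once per output row */
    }
}
```

Factoring the scale out of the inner loop keeps the hot loop to a cast, a multiply, and an add, which the `-march=native` build can vectorize.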

8-bit quantization, grouped in 64 × 64 blocks:

```
gcc -O3 -o quantize quantize_8bit_64block.c -lm
```
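Block-wise grouping keeps a separate scale for every 64 × 64 tile of a weight matrix, so an outlier only degrades precision inside its own block rather than the whole layer. A sketch of the idea, assuming matrix dimensions divisible by 64 (names are illustrative, not the repo's actual code):

```c
#include <math.h>
#include <stdint.h>

#define BLOCK 64

/* Quantize a (rows x cols) matrix in 64x64 blocks, one scale per block.
 * scales must hold (rows/BLOCK) * (cols/BLOCK) entries. */
void quantize_blocks(const float *w, int rows, int cols,
                     int8_t *q, float *scales) {
    for (int br = 0; br < rows / BLOCK; br++) {
        for (int bc = 0; bc < cols / BLOCK; bc++) {
            /* pass 1: max |w| within this 64x64 block */
            float maxabs = 0.0f;
            for (int r = 0; r < BLOCK; r++)
                for (int c = 0; c < BLOCK; c++) {
                    float a = fabsf(w[(br*BLOCK + r)*cols + bc*BLOCK + c]);
                    if (a > maxabs) maxabs = a;
                }
            float s = maxabs / 127.0f;
            scales[br * (cols / BLOCK) + bc] = s;
            float inv = (s > 0.0f) ? 1.0f / s : 0.0f;
            /* pass 2: quantize the block with its own scale */
            for (int r = 0; r < BLOCK; r++)
                for (int c = 0; c < BLOCK; c++) {
                    int idx = (br*BLOCK + r)*cols + bc*BLOCK + c;
                    q[idx] = (int8_t) roundf(w[idx] * inv);
                }
        }
    }
}
```

The trade-off is extra storage and bookkeeping for the per-block scales in exchange for lower quantization error within each tile.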

For a quick test, use the Google Colab notebook (the "Open In Colab" badge in the repository).

More details can be found in the README.md.