Hi team,
I have a question related to generating model responses from a GPTQ-quantized model.
I compressed Llama-2-7B with basic AutoGPTQ through transformers. Once the model is saved, I try to generate the answers with the following command:
python gen_model_answer.py --model-path directory/ --model-id llama-2-7b-gptq-4
But this throws the following error:
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/Compress_Align/FastChat/fastchat/llm_judge/gen_model_answer.py", line 103, in get_model_answers
model, tokenizer = load_model(
File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 379, in load_model
model, tokenizer = adapter.load_model(model_path, kwargs)
File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 124, in load_model
model = AutoModelForCausalLM.from_pretrained(
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3823, in from_pretrained
hf_quantizer.postprocess_model(model)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/base.py", line 195, in postprocess_model
return self._process_model_after_weight_loading(model, **kwargs)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/quantizer_gptq.py", line 80, in _process_model_after_weight_loading
model = self.optimum_quantizer.post_init_model(model)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/optimum/gptq/quantizer.py", line 595, in post_init_model
raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
I was able to pass device_map="cuda" in the load_model() function of the BaseModelAdapter class, but inference is extremely slow (I assume the model is loaded on the GPU while the computation runs on the CPU, which is not what I expected). This is the relevant call in BaseModelAdapter.load_model():
model = AutoModelForCausalLM.from_pretrained(
model_path,
low_cpu_mem_usage=True,
trust_remote_code=True,
# device_map='cuda', # TODO: Change made to support quantized model instead of disabling exllama!
**from_pretrained_kwargs,
)
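For reference, this is roughly what loading the same checkpoint directly with transformers, with every module on the GPU, would look like (a sketch; the fp16 dtype and the prompt are placeholders I added, nothing from the error above):

# Sketch: load the 4-bit GPTQ checkpoint from directory/ entirely on GPU,
# which is what the exllama backend in the ValueError above requires.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "directory/"  # same path as in the gen_model_answer.py command

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",          # keep all modules on GPU so the exllama kernels can run
    torch_dtype=torch.float16,  # assumption: fp16 activations
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))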
Is there a way to generate answers and evaluate the quantized models the same way as described in https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md? Or am I missing something fundamental?
Tagging relevant issues: disable_exllama=True in the quantization config object #2459. The main fix suggested there is to disable exllama, but that increases the inference time a lot!
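For completeness, the workaround from that issue would look roughly like this (a sketch; I'm assuming a transformers version whose GPTQConfig accepts use_exllama=False, while older versions use disable_exllama=True, as the error message above says):

# Sketch of the slower fallback: turn off the exllama kernels when loading.
from transformers import AutoModelForCausalLM, GPTQConfig

# use_exllama=False avoids the ValueError even if some modules end up on
# CPU/disk, at the cost of much slower generation.
quantization_config = GPTQConfig(bits=4, use_exllama=False)

model = AutoModelForCausalLM.from_pretrained(
    "directory/",
    quantization_config=quantization_config,
    device_map="auto",
)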