Hi team,
I have a question related to generating model responses from a GPTQ-quantized model.
I compressed Llama-2-7B with basic AutoGPTQ through transformers. Once the model is saved, I try to generate the answers with the following command:
python gen_model_answer.py --model-path directory/ --model-id llama-2-7b-gptq-4
But this throws the following error:
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/Compress_Align/FastChat/fastchat/llm_judge/gen_model_answer.py", line 103, in get_model_answers
model, tokenizer = load_model(
File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 379, in load_model
model, tokenizer = adapter.load_model(model_path, kwargs)
File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 124, in load_model
model = AutoModelForCausalLM.from_pretrained(
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3823, in from_pretrained
hf_quantizer.postprocess_model(model)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/base.py", line 195, in postprocess_model
return self._process_model_after_weight_loading(model, **kwargs)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/quantizer_gptq.py", line 80, in _process_model_after_weight_loading
model = self.optimum_quantizer.post_init_model(model)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/optimum/gptq/quantizer.py", line 595, in post_init_model
raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
I was able to pass device_map="cuda" in the load_model() function of the BaseModelAdapter class, but inference is extremely slow (I assume the model is loaded on the GPU while the computation runs on the CPU, which is not what I expected). This is the relevant call in BaseModelAdapter.load_model():
model = AutoModelForCausalLM.from_pretrained(
model_path,
low_cpu_mem_usage=True,
trust_remote_code=True,
# device_map='cuda', # TODO: Change made to support quantized model instead of disabling exllama!
**from_pretrained_kwargs,
)
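For reference, this is roughly what loading the same checkpoint directly with transformers, with every module on the GPU, would look like (a sketch; the fp16 dtype and the prompt are placeholders I added, nothing from the error above):

# Sketch: load the 4-bit GPTQ checkpoint from directory/ entirely on GPU,
# which is what the exllama backend in the ValueError above requires.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "directory/"  # same path as in the gen_model_answer.py command

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",          # keep all modules on GPU so the exllama kernels can run
    torch_dtype=torch.float16,  # assumption: fp16 activations
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))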
Is there a way to generate answers and evaluate the quantized models the same way as described in https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md? Or am I missing something fundamental?
Tagging relevant issues: disable_exllama=True in the quantization config object #2459. The main fix suggested there is to disable exllama, but that increases the inference time a lot!
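For completeness, the workaround from that issue would look roughly like this (a sketch; I'm assuming a transformers version whose GPTQConfig accepts use_exllama=False, while older versions use disable_exllama=True, as the error message above says):

# Sketch of the slower fallback: turn off the exllama kernels when loading.
from transformers import AutoModelForCausalLM, GPTQConfig

# use_exllama=False avoids the ValueError even if some modules end up on
# CPU/disk, at the cost of much slower generation.
quantization_config = GPTQConfig(bits=4, use_exllama=False)

model = AutoModelForCausalLM.from_pretrained(
    "directory/",
    quantization_config=quantization_config,
    device_map="auto",
)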