
Converting gguf fp16 & bf16 to hf is not supported. #31762

Closed
2 of 4 tasks
PenutChen opened this issue Jul 3, 2024 · 10 comments · Fixed by #31783
Comments

PenutChen (Contributor) commented Jul 3, 2024

System Info

transformers==4.42.3
torch==2.3.0
numpy==1.26.4
gguf==0.6.0

Who can help?

@SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import os
from transformers import AutoModelForCausalLM

gguf_path = "path/to/llama3-8b.fp16.gguf"  # or bf16

# model_id points at the directory containing the GGUF file; the file name is passed via gguf_file
model_id = os.path.dirname(gguf_path)
gguf_file = os.path.basename(gguf_path)

model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)

Expected behavior

Besides the quantized types, only F32 is implemented; FP16 and BF16 are not yet supported.

fp16 error log:

Converting and de-quantizing GGUF tensors...:   0%|                         | 0/291 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data2/Penut/LLM-Backend/Testing.py", line 9, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3583, in from_pretrained
    state_dict = load_gguf_checkpoint(gguf_path, return_tensors=True)["tensors"]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 146, in load_gguf_checkpoint
    weights = load_dequant_gguf_tensor(shape=shape, ggml_type=tensor.tensor_type, data=tensor.data)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/integrations/ggml.py", line 507, in load_dequant_gguf_tensor
    raise NotImplementedError(
NotImplementedError: ggml_type 1 not implemented - please raise an issue on huggingface transformers: https://github.com/huggingface/transformers/issues/new/choose

bf16 error log:

Traceback (most recent call last):
  File "/data2/Penut/LLM-Backend/Testing.py", line 9, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 524, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 965, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/configuration_utils.py", line 632, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/configuration_utils.py", line 719, in _get_config_dict
    config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 81, in load_gguf_checkpoint
    reader = GGUFReader(gguf_checkpoint_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/gguf/gguf_reader.py", line 116, in __init__
    self._build_tensors(offs, tensors_fields)
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/gguf/gguf_reader.py", line 239, in _build_tensors
    ggml_type = GGMLQuantizationType(raw_dtype[0])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/enum.py", line 714, in __call__
    return cls.__new__(cls, value)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/enum.py", line 1137, in __new__
    raise ve_exc
ValueError: 30 is not a valid GGMLQuantizationType

I tried to add F16 to GGML_TYPES:

GGML_TYPES = {
    "F32": 0,
    "F16": 1,
    # ...
}

def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
        values = data
    elif ggml_type == GGML_TYPES["F16"]:
        # F16 is a native NumPy dtype, so the data should be usable as-is
        values = data
    # ...

I'm not sure if this is correct, but after converting to HF, the perplexity (PPL) is over 1000.
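
For reference, a perplexity check like the one mentioned above can be reproduced roughly as follows (a sketch only, reusing model_id/gguf_file from the reproduction section; the evaluation text below is a placeholder, not the data behind the reported number):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)

text = "The quick brown fox jumps over the lazy dog. " * 100  # placeholder corpus
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean causal-LM cross-entropy loss
    out = model(**enc, labels=enc["input_ids"])

print(f"PPL: {torch.exp(out.loss).item():.2f}")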

PenutChen (Contributor, Author) commented Jul 3, 2024

I found that the PPL issue is related to Llama3 or llama.cpp. It doesn't happen with TinyLlama. I'll create another issue to discuss if needed.

PenutChen (Contributor, Author) commented Jul 3, 2024

Supporting GGUF FP16 is straightforward. Since NumPy has no native BF16 dtype, my current workaround is to reinterpret the BF16 data with PyTorch and convert it to FP32, but it's not ideal to rely on PyTorch at this step.

Reference: main...PenutChen:transformers:main

import numpy as np

def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
        values = data
    elif ggml_type == GGML_TYPES["F16"]:
        # F16 is a native NumPy dtype, so the data can be used as-is
        values = data
    elif ggml_type == GGML_TYPES["BF16"]:
        # NumPy has no bfloat16 dtype, so reinterpret the raw bytes with PyTorch
        # and cast to float32 before handing the array back to NumPy
        import torch
        data_uint8 = data.view(np.uint8)
        tensor_uint8 = torch.from_numpy(data_uint8)
        values = tensor_uint8.view(torch.bfloat16).float().numpy()

Note that BF16 support requires modifying some code in gguf-py. Since the latest version of gguf-py from the llama.cpp repo doesn't work with the current HF integration (#31725), I modified the version from PyPI as follows:

class GGMLQuantizationType(IntEnum):
    F32  = 0
    F16  = 1
    BF16 = 30
    # ...

GGML_QUANT_SIZES = {
    GGMLQuantizationType.F32:  (1, 4),
    GGMLQuantizationType.F16:  (1, 2),
    GGMLQuantizationType.BF16: (1, 2),
    # ...
}
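
As an aside, the PyTorch dependency could in principle be avoided: a BF16 value is just the upper 16 bits of an IEEE-754 float32, so plain NumPy can widen it. A minimal sketch (an illustration, not part of the patch above; the helper name is made up):

import numpy as np

def bf16_to_fp32(data):
    # Reinterpret the raw buffer as uint16, widen to uint32, and shift the
    # BF16 bits into the upper half of a float32 word.
    bits = data.view(np.uint16).astype(np.uint32) << 16
    return bits.view(np.float32)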

LysandreJik (Member) commented:

Hey @SunMarc, would you have some bandwidth to take a look at this? :)

SunMarc (Member) commented Jul 3, 2024

Hey @PenutChen, thanks for your research! I think we should just support FP16 first, since supporting BF16 would require a new gguf release and the transformers gguf integration is not compatible with it yet. Let me know what you think! If you have some time, would you like to open a PR? Otherwise, I will do it!

PenutChen (Contributor, Author) commented Jul 4, 2024

@SunMarc Sure, I will do the necessary checks and open a PR! By the way, gguf-py on PyPI has not been updated for a long time; most llama.cpp developers seem to use gguf-py from source. If we want to improve this integration, I think we should discuss it with the llama.cpp developers.

Lin-xs commented Jul 24, 2024

> I found that the PPL issue is related to Llama3 or llama.cpp. It doesn't happen with TinyLlama. I'll create another issue to discuss if needed.

Hi @PenutChen,
Do you know the reason for this PPL issue? I also get a very large PPL when using a Llama-3 model dequantized from a Q4_K_M GGUF in transformers.

PenutChen (Contributor, Author) commented:

> Hi @PenutChen, do you know the reason for this PPL issue? I also get a very large PPL when using a Llama-3 model dequantized from a Q4_K_M GGUF in transformers.

Hi @Lin-xs, this might be related to an incorrect reverse-permutation implementation when dequantizing models that use GQA. This should be fixed in the latest version of Transformers by #31788.
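
For context: llama.cpp permutes the Q/K projection weights when it writes a GGUF, so transformers has to undo that permutation at load time, and for GQA models the K projection must be un-permuted with the number of key/value heads rather than the number of attention heads, which is what #31788 corrects. A rough sketch of the inverse permutation (illustrative only, not the exact transformers code):

import numpy as np

def reverse_permute(weights, n_head, n_kv_head=None):
    # Inverse of llama.cpp's permute(); for k_proj in a GQA model,
    # n_kv_head (num_key_value_heads) must be used instead of n_head.
    if n_kv_head is not None and n_head != n_kv_head:
        n_head = n_kv_head
    dim = weights.shape[0] // n_head // 2
    w = weights.reshape(n_head, dim, 2, *weights.shape[1:])
    return w.swapaxes(2, 1).reshape(weights.shape)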

Lin-xs commented Jul 24, 2024

> Hi @PenutChen, do you know the reason for this PPL issue? I also get a very large PPL when using a Llama-3 model dequantized from a Q4_K_M GGUF in transformers.

> Hi @Lin-xs, this might be related to an incorrect reverse-permutation implementation when dequantizing models that use GQA. This should be fixed in the latest version of Transformers by #31788.

It works, thanks!

SunMarc reopened this Jul 24, 2024

SunMarc (Member) commented Jul 24, 2024

Let's keep this open for bf16. After we fix the compatibility issue with the new gguf version, we can add bf16. cc @PenutChen

huggingface deleted a comment from the github-actions bot on Aug 19, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
