
Converting gguf fp16 & bf16 to hf is not supported. #31762

Closed
2 of 4 tasks
PenutChen opened this issue Jul 3, 2024 · 10 comments · Fixed by #31783
Comments

PenutChen (Contributor) commented Jul 3, 2024

System Info

transformers==4.42.3
torch==2.3.0
numpy==1.26.4
gguf==0.6.0

Who can help?

@SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import os
from transformers import AutoModelForCausalLM

gguf_path = "path/to/llama3-8b.fp16.gguf"  # or bf16

# model_id points at the directory containing the GGUF file; the file name is passed via gguf_file
model_id = os.path.dirname(gguf_path)
gguf_file = os.path.basename(gguf_path)

model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)

Expected behavior

Besides the quantized types, only F32 is implemented; FP16 and BF16 are not yet supported.

fp16 error log:

Converting and de-quantizing GGUF tensors...:   0%|                         | 0/291 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data2/Penut/LLM-Backend/Testing.py", line 9, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3583, in from_pretrained
    state_dict = load_gguf_checkpoint(gguf_path, return_tensors=True)["tensors"]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 146, in load_gguf_checkpoint
    weights = load_dequant_gguf_tensor(shape=shape, ggml_type=tensor.tensor_type, data=tensor.data)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/integrations/ggml.py", line 507, in load_dequant_gguf_tensor
    raise NotImplementedError(
NotImplementedError: ggml_type 1 not implemented - please raise an issue on huggingface transformers: https://github.com/huggingface/transformers/issues/new/choose

bf16 error log:

Traceback (most recent call last):
  File "/data2/Penut/LLM-Backend/Testing.py", line 9, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 524, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 965, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/configuration_utils.py", line 632, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/configuration_utils.py", line 719, in _get_config_dict
    config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 81, in load_gguf_checkpoint
    reader = GGUFReader(gguf_checkpoint_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/gguf/gguf_reader.py", line 116, in __init__
    self._build_tensors(offs, tensors_fields)
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/site-packages/gguf/gguf_reader.py", line 239, in _build_tensors
    ggml_type = GGMLQuantizationType(raw_dtype[0])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/enum.py", line 714, in __call__
    return cls.__new__(cls, value)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data2/Penut/.miniconda/envs/Py311/lib/python3.11/enum.py", line 1137, in __new__
    raise ve_exc
ValueError: 30 is not a valid GGMLQuantizationType

I tried to add F16 to GGML_TYPES:

GGML_TYPES = {
    "F32": 0,
    "F16": 1,
    # ...
}

def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
        values = data
    elif ggml_type == GGML_TYPES["F16"]:
        # F16 is a native NumPy dtype, so the data should be usable as-is
        values = data
    # ...

I'm not sure if this is correct, but after converting to HF, the perplexity (PPL) is over 1000.
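
For reference, a perplexity check like the one mentioned above can be reproduced roughly as follows (a sketch only, reusing model_id/gguf_file from the reproduction section; the evaluation text below is a placeholder, not the data behind the reported number):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)

text = "The quick brown fox jumps over the lazy dog. " * 100  # placeholder corpus
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean causal-LM cross-entropy loss
    out = model(**enc, labels=enc["input_ids"])

print(f"PPL: {torch.exp(out.loss).item():.2f}")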

PenutChen (Contributor, Author) commented Jul 3, 2024

I found that the PPL issue is related to Llama3 or llama.cpp. It doesn't happen with TinyLlama. I'll create another issue to discuss if needed.

PenutChen (Contributor, Author) commented Jul 3, 2024

Supporting GGUF FP16 is straightforward. Since NumPy has no native BF16 dtype, my current workaround is to reinterpret the BF16 data with PyTorch and convert it to FP32, but it's not ideal to rely on PyTorch at this step.

Reference: main...PenutChen:transformers:main

import numpy as np

def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
        values = data
    elif ggml_type == GGML_TYPES["F16"]:
        # F16 is a native NumPy dtype, so the data can be used as-is
        values = data
    elif ggml_type == GGML_TYPES["BF16"]:
        # NumPy has no bfloat16 dtype, so reinterpret the raw bytes with PyTorch
        # and cast to float32 before handing the array back to NumPy
        import torch
        data_uint8 = data.view(np.uint8)
        tensor_uint8 = torch.from_numpy(data_uint8)
        values = tensor_uint8.view(torch.bfloat16).float().numpy()

Note that BF16 support requires modifying some code in gguf-py. Since the latest version of gguf-py from the llama.cpp repo doesn't work with the current HF integration (#31725), I modified the version from PyPI as follows:

class GGMLQuantizationType(IntEnum):
    F32  = 0
    F16  = 1
    BF16 = 30
    # ...

GGML_QUANT_SIZES = {
    GGMLQuantizationType.F32:  (1, 4),
    GGMLQuantizationType.F16:  (1, 2),
    GGMLQuantizationType.BF16: (1, 2),
    # ...
}
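
As an aside, the PyTorch dependency could in principle be avoided: a BF16 value is just the upper 16 bits of an IEEE-754 float32, so plain NumPy can widen it. A minimal sketch (an illustration, not part of the patch above; the helper name is made up):

import numpy as np

def bf16_to_fp32(data):
    # Reinterpret the raw buffer as uint16, widen to uint32, and shift the
    # BF16 bits into the upper half of a float32 word.
    bits = data.view(np.uint16).astype(np.uint32) << 16
    return bits.view(np.float32)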

LysandreJik (Member) commented:

Hey @SunMarc, would you have some bandwidth to take a look at this? :)

SunMarc (Member) commented Jul 3, 2024

Hey @PenutChen, thanks for your research! I think we should just support FP16 first, since supporting BF16 would require a new gguf release and the transformers gguf integration is not compatible with it yet. Let me know what you think! If you have some time, would you like to open a PR? Otherwise, I will do it!

PenutChen (Contributor, Author) commented Jul 4, 2024

@SunMarc Sure, I will do the necessary checks and open a PR! By the way, gguf-py on PyPI has not been updated for a long time; most llama.cpp developers seem to use gguf-py from source. If we want to improve this integration, I think we should discuss it with the llama.cpp developers.

Lin-xs commented Jul 24, 2024

> I found that the PPL issue is related to Llama3 or llama.cpp. It doesn't happen with TinyLlama. I'll create another issue to discuss if needed.

Hi @PenutChen,
Do you know the reason for this PPL issue? I also get a very large PPL when using a Llama-3 model dequantized from a Q4_K_M GGUF in transformers.

PenutChen (Contributor, Author) commented:

> Hi @PenutChen, do you know the reason for this PPL issue? I also get a very large PPL when using a Llama-3 model dequantized from a Q4_K_M GGUF in transformers.

Hi @Lin-xs, this might be related to an incorrect reverse-permutation implementation when dequantizing models that use GQA. This should be fixed in the latest version of Transformers by #31788.
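
For context: llama.cpp permutes the Q/K projection weights when it writes a GGUF, so transformers has to undo that permutation at load time, and for GQA models the K projection must be un-permuted with the number of key/value heads rather than the number of attention heads, which is what #31788 corrects. A rough sketch of the inverse permutation (illustrative only, not the exact transformers code):

import numpy as np

def reverse_permute(weights, n_head, n_kv_head=None):
    # Inverse of llama.cpp's permute(); for k_proj in a GQA model,
    # n_kv_head (num_key_value_heads) must be used instead of n_head.
    if n_kv_head is not None and n_head != n_kv_head:
        n_head = n_kv_head
    dim = weights.shape[0] // n_head // 2
    w = weights.reshape(n_head, dim, 2, *weights.shape[1:])
    return w.swapaxes(2, 1).reshape(weights.shape)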

Lin-xs commented Jul 24, 2024

> Hi @PenutChen, do you know the reason for this PPL issue? I also get a very large PPL when using a Llama-3 model dequantized from a Q4_K_M GGUF in transformers.

> Hi @Lin-xs, this might be related to an incorrect reverse-permutation implementation when dequantizing models that use GQA. This should be fixed in the latest version of Transformers by #31788.

It works, thanks!

SunMarc reopened this Jul 24, 2024

SunMarc (Member) commented Jul 24, 2024

Let's keep this open for bf16. After we fix the compatibility issue with the new gguf version, we can add bf16. cc @PenutChen

huggingface deleted a comment from the github-actions bot on Aug 19, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
