How to correctly use Prefix Tuning? #869
Comments
@pacman100 Would you mind taking a look? |
Have you solved this problem? |
@WhoopeeHg no unfortunately I decided to skip the prefix tuning part since I found that to be less effective than P-Tuning or LoRA on my dataset. |
Based on my discussion with others, the problem seems to surface when we load the model with 8 bit quantization. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
@pacman100 @BenjaminBossan this issue occurs even without 8 bit quantization and basically renders HF's integration of Prefix Tuning useless, unless this bug is fixed. |
@vikram71198 Could you please give more details? Ideally some code to reproduce the error and the PEFT version you're using. |
My environment:
This is the code snippet I'm using, which has been adapted from here:

import os
import random
import os
import numpy as np
import pandas as pd
import torch
import datasets
from torch.utils.data import Dataset, DataLoader
import transformers
import peft
import trl
import json
from pprint import pprint
import flash_attn
import accelerate
from transformers import BitsAndBytesConfig
from tqdm import tqdm as tqdm
import mlflow
max_length = 500
lr = 2e-4
num_epochs = 5
batch_size = 1
num_virtual_tokens = 30
random_seed = 42
model_name = "teknium/OpenHermes-2.5-Mistral-7B"
text_column = "Transcript"
label_column = "RFC"
def get_prompt(transcript: str) -> str:
prompt = """Transcript:
{transcript}
---
I want you to act as a transcript analysis expert. I have provided you with a transcript between agent & customer above and your goal is to summarize the reason why the customer calls up the agent. If there is no discernible reason, output "No reason identified".
Answer:"""
return prompt.format(transcript = transcript)
def get_mistral_prompt(transcript: str, system_message : str = "") -> str:
messages = [
{"role": "system", "content": system_message},
{"role": "user", "content": get_prompt(transcript)}
]
return tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, device_map = "auto", torch_dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.use_default_system_prompt = False
def preprocess_function(examples):
batch_size = len(examples[text_column])
# inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]]
inputs = [get_mistral_prompt(x) for x in examples[text_column]]
targets = [str(x) for x in examples[label_column]]
model_inputs = tokenizer(inputs)
labels = tokenizer(targets)
for i in range(batch_size):
sample_input_ids = model_inputs["input_ids"][i]
label_input_ids = labels["input_ids"][i] + [tokenizer.pad_token_id]
# print(i, sample_input_ids, label_input_ids)
model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])
# print(model_inputs)
for i in range(batch_size):
sample_input_ids = model_inputs["input_ids"][i]
label_input_ids = labels["input_ids"][i]
#padding if length of this example is smaller than max_seq_length
model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
max_length - len(sample_input_ids)
) + sample_input_ids
model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
"attention_mask"
][i]
labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
#if sequence length of this example exceeds max_seq_length, we're performing truncation here
model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
model_inputs["labels"] = labels["input_ids"]
return model_inputs
transcripts = ["""
Agent: Hello may I know why you're calling today?
Customer: Yeah, I'm calling to cancel my insurance policy
Agent: May I ask you why you chose to do so?
Customer: Yeah, I'm just not interested in the product you have to offer anymore
Agent: That is totally understandable""",
"""
Agent: Hello
Customer: Hello
Agent: Have a nice day
Customer: Thanks"""]
rfcs = ["Customer called to cancel their insurance policy.", "No reason identified."]
import pandas as pd
df = pd.DataFrame({"Transcript": transcripts, "RFC": rfcs})
from datasets import Dataset, DatasetDict
rfc_dataset = DatasetDict()
rfc_dataset["train"] = Dataset.from_pandas(df)
formatted_dataset = rfc_dataset.map(
preprocess_function,
batched=True,
num_proc=1,
remove_columns=rfc_dataset["train"].column_names,
load_from_cache_file=False,
desc="Running tokenizer on dataset",
)
formatted_dataset = formatted_dataset.shuffle(seed = random_seed)
train_dataset = formatted_dataset["train"]
from transformers import default_data_collator
train_dataloader = DataLoader(
train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
from peft import get_peft_config, get_peft_model, PrefixTuningConfig, TaskType, PeftType
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup
peft_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=num_virtual_tokens, prefix_projection = False)
model = get_peft_model(model, peft_config)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=(len(train_dataloader) * num_epochs),
)
run_name = "prefix-tuning-v1"
with mlflow.start_run(run_name = run_name):
for epoch in tqdm(range(num_epochs), total = num_epochs):
model.train()
total_loss = 0
for step, batch in enumerate(tqdm(train_dataloader)):
batch = {k: v.to(torch.device("cuda")) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs.loss
total_loss += loss.detach().float()
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
train_epoch_loss = total_loss / len(train_dataloader)
train_ppl = torch.exp(train_epoch_loss)
print(f"{epoch=}: {train_ppl=} {train_epoch_loss=}")
mlflow.log_metric("epoch", epoch + 1, step = epoch)
mlflow.log_metric("train_loss", train_epoch_loss.item(), step = epoch)
mlflow.log_metric("train perplexity", train_ppl.item(), step = epoch)
mlflow.end_run()

And this is the exact error message I'm seeing:

RuntimeError: The size of tensor a (530) must match the size of tensor b (500) at non-singleton dimension 3

The difference between 530 & 500 is 30, i.e. exactly num_virtual_tokens. There was some investigation done by @Vincent-Li-9701 here, but the bug is still unresolved. @BenjaminBossan please let me know if you need anything else. |
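For readers skimming the thread: here is a minimal, purely illustrative sketch (assumed shapes, not the actual PEFT/transformers internals) of where the two numbers in the error come from. The 30 virtual tokens supplied via past_key_values extend the key dimension to 530, while the mask built from the inputs still covers only 500 positions:

import torch

max_length, num_virtual_tokens = 500, 30
# "tensor a": attention scores over max_length queries and max_length + 30 keys
attn_scores = torch.zeros(1, 32, max_length, max_length + num_virtual_tokens)
# "tensor b": a mask built only from the 500 input positions
attn_mask = torch.zeros(1, 1, max_length, max_length)
try:
    attn_scores + attn_mask  # broadcasting fails on the last dimension
except RuntimeError as e:
    print(e)  # size of tensor a (530) must match the size of tensor b (500) ...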
Thanks for providing this reproducer. I could condense the code to the following:

import torch
from transformers import MistralConfig, MistralForCausalLM
from peft import PrefixTuningConfig, get_peft_model
# using small mistral for testing, real mistral would also work
model_config = MistralConfig(
vocab_size=32000,
hidden_size=512,
max_position_embeddings=32768,
num_attention_heads=16,
num_hidden_layers=8,
num_key_value_heads=4,
)
model = MistralForCausalLM(model_config)
config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=30)
model = get_peft_model(model, config)
model.config.use_cache = False
input_ids = torch.tensor([[1, 1, 1], [1, 2, 1]])
attention_mask = torch.tensor([[1, 1, 1], [1, 0, 1]])
outputs = model(input_ids=input_ids, attention_mask=attention_mask)

which gives the same error:
Unfortunately, even after some digging, I haven't been able to figure out how to fix this issue yet. I asked around; let's see if someone has a solution. |
One (important) difference between the implementations of Prefix Tuning & other PEFT techniques is evident in this snippet from PEFT's forward method:

if peft_config.peft_type == PeftType.PREFIX_TUNING:
past_key_values = self.get_prompt(batch_size)
return self.base_model(
input_ids=input_ids, inputs_embeds=inputs_embeds, past_key_values=past_key_values, **kwargs
)
else:
if inputs_embeds is None:
inputs_embeds = self.word_embeddings(input_ids)
# concat prompt labels
if labels is not None:
prefix_labels = torch.full((batch_size, peft_config.num_virtual_tokens), -100).to(labels.device)
kwargs["labels"] = torch.cat((prefix_labels, labels), dim=1)
prompts = self.get_prompt(batch_size=batch_size, task_ids=task_ids)
prompts = prompts.to(inputs_embeds.dtype)
inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1)
return self.base_model(inputs_embeds=inputs_embeds, **kwargs)

It seems like the Prefix Tuning branch passes the prompt via past_key_values instead of concatenating it to inputs_embeds as the other branch does. Could this be causing the issue? I picked up on this from #870. |
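A minimal sketch of the mask bookkeeping under discussion (an illustration, not the actual PEFT code): the attention mask would need num_virtual_tokens extra positions prepended, analogous to how prefix_labels is prepended to the labels in the snippet above, so that it covers the prefix passed via past_key_values:

import torch

def extend_attention_mask(attention_mask: torch.Tensor, num_virtual_tokens: int) -> torch.Tensor:
    # Prepend ones so the mask also covers the virtual-token prefix.
    batch_size = attention_mask.shape[0]
    prefix_mask = torch.ones(
        batch_size, num_virtual_tokens, dtype=attention_mask.dtype, device=attention_mask.device
    )
    return torch.cat((prefix_mask, attention_mask), dim=1)

mask = torch.tensor([[1, 1, 1], [1, 0, 1]])
print(extend_attention_mask(mask, 30).shape)  # torch.Size([2, 33])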
It could be related, but I don't know enough about these techniques to be sure. Whatever the cause is, it has to depend on the model architecture, as some models work and others don't. I modified the above snippet to check multiple models like so:

import torch
from transformers import MistralConfig, MistralForCausalLM, AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model
def get_model(name):
if name == "mistral":
model_config = MistralConfig(
vocab_size=32000,
hidden_size=512,
max_position_embeddings=32768,
num_attention_heads=16,
num_hidden_layers=8,
num_key_value_heads=4,
)
return MistralForCausalLM(model_config)
return AutoModelForCausalLM.from_pretrained(name)
for name in ("gpt2", "facebook/opt-125m", "bigscience/bloomz-560m", "HuggingFaceH4/tiny-random-LlamaForCausalLM", "mistral"):
config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=30)
model = get_model(name)
model = get_peft_model(model, config)
model.config.use_cache = False
input_ids = torch.tensor([[1, 1, 1], [1, 2, 1]])
attention_mask = torch.tensor([[1, 1, 1], [1, 0, 1]])
try:
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(f"PASS: model {name} passed")
except Exception as e:
print(f"FAIL: model {name} failed with {e}") and I got:
with transformers 4.38.1. Regarding the Llama error, that could be some kv-cache thing, as I get a different error with older transformers versions -- with 4.36 and 4.37, I got the same error as for mistral. Whether the mistral error could also be related to that, I'm not sure. Pinging @younesbelkada @pacman100 for help. |
I slightly modified your script to:

import torch
from transformers import MistralConfig, MistralForCausalLM, AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model
def get_model(name):
if name == "mistral":
model_config = MistralConfig(
vocab_size=32000,
hidden_size=512,
max_position_embeddings=32768,
num_attention_heads=16,
num_hidden_layers=8,
num_key_value_heads=4,
)
return MistralForCausalLM(model_config)
return AutoModelForCausalLM.from_pretrained(name)
for name in ("gpt2", "facebook/opt-125m", "bigscience/bloomz-560m", "HuggingFaceH4/tiny-random-LlamaForCausalLM", "mistral", "teknium/OpenHermes-2.5-Mistral-7B", "meta-llama/Llama-2-7b-chat-hf"):
config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=30)
model = get_model(name)
model = get_peft_model(model, config)
model = model.to(torch.device("cuda"))
model.config.use_cache = False
input_ids = torch.tensor([[1, 1, 1], [1, 2, 1]]).to(torch.device("cuda"))
attention_mask = torch.tensor([[1, 1, 1], [1, 0, 1]]).to(torch.device("cuda"))
try:
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(f"PASS: model {name} passed")
except Exception as e:
print(f"FAIL: model {name} failed with {e}")
finally:
torch.cuda.empty_cache()
model.to(torch.device("cpu"))
del model

With the script above, this is what I see with
So, it seems like
But clearly there's something nefarious going on. Hope we can find & fix this, because it locks the community out of using this fine-tuning method on many, many models that are actually more relevant now. |
So, I downgraded to
But, after fine-tuning, I mostly saw gibberish outputs from the fine-tuned model. I'm about 95% sure that there's nothing wrong with my implementation either & am starting to think that this bug in Prefix Tuning silently creeps up in cases where we don't get the aforementioned error.
So, it seems to me, that currently
Hoping someone has a fix for this. |
Any updates on this? |
same |
Good and bad news. With the latest transformers (4.43.2) and PEFT version (0.12.0), the tensor size mismatch error is no longer occurring. However, there is a new error:
This is because |
I attempted changing
Calling trainer.train() on a Mistral model is resulting in a shape mismatch during attention computation:
Any idea why this is? @BenjaminBossan |
@jrrw10 What version of transformers and PEFT are you using? Do you still get the same error when upgrading to the latest versions from |
@BenjaminBossan Yes, I get this error with the latest version of
|
Okay, strange. I tried with transformers commit and added

from transformers import DynamicCache
past_key_values = DynamicCache.from_legacy_cache(past_key_values)

before this line. Then I ran the example shown above and I get:
(note that |
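To make the cache format change concrete, here is a self-contained sketch with made-up shapes (the shapes and layer count are placeholders, not taken from the models above): the legacy past_key_values format is a tuple with one (key, value) pair per layer, and DynamicCache.from_legacy_cache wraps it in the Cache object that recent transformers versions expect:

import torch
from transformers import DynamicCache

batch, heads, virtual_tokens, head_dim, layers = 2, 8, 30, 64, 4
legacy = tuple(
    (torch.zeros(batch, heads, virtual_tokens, head_dim),
     torch.zeros(batch, heads, virtual_tokens, head_dim))
    for _ in range(layers)
)
cache = DynamicCache.from_legacy_cache(legacy)
print(cache.get_seq_length())  # 30 cached (virtual-token) positions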
@BenjaminBossan Thank you for testing this out. I get the same output with the above testing for a forward pass. While the forward pass tests are successful, calling trainer.train() still fails. Debugging prints in
Which then hits the mismatch:
It seems the mismatch occurs because the key length (2078) does not match the key length in the mask (1054). Do you have any insights on why this discrepancy might be occurring during |
I was experimenting in the I changed
since it was incorrectly sized before. This allowed training to get kicked off; however, I am getting strange output:
Similarly, if the model is loaded with
I did not have to edit anything in the respective class, and training kicked off with an almost identical constant loss (typical learning rate, among other hyperparameters). Any help/update is much appreciated. |
Thanks @jrrw10 for this continued investigation. I tried training mistral and did not run into the errors you reported; only the DynamicCache fix mentioned above was needed. This is the script I used:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from peft import LoraConfig, PrefixTuningConfig, get_peft_model
model_id = "mistralai/Mistral-7B-v0.1"
# "teknium/OpenHermes-2.5-Mistral-7B" works as well
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map=0,
torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=10)
model = get_peft_model(model, config)
def process(samples):
tokenized = tokenizer(samples["quote"], truncation=True, max_length=128)
return tokenized
data = load_dataset("ybelkada/english_quotes_copy")
data = data.map(process, batched=True)
trainer = Trainer(
model=model,
train_dataset=data["train"],
args=TrainingArguments(
num_train_epochs=1,
per_device_train_batch_size=4,
bf16=True,
learning_rate=3e-4,
logging_steps=10,
output_dir="/tmp/peft/869",
),
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

This is the loss that I get:
Any idea what the difference could be? |
@BenjaminBossan I've found the issue(s) I was experiencing. The difference between our implementations is that I am using gradient checkpointing like so:
This should be reproducible for you in your example above if you add that line. Here's the error:
Which is the exact error that I was experiencing before. However, when I manually fixed the attention mask size mismatch like I mentioned before, I was also met with the constant loss problem I showed above. I'm having problems reproducing this; maybe this was fixed in a recent commit, I'm not sure. In conclusion, I got my implementation to work by adding the |
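The gradient-checkpointing line itself is not visible in the rendered thread; as an assumption about what "like so" refers to, enabling it on the model from the training script above would look roughly like this:

# Assumed reconstruction, not quoted from the comment: enable gradient
# checkpointing on the `model` defined in the training script above.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # caching is usually disabled alongside it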
Thanks for digging deeper. Indeed, with gradient checkpointing, there is an issue. IIUC, adjusting the size of the causal mask is not the correct solution though, which could explain the bad losses you see. I printed the layer index, as well as the shapes of
The shapes are all correct except for the last line: we have a sequence length of 31 before applying the cache and 41 after applying it, which is expected as we add 10 virtual tokens. Now let's check the last line. As we can see, it's again layer 31, i.e. it looks like we're in the gradient-checkpointing phase. We can see the wrong length of 72 after updating. This is 31+41, i.e. the updated cache is added on top of the states, which IMO is not the correct way of handling this. It should be the same as the first time we visited this layer, i.e. 31+10. I wonder if gradient checkpointing + cache is generally broken or if we're using it incorrectly. |
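Restating the bookkeeping from the previous comment as plain arithmetic (numbers taken from the debug output above, nothing else assumed):

num_virtual_tokens = 10  # prefix length
seq_len = 31             # hidden-states length at layer 31

first_visit = num_virtual_tokens + seq_len  # 41: cache length after the normal forward pass (correct)
recompute = first_visit + seq_len           # 72: the already-updated cache is updated again (the bug)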
@jrrw10 FYI: #1962 (comment). So the workaround with creating a cache object to hold the |
@BenjaminBossan Thanks for the detailed explanation of this. Just to confirm, Prefix Tuning does successfully train with the |
Great that it works for you. I'd just be cautious with this as it's not using transformers as intended. This can be risky because:
|
@jrrw10 do you mind sharing the final script here? I'm having one issue where the training loss goes down but the performance is bad when I do test inference; I suspect it's because I don't apply the chat template during training and inference. Do you optimize the prefix with the chat template applied, or without? @BenjaminBossan do you know how your provided code could be modified to use chat models where templates need to be applied? |
Did you try calling |
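For reference, a minimal sketch of applying a chat template consistently at training and inference time (the model id here is only an example, not one prescribed by the thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [{"role": "user", "content": "Summarize why the customer called."}]
# tokenize=False returns the formatted prompt string; using the same call at
# inference time keeps the prompt format identical to training.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")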
@BenjaminBossan I got it sorted out (: thank you for the reply. It seems what was pushing me back was a very weird error I got with llama models (e.g. TinyLlama/TinyLlama-1.1B-Chat-v1.0), where calling model(**batch), with batch containing the input_ids, labels and attention_mask values, was leading to a weird dimensions error.
Several people have reported having a similar error with no solution yet :-( |
You mean independent of PEFT? |
@BenjaminBossan yep! Seems something related to the modeling files. |
Just an update to all who encountered problems with this in the past: Could you please upgrade to the latest PEFT (v0.13.2+) and transformers versions (v4.45.2+) and report back if you're still encountering issues? |
See #869, #1962 Fix several issues caused by changes to cache in transformers. In particular, past_key_values for prefix tuning is now converted to a transformers Cache instance. --------- Co-authored-by: Raushan Turganbay <[email protected]>
@BenjaminBossan The error is still present.
My env:
What I'm doing: train Qwen2.5-72B-Instruct with Prefix-tuning.

My code:

import peft
from transformers import AutoTokenizer, TrainingArguments, HfArgumentParser, AutoModelForCausalLM, Trainer
from dataclasses import dataclass, field
from peft import get_peft_model, PrefixTuningConfig
from typing import Dict, List
import warnings
import torch
import os
from sft_dataset import make_supervised_data_module, _print_rank
warnings.filterwarnings("ignore")
os.environ["WANDB_SILENT"] = "true"
@dataclass
class CustomizeArguments:
model_name_or_path: str = field(default=None)
tokenizer: str = field(default=None)
data_path: List[str] = field(default=None)
eval_path: str = field(default=None)
max_sft_length: int = field(default=None)
# Prefix-tuning
pft_enable: bool = field(default=False)
pft_num_virtual_tokens: int = field(default=None)
@dataclass
class TrainingArguments(TrainingArguments):
optim: str = field(default="adamw_torch")
def train():
parser = HfArgumentParser((TrainingArguments, CustomizeArguments))
training_args, customize_args = parser.parse_args_into_dataclasses()
tokenizer = AutoTokenizer.from_pretrained(customize_args.tokenizer, use_fast=False)
data_modules = make_supervised_data_module(tokenizer, customize_args.data_path, customize_args.eval_path,customize_args.max_sft_length or tokenizer.model_max_length)
model = AutoModelForCausalLM.from_pretrained(customize_args.model_name_or_path)
model.gradient_checkpointing_enable()
assert customize_args.lora_enable and not customize_args.pft_enable or not customize_args.lora_enable and customize_args.pft_enable or not (customize_args.lora_enable or customize_args.pft_enable), "Only one of lora and pft can be enabled."
if customize_args.pft_enable:
_print_rank("Prefix-tuning is enabled.")
pft_config = PrefixTuningConfig(
peft_type=peft.PeftType.PREFIX_TUNING,
inference_mode=False,
task_type="CAUSAL_LM",
num_virtual_tokens=customize_args.pft_num_virtual_tokens,
)
model = get_peft_model(model, pft_config)
model.print_trainable_parameters()
else:
_print_rank("Fine-tuning full parameters.")
trainer = Trainer(
model=model,
tokenizer=tokenizer,
args=training_args,
**data_modules,
)
print("Start training...")
trainer.train()
model.save_pretrained(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)
if __name__ == "__main__":
train()

The error information:
|
When I remove the line |
Thanks for reporting @DavdGao. What model did you use that led to the error? Could you perhaps share how you called the script? Also, we recently merged some fixes to PEFT, so if you could install from source and check again, that would be helpful to know. |
#!/bin/bash
set -e
export CUTLASS_PATH=/cpfs/data/gaodawei.gdw/cutlass
# task
path_data=${1}
path_model=${2}
path_tokenizer=${3}
# params
bs=${4}
lr=${5}
wd=${6}
epo=${7}
pft_enable=${8}
pft_num_virtual_tokens=${9}
ds_config=${10}
name_run=${11}
eval_dir=${12}
filename_with_extension=${path_data##*/}
filename_without_extension=${filename_with_extension%.*}
second_last_dir="${path_data%/*}"
second_last_dir="${second_last_dir##*/}"
third_last_dir="${path_data%/*/*}"
third_last_dir="${third_last_dir##*/}"
name_data="${third_last_dir}_${second_last_dir}_${filename_without_extension}"
name_run=${name_run}/${bs}bs_${lr}lr_${wd}wd_${epo}epo
if [ "$pft_enable" = "True" ]; then
name_run=${name_run}_${pft_num_virtual_tokens}vtoken
fi
echo "Task: $name_run"
path_save=/home/data/shared/checkpoints/prefix-tuning/${name_run}
mkdir -p ${path_save}
# wandb
WANDB_PROJECT=PFT
WANDB_NAME=${name_run}
ROOT=/cpfs/data/gaodawei.gdw/train
path_ds_config=${ds_config}
cd ${ROOT}
gas=$((${bs}/8))
deepspeed --num_gpus 8 --num_nodes 1 --master_port 5900 \
${ROOT}/sft.py \
--model_name_or_path ${path_model} \
--tokenizer ${path_tokenizer} \
--do_train \
--data_path ${path_data} \
--eval_strategy "no"\
--bf16 True \
--output_dir ${path_save} \
--num_train_epochs ${epo} \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps ${gas} \
--save_strategy "epoch" \
--save_total_limit 99999 \
--learning_rate ${lr} \
--weight_decay ${wd} \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 10 \
--log_level info \
--tf32 True \
--wandb_project ${WANDB_PROJECT} \
--wandb_name ${WANDB_NAME} \
--deepspeed ${path_ds_config} \
--save_only_model True \
--pft_enable ${pft_enable} \
--pft_num_virtual_tokens ${pft_num_virtual_tokens} 2>&1 | tee ${path_save}/training_log.txt
#!/bin/bash
set -e
name_run=Prefix-Tuning_Qwen2.5-72B-Instruct_v1
path_data=/cpfs/data/gaodawei.gdw/data/train/v3
train_dir=${path_data}/split_train
eval_dir=${path_data}/split_val
path_model=/home/data/shared/checkpoints/qwen/qwen2.5/Qwen2.5-7B-Instruct
ds_config=/cpfs/data/gaodawei.gdw/scripts/sft/multi_node/configs/ds_config_stage0.json
path_tokenizer=${path_model}
global_batch_size=32
pft_enable=True
LRS=(1e-5 5e-5 1e-4)
WDS=(0.01)
EPOS=(10)
VTS=(10 50 100)
for epo in ${EPOS[*]};
do
for wd in ${WDS[*]};
do
for lr in ${LRS[*]};
do
for pft_num_virtual_tokens in ${VTS[*]};
do
bash /cpfs/data/gaodawei.gdw/train/scripts_prefix-tuning/basic.sh \
"${train_dir}" \
${path_model} \
${path_tokenizer} \
${global_batch_size} \
${lr} \
${wd} \
${epo} \
${pft_enable} \
${pft_num_virtual_tokens} \
${ds_config} \
${name_run} \
${eval_dir}
echo "######################################################################"
done
done
done
done |
I have tried to install the latest PEFT from source. When I enable
A similar error occurs when I change the model to |
Thanks @DavdGao I can reproduce the error. I'll investigate and get back to you when I find something out. |
@BenjaminBossan Thank you for your assistance. I greatly appreciate it. Currently, I removed the line |
See huggingface#869 Since transformers is moving to the new cache implementation, we had to change prefix tuning to use this too. However, caching does not work with gradient checkpointing. Therefore, this currently runs into an error about size mismatches. Now, PEFT checks for gradient checkpointing and raises a helpful error.
@DavdGao After some investigation with colleagues, we came to the conclusion that, unfortunately, prefix tuning won't work with gradient checkpointing. The reason is that transformers has made some changes to caching, which is reflected in prefix tuning now using |
Thanks a lot. |
@DavdGao It is unfortunate that caching precludes the use of gradient checkpointing, thus resulting in higher memory usage. Not sure if you already tried quantization, but that should work with prefix tuning. |
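A minimal sketch of what that could look like (not a verified recipe; the model id and hyperparameters are placeholders):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PrefixTuningConfig, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=30)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # no gradient checkpointing enabled, see above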
System Info
peft 0.5.0
transformers 4.32.0
Who can help?
No response
Information
Tasks
examples folder
Reproduction
Expected behavior
I'm assuming num_layers, num_attention_heads, and token_dim need to match the base model. In the sample, num_transformer_submodules is 1. But encoder-decoder has two transformers, right? Should this be 2?
When I run the code above I got
When I print out the shape of position_bias and mask, mask has 100 more tokens than position_bias, seemingly on the decoder side. It's also taking in the prefix embeddings.
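For context on the original question, a hedged sketch of a seq2seq prefix-tuning setup (the model id is only an example; whether num_transformer_submodules should be 1 or 2 for encoder-decoder models is exactly what is asked above):

from transformers import AutoModelForSeq2SeqLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")  # example model
peft_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,
    # num_layers, num_attention_heads and token_dim are filled in from the base
    # model's config by get_peft_model if left unset.
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()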