minicpm-v-v2_5-chat fine-tuning runs out of GPU memory #1286
Comments
I had the same issue occur on an 80GB A100.
This problem is a bit hard to resolve. Do you have to train the vision encoder part? Are you on the main branch?
Yesterday you suggested commenting out --lora_target_modules ALL \ — the result is the same: memory overflows as soon as it reaches evaluation. I'm on the main branch, freshly pulled.
It was specified.
Same problem. I posted an issue report on both GitHub and the WeChat group weeks ago and no one responded. Anyway, I have figured out a way to "fix" this problem: set --evaluation_strategy no --batch_size 32. Any train batch size other than 32 combined with an evaluation pass will cause the program to use about 10 GB * eval_batch_size of GPU memory. And do eliminate the evaluation process, because it loads 30 GB+ of GPU memory from nowhere when the evaluation pass finishes. Another issue during training: even though the above setup fixes the problem temporarily, there is still a memory leak every 50 steps, when the program calculates GPU usage, epoch number, and similar statistics.
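Putting that workaround together, a minimal invocation might look like the sketch below. The two flags are the ones named in the comment above; the model_type matches the issue title, and the dataset path is a placeholder, not the reporter's actual data.

```bash
# Workaround sketch: disable evaluation and pin the train batch size to 32.
# --evaluation_strategy and --batch_size are the flags named in the comment;
# the dataset path is a placeholder and may differ from the reporter's setup.
swift sft \
    --model_type minicpm-v-v2_5-chat \
    --dataset path/to/train.jsonl \
    --batch_size 32 \
    --evaluation_strategy no
```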
@Jintao-Huang @tastelikefeet any thoughts on what the issue could be here? This model offers one of the better performance-to-GPU-size tradeoffs, so it would be great to be able to fine-tune it.
I've now found a new problem: I have a dataset that trains fine, but after sampling a subset of it, GPU memory overflows right at the start of training.
Training script: