
minicpm-v-v2_5-chat fine-tuning runs out of GPU memory #1286

Open
lyc728 opened this issue Jul 3, 2024 · 7 comments
Labels
bug Something isn't working

Comments


lyc728 commented Jul 3, 2024

Training script:

--model_type minicpm-v-v2_5-chat \
    --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
    --dataset /data/swift/finetune/train_0703.jsonl \
    --ddp_find_unused_parameters true \
    --output_dir /data/swift/us_desc/ \
    --batch_size 1 \
    --lora_target_modules ALL \
    --gradient_accumulation_steps 16 \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --num_train_epochs 5 \
    --save_strategy "steps" \
    --save_steps 500 \
    --sft_type "lora" \
    --save_total_limit 2 \
    --ddp_backend nccl \
    --save_only_model false
[screenshot attached: 企业微信截图_17199984306510]
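For reference, a minimal sketch of memory-side adjustments that could be tried with this script. The elided command head is assumed to be swift sft, and --max_length, --eval_batch_size, and --deepspeed default-zero2 are assumptions about the installed ms-swift version (verify with swift sft --help):

    # Sketch only, not a confirmed fix: cap sequence length, keep eval batches
    # at 1, and shard optimizer state with ZeRO-2 across the visible GPUs.
    CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
        --model_type minicpm-v-v2_5-chat \
        --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
        --dataset /data/swift/finetune/train_0703.jsonl \
        --sft_type lora \
        --lora_target_modules ALL \
        --batch_size 1 \
        --eval_batch_size 1 \
        --max_length 2048 \
        --gradient_checkpointing true \
        --use_flash_attn true \
        --deepspeed default-zero2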

babla9 commented Jul 3, 2024

I had the same issue occur on an 80 GB A100.

Jintao-Huang (Collaborator) commented

This one is a bit hard to solve. Do you really have to train the vision encoder part?

Are you using the main branch?
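If the vision encoder does not need to be trained, one option is to narrow the LoRA targets. A minimal sketch, assuming this ms-swift version accepts the DEFAULT shorthand for --lora_target_modules (check swift sft --help) and that the command head is swift sft:

    # Sketch: swap ALL for DEFAULT so LoRA attaches only to the LLM's standard
    # linear layers; the vision encoder then carries no adapters and no gradients.
    swift sft \
        --model_type minicpm-v-v2_5-chat \
        --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
        --dataset /data/swift/finetune/train_0703.jsonl \
        --sft_type lora \
        --lora_target_modules DEFAULT \
        --batch_size 1 \
        --gradient_checkpointing true \
        --use_flash_attn true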

Jintao-Huang added the bug (Something isn't working) label on Jul 4, 2024
lyc728 (Author) commented Jul 4, 2024

> Do you really have to train the vision encoder part?

Yesterday you suggested commenting out --lora_target_modules ALL \ ; the result is the same: it overflows as soon as it reaches the validation step. I am using the main branch, freshly pulled.

lyc728 (Author) commented Jul 4, 2024

> Specify multiple GPUs: CUDA_VISIBLE_DEVICES=0,1,2,3

I did specify them.
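Worth noting: as far as I understand ms-swift's launcher (an assumption worth verifying against the docs), exposing several GPUs via CUDA_VISIBLE_DEVICES alone splits the model across them with device_map, while DDP data parallelism is only enabled when NPROC_PER_NODE is also set, roughly:

    # Sketch (assumed launcher behavior): NPROC_PER_NODE triggers a torchrun-style
    # DDP launch with one process per visible GPU.
    NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
        --model_type minicpm-v-v2_5-chat \
        --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
        --dataset /data/swift/finetune/train_0703.jsonl \
        --sft_type lora \
        --batch_size 1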


KasLoot commented Jul 4, 2024

Same problem here. I posted an issue report on both GitHub and the WeChat group weeks ago, and no one responded...

Anyway, I have figured out a way to "fix" this problem: set --evaluation_strategy no --batch_size 32. Any train batch size other than 32, combined with an evaluation pass, makes the program use about 10 GB * eval_batch_size of GPU memory. And do eliminate the evaluation pass entirely, because it allocates 30 GB+ of GPU memory out of nowhere when the evaluation finishes.
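Applied to the original script, that workaround would look roughly like this; the flag names come from the comment above, and the interaction described is KasLoot's observation rather than documented behavior:

    # Sketch of the workaround described above: disable evaluation entirely
    # and pin the train batch size to 32 (command head assumed to be swift sft).
    swift sft \
        --model_type minicpm-v-v2_5-chat \
        --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
        --dataset /data/swift/finetune/train_0703.jsonl \
        --sft_type lora \
        --evaluation_strategy no \
        --batch_size 32 \
        --gradient_checkpointing true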

Another issue when training: even with the setup above as a temporary fix, there is still a memory leak every 50 steps, when the program computes GPU usage, epoch number, and similar statistics.


babla9 commented Jul 9, 2024

@Jintao-Huang @tastelikefeet any thoughts on what the issue could be here? This model has one of the better performance-to-GPU-memory tradeoffs, so it would be great to be able to fine-tune it.

lyc728 (Author) commented Jul 11, 2024

I have now hit a new problem: I have a dataset that trains fine, but after sampling a subset of it, training exceeds GPU memory right at the start.
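One quick sanity check for the sampled subset: see whether it contains an unusually long sample, since the longest sequences in the early batches drive peak activation memory. The path below is the one from the original script; adjust as needed:

    # Print the maximum line length in the jsonl; a large outlier suggests one
    # very long sample is blowing up the first training steps (GNU coreutils).
    wc -L /data/swift/finetune/train_0703.jsonl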
