
minicpm-v-v2_5-chat fine-tuning runs out of GPU memory #1286

Open
lyc728 opened this issue Jul 3, 2024 · 7 comments
Labels
bug Something isn't working

Comments


lyc728 commented Jul 3, 2024

Training script:

--model_type minicpm-v-v2_5-chat \
    --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
    --dataset /data/swift/finetune/train_0703.jsonl \
    --ddp_find_unused_parameters true \
    --output_dir /data/swift/us_desc/ \
    --batch_size 1 \
    --lora_target_modules ALL \
    --gradient_accumulation_steps 16 \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --num_train_epochs 5 \
    --save_strategy "steps" \
    --save_steps 500 \
    --sft_type "lora" \
    --save_total_limit 2 \
    --ddp_backend nccl \
    --save_only_model false
[screenshot attached: 企业微信截图_17199984306510]
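For reference, a minimal sketch of memory-side adjustments that could be tried with this script. The elided command head is assumed to be swift sft, and --max_length, --eval_batch_size, and --deepspeed default-zero2 are assumptions about the installed ms-swift version (verify with swift sft --help):

    # Sketch only, not a confirmed fix: cap sequence length, keep eval batches
    # at 1, and shard optimizer state with ZeRO-2 across the visible GPUs.
    CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
        --model_type minicpm-v-v2_5-chat \
        --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
        --dataset /data/swift/finetune/train_0703.jsonl \
        --sft_type lora \
        --lora_target_modules ALL \
        --batch_size 1 \
        --eval_batch_size 1 \
        --max_length 2048 \
        --gradient_checkpointing true \
        --use_flash_attn true \
        --deepspeed default-zero2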

babla9 commented Jul 3, 2024

I had the same issue occur on an 80 GB A100.

Jintao-Huang (Collaborator) commented

This one is a bit hard to solve. Do you really have to train the vision encoder part?

Are you using the main branch?
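If the vision encoder does not need to be trained, one option is to narrow the LoRA targets. A minimal sketch, assuming this ms-swift version accepts the DEFAULT shorthand for --lora_target_modules (check swift sft --help) and that the command head is swift sft:

    # Sketch: swap ALL for DEFAULT so LoRA attaches only to the LLM's standard
    # linear layers; the vision encoder then carries no adapters and no gradients.
    swift sft \
        --model_type minicpm-v-v2_5-chat \
        --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
        --dataset /data/swift/finetune/train_0703.jsonl \
        --sft_type lora \
        --lora_target_modules DEFAULT \
        --batch_size 1 \
        --gradient_checkpointing true \
        --use_flash_attn true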

Jintao-Huang added the bug (Something isn't working) label on Jul 4, 2024
lyc728 (Author) commented Jul 4, 2024

> Do you really have to train the vision encoder part?

Yesterday you suggested commenting out --lora_target_modules ALL \ ; the result is the same: it overflows as soon as it reaches the validation step. I am using the main branch, freshly pulled.

lyc728 (Author) commented Jul 4, 2024

> Specify multiple GPUs: CUDA_VISIBLE_DEVICES=0,1,2,3

I did specify them.
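Worth noting: as far as I understand ms-swift's launcher (an assumption worth verifying against the docs), exposing several GPUs via CUDA_VISIBLE_DEVICES alone splits the model across them with device_map, while DDP data parallelism is only enabled when NPROC_PER_NODE is also set, roughly:

    # Sketch (assumed launcher behavior): NPROC_PER_NODE triggers a torchrun-style
    # DDP launch with one process per visible GPU.
    NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
        --model_type minicpm-v-v2_5-chat \
        --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
        --dataset /data/swift/finetune/train_0703.jsonl \
        --sft_type lora \
        --batch_size 1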


KasLoot commented Jul 4, 2024

Same problem here. I posted an issue report on both GitHub and the WeChat group weeks ago, and no one responded...

Anyway, I have figured out a way to "fix" this problem: set --evaluation_strategy no --batch_size 32. Any train batch size other than 32, combined with an evaluation pass, makes the program use about 10 GB * eval_batch_size of GPU memory. And do eliminate the evaluation pass entirely, because it allocates 30 GB+ of GPU memory out of nowhere when the evaluation finishes.
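Applied to the original script, that workaround would look roughly like this; the flag names come from the comment above, and the interaction described is KasLoot's observation rather than documented behavior:

    # Sketch of the workaround described above: disable evaluation entirely
    # and pin the train batch size to 32 (command head assumed to be swift sft).
    swift sft \
        --model_type minicpm-v-v2_5-chat \
        --model_id_or_path /data/MiniCPM-V/pretrained/MiniCPM-Llama3-V-2_5 \
        --dataset /data/swift/finetune/train_0703.jsonl \
        --sft_type lora \
        --evaluation_strategy no \
        --batch_size 32 \
        --gradient_checkpointing true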

Another issue when training: even with the setup above as a temporary fix, there is still a memory leak every 50 steps, when the program computes GPU usage, epoch number, and similar statistics.


babla9 commented Jul 9, 2024

@Jintao-Huang @tastelikefeet any thoughts on what the issue could be here? This model has one of the better performance-to-GPU-memory tradeoffs, so it would be great to be able to fine-tune it.

lyc728 (Author) commented Jul 11, 2024

I have now hit a new problem: I have a dataset that trains fine, but after sampling a subset of it, training exceeds GPU memory right at the start.
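One quick sanity check for the sampled subset: see whether it contains an unusually long sample, since the longest sequences in the early batches drive peak activation memory. The path below is the one from the original script; adjust as needed:

    # Print the maximum line length in the jsonl; a large outlier suggests one
    # very long sample is blowing up the first training steps (GNU coreutils).
    wc -L /data/swift/finetune/train_0703.jsonl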
