
Is multi-GPU inference supported? Single-GPU memory is limited #587

Open
shenshaowei opened this issue Sep 24, 2024 · 11 comments
Labels: question (Further information is requested)

Comments


shenshaowei commented Sep 24, 2024

My GPU memory is insufficient; I only have a 4090D and set bits to 16. If I want to run inference with the fine-tuned OneKE model on multiple GPUs, is there single-node multi-GPU inference code available? What command should I use?

CUDA_VISIBLE_DEVICES=0,1 python src/inference.py \
    --stage sft \
    --model_name_or_path '/data/shensw/model/OneKE' \
    --checkpoint_dir '/data/shensw/DeepKE/example/llm/InstructKGC/lora/oneke_ner_cmeee_add/checkpoint-2810' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --do_predict \
    --input_file 'data/NER/ner_cmeee_add.json' \
    --output_file 'results/ner_cmeee_add.json' \
    --finetuning_type lora \
    --output_dir 'result' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 16

单卡推理显存不够,报OOM了:torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 23.65 GiB total capacity; 23.21 GiB already allocated; 33.88 MiB free; 23.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Looking forward to your reply.

shenshaowei added the question label on Sep 24, 2024

shenshaowei commented Sep 24, 2024

Hello, could someone please take a look at this?

guihonghao (Contributor) commented:

Multi-GPU inference is not currently supported.
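For reference, here is a minimal sketch of how the same base model and LoRA checkpoint could be sharded across the two visible GPUs using plain transformers/accelerate and peft, bypassing src/inference.py entirely. The prompt handling and generation arguments are illustrative assumptions, not the DeepKE script's behavior:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = '/data/shensw/model/OneKE'
lora_ckpt = '/data/shensw/DeepKE/example/llm/InstructKGC/lora/oneke_ner_cmeee_add/checkpoint-2810'

tokenizer = AutoTokenizer.from_pretrained(base_model)
# device_map="auto" lets accelerate split the layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, lora_ckpt)  # attach the LoRA adapter
model.eval()

prompt = "..."  # build the llama2_zh-style instruction prompt yourself
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(output[0], skip_special_tokens=True))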

shenshaowei (Author) commented:

OK, thanks. In that case, is there code available for quantized inference after training?

guihonghao (Contributor) commented:

shenshaowei (Author) commented:

ok

shenshaowei (Author) commented:

output_dir='lora/oneke-bio-8-add'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=1287 src/finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path '/data/shensw/model/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file '/data/shensw/DeepKE/example/llm/InstructKGC/data/NER/bio8-data/bio_train.json' \
    --valid_file '/data/shensw/DeepKE/example/llm/InstructKGC/data/NER/bio8-data/bio_dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 8 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 300 \
    --cutoff_len 500 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --bf16 \
    --bits 8
Inference after training:
    CUDA_VISIBLE_DEVICES=0 python src/inference.py \
    --stage sft \
    --model_name_or_path '/data/shensw/model/OneKE' \
    --checkpoint_dir '/data/shensw/DeepKE/example/llm/InstructKGC/lora/oneke-bio-8-add/checkpoint-890' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --do_predict \
    --input_file '/data/shensw/DeepKE/example/llm/InstructKGC/data/NER/bio8-data/bio_test.json' \
    --output_file 'results/bio_test.json' \
    --finetuning_type lora \
    --output_dir 'result' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 8

Error: RuntimeError: expected scalar type Float but found BFloat16
What could be causing this?
Changing --bf16 to --fp16 still gives the same error.
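One possible cause, offered as an assumption rather than something confirmed from the DeepKE code: the bitsandbytes 8-bit (LLM.int8) kernels expect fp16/fp32 activations, so bf16 tensors reaching them can raise exactly this "expected scalar type Float but found BFloat16" error. A minimal sketch of an 8-bit load with an explicit fp16 dtype for the non-quantized modules, using transformers/peft directly rather than src/inference.py:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model = '/data/shensw/model/OneKE'
lora_ckpt = '/data/shensw/DeepKE/example/llm/InstructKGC/lora/oneke-bio-8-add/checkpoint-890'

# Load the base weights in 8-bit; keep the non-quantized modules in fp16 so
# the int8 kernels never see bf16 activations.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, lora_ckpt)  # attach the LoRA adapter
model.eval()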


shenshaowei commented Sep 24, 2024

After changing --bits 8 to --bits 4 it runs. Looking at the code, for bits values below 16, 4 and 8 select different quantization methods, both via bitsandbytes. Could this be a version issue?
Also, when running with --bits 4, why is it more than twice as slow as --bits 16? I'm puzzled and would appreciate an explanation.
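For what it's worth, in the transformers/bitsandbytes API the 4-bit and 8-bit paths are configured through different options; below is a hedged sketch of an explicit 4-bit (NF4) load, which may or may not match what --bits 4 selects inside the DeepKE code:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style NF4 quantization with fp16 compute and double quantization.
bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    '/data/shensw/model/OneKE',
    quantization_config=bnb_4bit,
    device_map="auto",
)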

guihonghao (Contributor) commented:

--bits 8 sometimes produces the error above; we recommend using --bits 4.


shenshaowei commented Sep 24, 2024

When running with --bits 4, why is it more than twice as slow as --bits 16? With bits 16 each sample takes about 2 s, but with bits 4 each sample takes about 5 s. Shouldn't quantization normally make things faster? I'm puzzled and would appreciate an explanation.
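One possible explanation, hedged and not verified against this setup: bitsandbytes 4-bit weights are dequantized block by block on every forward pass, so for single-sample generation the main benefit is memory, and it can easily be slower than a bf16 model. A minimal sketch for timing one generate() call, assuming model and tokenizer are already loaded as in the sketches above:

import time
import torch

def time_generate(model, tokenizer, prompt, max_new_tokens=300):
    # Wall-clock time of a single generate() call on one prompt.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"one sample took {time_generate(model, tokenizer, '...'):.2f} s")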

guihonghao (Contributor) commented:

I'm not sure what's going wrong here.

shenshaowei (Author) commented:

I see. When you run the quantized model, doesn't it slow down for you as well?
