
[Bug] Unable to run multi-GPU data-parallel evaluation #1755

Open
luhairong11 opened this issue Dec 11, 2024 · 3 comments

@luhairong11

Prerequisites

Problem type

I am evaluating an officially supported task/model/dataset.

Environment

Command

CUDA_VISIBLE_DEVICES=6,7 opencompass --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen -a vllm --max-num-worker 2

Running the model above should load the config at the following path:

./opencompass-main/opencompass/configs/models/qwen2_5/vllm_qwen2_5_0_5b_instruct.py
[screenshot]
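For reference, a minimal sketch of roughly what an OpenCompass vLLM model config at that path tends to look like; the class and field names follow common OpenCompass conventions, but the specific values (model path, batch size, output length) are assumptions rather than the actual file contents:

```python
# Hypothetical sketch of a vLLM model config in the OpenCompass style;
# the real vllm_qwen2_5_0_5b_instruct.py may differ in its exact values.
from opencompass.models import VLLMwithChatTemplate

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='qwen2.5-0.5b-instruct-vllm',
        path='Qwen/Qwen2.5-0.5B-Instruct',
        model_kwargs=dict(tensor_parallel_size=1),   # GPUs per vLLM engine (tensor parallel)
        max_out_len=4096,
        batch_size=16,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1),                    # GPUs reserved per task by the runner
    )
]
```

In principle, with run_cfg num_gpus=1 and two visible GPUs, --max-num-worker 2 should leave room for two data-parallel tasks.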

However, only one GPU is actually being used.

[screenshot]

Reproducing the issue - code/configuration example

See above.

Reproducing the issue - command or script

See above.

Reproducing the issue - error message

See above.

Other information

No response

@luhairong11
Author

With the following command for multi-GPU data parallelism (single model, multiple datasets), I get the expected behavior:

CUDA_VISIBLE_DEVICES=4,5,6,7 python3 run.py --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen bbh_gen --max-num-worker 4

But with the following command for multi-GPU data parallelism (multiple models, multiple datasets), the behavior is not as expected:

CUDA_VISIBLE_DEVICES=4,5,6,7 python3 run.py --models vllm_qwen2_5_0_5b_instruct vllm_qwen2_5_3b_instruct --datasets triviaqa_gen bbh_gen --max-num-worker 4
From the logs:

  1. Inference for vllm_qwen2_5_0_5b_instruct runs first, using all 4 GPUs for data-parallel inference. This part is normal.
  2. Then inference for vllm_qwen2_5_3b_instruct runs. I expected it to also use 4 GPUs for data parallelism, but in practice only 2 GPUs are used.
     Partial screenshots below: at first GPUs 5 and 6 are occupied, and later GPUs 4 and 7.
     [screenshots]
     The predictions folder shows the matching shards: while GPUs 5 and 6 are in use, the *_1.json and *_3.json files are generated; while GPUs 4 and 7 are in use, the *_0.json and *_2.json files are generated.
     [screenshots]

Question

Why does the script use only 2 GPUs while running inference for vllm_qwen2_5_3b_instruct? The expectation was 4-GPU data parallelism.
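For what it's worth, one thing to double-check (an assumption on my side, not a confirmed diagnosis): the number of tasks that can run concurrently is roughly the number of visible GPUs divided by the num_gpus each model's run_cfg reserves, so a 3B config that reserves 2 GPUs per task would keep only 2 of the 4 cards busy at a time. A hedged sketch of pinning both models to 1 GPU per task in a Python config (abbrs and paths are placeholders):

```python
# Hypothetical config sketch: reserve 1 GPU per task for both models so that
# 4 visible GPUs can host up to 4 data-parallel workers. All values are placeholders.
from opencompass.models import VLLMwithChatTemplate

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr=abbr,
        path=path,
        model_kwargs=dict(tensor_parallel_size=1),  # keep each vLLM engine on a single GPU
        max_out_len=4096,
        batch_size=16,
        run_cfg=dict(num_gpus=1),                   # 1 GPU per task -> up to 4 parallel tasks
    )
    for abbr, path in [
        ('qwen2.5-0.5b-instruct-vllm', 'Qwen/Qwen2.5-0.5B-Instruct'),
        ('qwen2.5-3b-instruct-vllm', 'Qwen/Qwen2.5-3B-Instruct'),
    ]
]
```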

@Shiquan0304

> [quoted @luhairong11's comment above]

Are you actually able to get multi-model runs working? Mine keeps failing: every time it reaches the second model it hangs and eventually reports a connection error.
![image](https://github.com/user-attachments/assets/e0b77131-976e-4c44-9b04-e2d576f4b643)
[2024-12-26 01:14:33,214] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-26 01:14:33,215] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
Processed prompts: 100%|███████████████████████████████████████████████| 4/4 [00:05<00:00, 1.47s/it, est. speed input: 40.86 toks/s, output: 470.37 toks/s]
Processed prompts: 100%|███████████████████████████████████████████████| 4/4 [00:03<00:00, 1.22it/s, est. speed input: 58.39 toks/s, output: 552.45 toks/s]
Processed prompts: 100%|███████████████████████████████████████████████| 4/4 [00:05<00:00, 1.47s/it, est. speed input: 56.83 toks/s, output: 449.85 toks/s]
Processed prompts: 100%|███████████████████████████████████████████████| 4/4 [00:05<00:00, 1.48s/it, est. speed input: 64.75 toks/s, output: 536.68 toks/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:20<00:00, 5.24s/it]
12/26 01:14:54 - OpenCompass - INFO - Task [qwen-7b-sft-vllm/demo_gsm8k_0,qwen-7b-sft-vllm/demo_math_0]
INFO 12-26 01:14:54 config.py:478] This model supports multiple tasks: {'score', 'generate', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 12-26 01:14:54 config.py:1216] Defaulting to use mp for distributed inference
INFO 12-26 01:14:54 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data3/train_result/llamafactory/qwen2.5-7b-instruct/full/sft/checkpoint-5500/', speculative_config=None, tokenizer='/data3/train_result/llamafactory/qwen2.5-7b-instruct/full/sft/checkpoint-5500/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data3/train_result/llamafactory/qwen2.5-7b-instruct/full/sft/checkpoint-5500/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
/home/qiushiquan/opencompass/opencompass/__init__.py:19: UserWarning: Starting from v0.4.0, all AMOTIC configuration files currently located in ./configs/datasets, ./configs/models, and ./configs/summarizers will be migrated to the opencompass/configs/ package. Please update your configuration file paths accordingly.
_warn_about_config_migration()
(VllmWorkerProcess pid=1531428) INFO 12-26 01:15:00 selector.py:120] Using Flash Attention backend.
(VllmWorkerProcess pid=1531428) INFO 12-26 01:15:00 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
[E1226 01:24:20.242697548 socket.cpp:1011] [c10d] The client socket has timed out after 600000ms while trying to connect to (127.0.0.1, 37915).
[W1226 01:24:20.243528552 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 127.0.0.1:37915 - retrying (try=0, timeout=600000ms, delay=38719ms): The client socket has timed out after 600000ms while trying to connect to (127.0.0.1, 37915).
Exception raised from throwTimeoutError at ../torch/csrc/distributed/c10d/socket.cpp:1013 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd14b36c446 in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0x15e04c6 (0x7fd13654d4c6 in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x6029d95 (0x7fd13af96d95 in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x6029f36 (0x7fd13af96f36 in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: + 0x602a3a4 (0x7fd13af973a4 in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: + 0x5fe8016 (0x7fd13af55016 in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::string, c10d::TCPStoreOptions const&) + 0x20c (0x7fd13af57f7c in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0xd9acdd (0x7fd14a93acdd in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x4cb474 (0x7fd14a06b474 in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x172df4 (0x56411107cdf4 in /data0/miniconda3/envs/opencompass/bin/python)
frame #10: _PyObject_MakeTpCall + 0x1f8 (0x564111043db8 in /data0/miniconda3/envs/opencompass/bin/python)
frame #11: + 0xeb5a7 (0x564110ff55a7 in /data0/miniconda3/envs/opencompass/bin/python)
frame #12: _PyObject_Call + 0x295 (0x56411104a495 in /data0/miniconda3/envs/opencompass/bin/python)
frame #13: + 0xb87fc (0x564110fc27fc in /data0/miniconda3/envs/opencompass/bin/python)
frame #14: + 0x153a21 (0x56411105da21 in /data0/miniconda3/envs/opencompass/bin/python)
frame #15: + 0x4c9ccb (0x7fd14a069ccb in /data0/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: _PyObject_MakeTpCall + 0x1f8 (0x564111043db8 in /data0/miniconda3/envs/opencompass/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x15b1 (0x5641110e3fb1 in /data0/miniconda3/envs/opencompass/bin/python)
frame #18: + 0x1871eb (0x5641110911eb in /data0/miniconda3/envs/opencompass/bin/python)
frame #19: + 0x105472 (0x56411100f472 in /data0/miniconda3/envs/opencompass/bin/python)
frame #20: + 0x18bb4c (0x564111095b4c in /data0/miniconda3/envs/opencompass/bin/python)

@luhairong11
Author

> [quoted @Shiquan0304's comment above, including the full log]

This looks like a network timeout or similar connectivity issue on your side. Which torch version are you using? You could also try rebuilding the environment from scratch.
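To answer the version question quickly, a generic environment check (nothing OpenCompass-specific; these attributes exist in both libraries):

```python
# Print the torch and vllm versions installed in the current environment.
import torch
import vllm

print('torch:', torch.__version__)
print('vllm:', vllm.__version__)
print('CUDA available:', torch.cuda.is_available())
```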
