Reminder
System Info
[2024-12-24 14:39:49,908] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
llamafactory version: 0.9.2.dev0
Reproduction
NPU_VISIBLE_DEVICES="0,1,2,3,5,7" deepspeed --num_gpus 6 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --model_name_or_path /home/yunwei/LLaMA-Factory/Qwen2.5-1.5B-Instruct \
    --do_train \
    --dataset_dir /home/yunwei/LLaMA-Factory/sft_data \
    --dataset "aisp_dm_llm_dialogue_rectification" \
    --template qwen \
    --finetuning_type full \
    --output_dir saves/qwen2.5-1.5b/test/ \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 5000 \
    --learning_rate 1e-4 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --bf16
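The above is my training command.
Problem description: when the number of NPUs is set to 8, the run hangs indefinitely at the following step and never progresses:
Converting format of dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4535/4535 [00:00<00:00, 5891.58 examples/s]
When I cancel the run, the following is printed on exit:
Traceback (most recent call last):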
File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1209, in wait
return self._wait(timeout=timeout)
File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1959, in _wait
(pid, sts) = self._try_wait(0)
File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1917, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/python3.10.13/bin/deepspeed", line 6, in <module>
main()
File "/usr/local/python3.10.13/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 584, in main
result.wait()
File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1222, in wait
self._wait(timeout=sigint_timeout)
File "/usr/local/python3.10.13/lib/python3.10/subprocess.py", line 1953, in _wait
time.sleep(delay)
KeyboardInterrupt
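Questions:
1. npu-smi info reports a health status of warning for cards 4 and 6. Will this affect training?
2. Does NPU_VISIBLE_DEVICES take effect? Even though I skipped cards 4 and 6, the log still shows WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5]}, and the run still hangs at Converting format of dataset.
3. I am running on an Ascend machine, but while loading, DeepSpeed prints [2024-12-24 14:35:17,414] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5. In addition, npu-smi info inside the container shows that the processes have not been loaded onto the NPUs. What needs to be changed to use DeepSpeed on NPUs?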
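To help narrow down questions 2 and 3, here is a minimal diagnostic sketch that could be run at the start of each launched process to show which device-visibility variables that process actually sees. It is not part of LLaMA-Factory or DeepSpeed. NPU_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES are the names that appear in this report. ASCEND_RT_VISIBLE_DEVICES is included on the assumption that it is the variable the Ascend runtime reads, and LOCAL_RANK is assumed to be set by the launcher. The helper names are hypothetical.

```python
import os

# Names to inspect. The first and last appear in this report; the middle one is
# an assumption about which variable the Ascend runtime honours.
VISIBILITY_VARS = (
    "NPU_VISIBLE_DEVICES",
    "ASCEND_RT_VISIBLE_DEVICES",
    "CUDA_VISIBLE_DEVICES",
)


def parse_device_list(value):
    """Turn a string such as "0,1,2,3,5,7" into a list of ints; unset or empty gives []."""
    if not value:
        return []
    return [int(part) for part in value.split(",") if part.strip()]


def visibility_report(env=None):
    """Map each visibility variable to the device IDs it lists in the given environment."""
    env = os.environ if env is None else env
    return {name: parse_device_list(env.get(name)) for name in VISIBILITY_VARS}


if __name__ == "__main__":
    # LOCAL_RANK is assumed to be exported by the launcher for each worker process.
    rank = os.environ.get("LOCAL_RANK", "?")
    for name, devices in visibility_report().items():
        print(f"[local_rank={rank}] {name} = {devices if devices else 'unset'}")
```

If the output for a worker lists only CUDA_VISIBLE_DEVICES, that would be consistent with the launcher log line quoted in question 3, although it does not by itself explain the hang.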
Expected behavior
No response
Others
No response