dlrover should adapt to new accelerator types and implement a script similar to Nvidia_gpu.py #1338
Comments
You can do this, but it is not recommended. dlrover's checks are oriented towards training, so they can only point to possible issues; they are not much help for troubleshooting.
OK, thank you. We also hope it can be used for health checks before training starts.
They want to implement a script like
Hello, I would like to ask how nvidia_gpu.py determines whether a node is bad or slow. Even after reading the code and documentation, this is still a bit vague to me. Is there more detailed documentation?
The agent reports the check status and elapsed time (dlrover/dlrover/python/elastic_agent/torch/training.py, lines 1294 to 1315 at b3bf606).
The master detects the straggler (dlrover/dlrover/python/master/elastic_training/rdzv_manager.py, lines 782 to 797 at b3bf606).
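As a rough illustration of how a master could flag stragglers from the elapsed times that agents report, here is a minimal sketch. It is not the code referenced above: the function name `detect_stragglers`, the median-based rule, and the default factor of 2.0 are all assumptions made for this example.

```python
from statistics import median
from typing import Dict


def detect_stragglers(
    node_times: Dict[int, float], threshold: float = 2.0
) -> Dict[int, float]:
    """Return the nodes whose elapsed time exceeds threshold * median.

    node_times maps a node id to the elapsed time (seconds) of its check.
    Both the median rule and the threshold are hypothetical choices.
    """
    if not node_times:
        return {}
    med = median(node_times.values())
    return {
        node_id: elapsed
        for node_id, elapsed in node_times.items()
        if elapsed > threshold * med
    }


if __name__ == "__main__":
    # Node 3 takes far longer than its peers, so it is reported as slow.
    print(detect_stragglers({0: 10.0, 1: 11.0, 2: 12.0, 3: 40.0}))
```

Comparing against the median rather than the mean keeps a single very slow node from shifting the baseline it is measured against.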
If you do not find the directory
@lulu-0126 The agent removes the check script's result file after reading the result from it. If the agent did not remove the file, it would treat the next check as passed.
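The read-then-remove behaviour described here can be sketched as follows. This is only an illustration under assumptions: the helper name `consume_check_result`, the file path, and the file format (a single elapsed time in seconds) are hypothetical and are not taken from dlrover.

```python
import os
from typing import Optional


def consume_check_result(path: str) -> Optional[float]:
    """Read the elapsed time from a result file, then delete the file.

    Returns None when no result file exists, so a stale result from an
    earlier check can never be mistaken for a fresh one.
    The single-float file format is an assumption for this sketch.
    """
    try:
        with open(path) as f:
            elapsed = float(f.read().strip())
    except FileNotFoundError:
        return None
    # Remove the file so the next check must produce its own result.
    os.remove(path)
    return elapsed
```

To keep a copy of the result for later inspection, one option is to copy the file to another location before the agent consumes it.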
Thank you. I'll look into how to save it in another location |
We want to use the Dlrover node_check feature to detect slow nodes and node failures on domestically produced accelerator cards.