DLRover should adapt to a new accelerator type and implement a script similar to nvidia_gpu.py #1338

Open
lulu-0126 opened this issue Nov 15, 2024 · 9 comments
Labels
question Further information is requested

Comments

@lulu-0126

We want to use DLRover's node_check feature to detect slow and faulty nodes on domestically produced accelerator cards.

@workingloong
Collaborator

  1. Use nvidia_gpu.py as a reference to implement the corresponding check script: https://github.com/intelligent-machine-learning/dlrover/blob/master/dlrover/trainer/torch/node_check/nvidia_gpu.py
  2. Add an accelerator type here and point it to your new script: https://github.com/intelligent-machine-learning/dlrover/blob/master/dlrover/python/elastic_agent/torch/training.py#L1500-L1503
  3. Add the corresponding accelerator in elastic_run.py: https://github.com/intelligent-machine-learning/dlrover/blob/master/dlrover/trainer/torch/elastic_run.py#L178-L183
  4. Pass --accelerator when launching with dlrover-run, for example dlrover-run --accelerator=xxx
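
Steps 2 to 4 amount to registering a new accelerator name and routing it to a check module. The following is a minimal sketch of that wiring, written independently of DLRover's actual code. The accelerator names `nvidia-gpu` and `my-accelerator`, the module `my_pkg.node_check.my_accelerator`, and the helper names are all placeholders; only the `nvidia_gpu` module path comes from the link in step 1.

```python
import argparse

# Module path taken from the nvidia_gpu.py link in step 1.
NVIDIA_CHECK_MODULE = "dlrover.trainer.torch.node_check.nvidia_gpu"

# Hypothetical registry: accelerator name -> check module.
# The keys and the "my_pkg" module are placeholders, not DLRover identifiers.
CHECK_MODULES = {
    "nvidia-gpu": NVIDIA_CHECK_MODULE,
    "my-accelerator": "my_pkg.node_check.my_accelerator",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that only accepts registered accelerator names."""
    parser = argparse.ArgumentParser(prog="launcher-sketch")
    parser.add_argument(
        "--accelerator",
        choices=sorted(CHECK_MODULES),
        default="nvidia-gpu",
        help="Accelerator type used to pick the node-check script.",
    )
    return parser


def resolve_check_module(accelerator: str) -> str:
    """Return the check module registered for the given accelerator."""
    if accelerator not in CHECK_MODULES:
        raise ValueError(f"Unsupported accelerator: {accelerator!r}")
    return CHECK_MODULES[accelerator]


# Simulate a launch with --accelerator=my-accelerator.
args = build_parser().parse_args(["--accelerator=my-accelerator"])
selected_module = resolve_check_module(args.accelerator)
```

Keeping the registry in one place means the command-line choices and the script lookup cannot drift apart; in DLRover itself the two live in the separate files linked in steps 2 and 3.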

@BalaBalaYi
Collaborator

You can do this, but it is not recommended. DLRover's checks are oriented towards training: they can flag possible issues, but they are not helpful for troubleshooting.

@BalaBalaYi added the question label Nov 18, 2024

@lulu-0126
Author

OK, thank you. We also hope to use it for health checks before training starts.

@workingloong
Collaborator

> You can do this, but it is not recommended. DLRover's checks are oriented towards training: they can flag possible issues, but they are not helpful for troubleshooting.

They want to implement a script like node_check/nvidia_gpu.py to check node health before starting training.

@lulu-0126
Author

Hello, how does nvidia_gpu.py determine whether a node is faulty or slow? Even after reading the code and the documentation, this is still unclear to me. Is there more detailed documentation?

@workingloong
Collaborator

workingloong commented Nov 21, 2024

> Hello, how does nvidia_gpu.py determine whether a node is faulty or slow? Even after reading the code and the documentation, this is still unclear to me. Is there more detailed documentation?

The agent reports the status and elapsed time of each check, then asks for the current fault nodes and stragglers:

# Excerpt from the agent's check loop; `i` is the round index.
result, elapsed_time = self._run_node_check()
elapsed_time = round(elapsed_time, 3)
logger.info(
    f"Network check time of round {i} is {elapsed_time}"
    f" and succeed is {result}."
)
status = (
    NodeEventType.NODE_CHECK_SUCCEEDED
    if result
    else NodeEventType.NODE_CHECK_FAILED
)
# Report this round's status and duration.
self._client.report_network_check_status(
    self._node_rank,
    status,
    elapsed_time,
)
success = success or result
# Query which nodes are considered faulty or slow.
fault_nodes = self._client.check_fault_node()
stragglers = self._client.check_straggler()
logger.info(
    f"Fault nodes are: {fault_nodes} "
    f" and stragglers are: {stragglers}."
)
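
To make the control flow of the excerpt easier to follow, here is a self-contained sketch of the same pattern: time a check, report the outcome, and accumulate success across rounds. The function `run_check_rounds`, its callback parameters, and the default of two rounds are assumptions made for illustration; the excerpt does not show how many rounds the agent actually runs.

```python
import time
from typing import Callable, List, Tuple


def run_check_rounds(
    check: Callable[[], bool],
    report: Callable[[bool, float], None],
    rounds: int = 2,  # assumed value, not taken from DLRover
) -> bool:
    """Run the check several times and report each outcome with its duration."""
    success = False
    for _ in range(rounds):
        start = time.perf_counter()
        result = check()
        elapsed = round(time.perf_counter() - start, 3)
        report(result, elapsed)
        # Mirrors `success = success or result`: one passing round is enough.
        success = success or result
    return success


# Stand-ins for the real check and the real reporting client.
reports: List[Tuple[bool, float]] = []
outcomes = iter([False, True])  # fail the first round, pass the second
passed = run_check_rounds(
    check=lambda: next(outcomes),
    report=lambda ok, elapsed: reports.append((ok, elapsed)),
)
```

In this sketch a node that fails the first round but passes the second still counts as healthy overall, while both rounds are reported individually.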

The master detects stragglers: a node is flagged when its elapsed time is more than twice the median across all nodes.

def _detect_stragglers(self):
    """Detect whether there is the straggler in the job."""
    stragglers: Dict[int, float] = {}
    times = sorted(list(self._node_times.values()))
    if not times:
        return stragglers
    if len(times) % 2 == 0:
        i = len(times) // 2
        med_time = (times[i] + times[i - 1]) / 2
    else:
        i = len(times) // 2
        med_time = times[i]
    for node_id, t in self._node_times.items():
        if t > med_time * 2:
            stragglers[node_id] = t
    return stragglers
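
As a worked example of the rule in `_detect_stragglers`, the standalone sketch below applies the same "more than twice the median" test to made-up timings. The free function `detect_stragglers` and its `factor` parameter are introduced here for illustration, and `statistics.median` replaces the hand-written median, which it matches for both even and odd counts.

```python
from statistics import median
from typing import Dict


def detect_stragglers(
    node_times: Dict[int, float], factor: float = 2.0
) -> Dict[int, float]:
    """Return the nodes whose time exceeds `factor` times the median time."""
    if not node_times:
        return {}
    med_time = median(node_times.values())
    return {
        node_id: t for node_id, t in node_times.items() if t > med_time * factor
    }


# Made-up timings in seconds for four nodes.
example = detect_stragglers({0: 10.0, 1: 11.0, 2: 12.0, 3: 30.0})
# median = (11.0 + 12.0) / 2 = 11.5, threshold = 23.0, so only node 3 is flagged.

# With exactly two nodes the slower one is never flagged: its time would have
# to exceed the sum of both times, which is impossible for non-negative values.
two_nodes = detect_stragglers({0: 1.0, 1: 100.0})
```

The two-node case is worth keeping in mind when testing a new check script on a small cluster, since the median rule needs at least three nodes before it can single one out.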

@workingloong
Collaborator

If you cannot find the directory /tmp/dlrover/node_check, check whether your check script uses the record_execution_time decorator.
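
One plausible reading of this advice is that the result directory only appears once a decorated check has run. The sketch below illustrates that general pattern. It is not DLRover's `record_execution_time`: the name `record_elapsed_time`, its parameters, and the file naming are assumptions, and a temporary directory stands in for /tmp/dlrover/node_check.

```python
import functools
import os
import tempfile
import time
from typing import Any, Callable


def record_elapsed_time(result_dir: str, rank: int = 0) -> Callable:
    """Decorator factory: time the wrapped check and write the duration to a file."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            # The directory is created here, so it exists only after a
            # decorated check has actually run.
            os.makedirs(result_dir, exist_ok=True)
            with open(os.path.join(result_dir, f"{rank}.txt"), "w") as f:
                f.write(f"{elapsed:.3f}")
            return result

        return wrapper

    return decorator


# Hypothetical usage with a throwaway directory.
demo_dir = os.path.join(tempfile.mkdtemp(), "node_check")


@record_elapsed_time(demo_dir)
def dummy_check() -> bool:
    return True


dummy_check()
```

If a check function in a new accelerator script is not wrapped in this way, nothing writes the timing file, which would explain a missing directory.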

@workingloong
Collaborator

@lulu-0126 The agent removes the check script's result file after reading the result from it. If the file were left in place, the agent would conclude that the next check had passed.
https://github.com/intelligent-machine-learning/dlrover/blob/c639de186e7e81193d16ebb2aba8dc0049c23d9f/dlrover/python/elastic_agent/torch/training.py#L1387-L1399C1
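
The read-then-delete behaviour described here can be sketched as follows. This is an illustration of the idea, not the code at the link: the function name `consume_result`, the file name, and the file contents are assumptions.

```python
import os
import tempfile
from typing import Optional


def consume_result(path: str) -> Optional[str]:
    """Read the result file once, then delete it so it cannot be reused."""
    if not os.path.exists(path):
        return None  # no fresh result: the check did not write one
    with open(path) as f:
        content = f.read().strip()
    os.remove(path)
    return content


# Simulate a check script that wrote its result before the agent reads it.
result_path = os.path.join(tempfile.mkdtemp(), "result.txt")
with open(result_path, "w") as f:
    f.write("1.234")

first = consume_result(result_path)   # reads the fresh result and deletes the file
second = consume_result(result_path)  # nothing left to read
```

Deleting the file after each read keeps a stale result from an earlier round from being mistaken for the outcome of the current one. To keep the results for later inspection, copy the file elsewhere before the agent consumes it.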

@lulu-0126
Author

Thank you. I'll look into how to save it in another location.

@mingcheng changed the title on Dec 16, 2024, appending an English translation to the original Chinese title.