Background
Currently, we implement the DLRover master in Python. Compared with Golang, this has several disadvantages:
The k8s Python client has inherent limitations and lags behind the Go (and Java) clients. For instance, it lacks an informer implementation (see issue #1291, "Enhance/Replace k8s python client").
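To make the informer point concrete, here is a minimal sketch of what an event-driven Pod watch looks like with the Go client's shared informer; it assumes a reachable cluster via the local kubeconfig and is illustrative only, not DLRover's actual code.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (hypothetical setup for this sketch).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// A shared informer keeps a local cache of Pods and pushes add/update/delete
	// events to handlers -- the capability the Python client lacks out of the box.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("pod added:", pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // keep watching
}
```

With this pattern the master reacts to Pod state changes from a local cache instead of polling the API server.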
There are more Golang packages available for custom resource definitions (CRDs) such as volcano/PodGroup and kubeflow/TrainingJob. In Python, we can only represent these CRDs with a plain dictionary, which makes the code hard to read, as can be seen at dlrover/dlrover/python/scheduler/kubernetes.py, lines 407 to 462 (commit 5d070f4).
Due to the Global Interpreter Lock (GIL), Python's execution efficiency is lower than Golang's. With Golang, we can use goroutines to accelerate launching thousands of Pods. We will also replace gRPC with an HTTP service, as the latter is more compatible, and the Golang HTTP implementation is more efficient than Python's.
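The goroutine point can be sketched with the Go standard library alone: one goroutine per Pod lets slow API calls overlap instead of running serially. `launchAll` and `createPod` are hypothetical names; the callback stands in for a real Kubernetes API call.

```go
package main

import (
	"fmt"
	"sync"
)

// launchAll starts one goroutine per pod so that slow, blocking create calls
// overlap in time rather than running one after another. The createPod
// callback is a stand-in for the real Kubernetes API request.
func launchAll(podNames []string, createPod func(string) error) []error {
	var wg sync.WaitGroup
	errs := make([]error, len(podNames))
	for i, name := range podNames {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			errs[i] = createPod(name)
		}(i, name)
	}
	wg.Wait()
	return errs
}

func main() {
	pods := []string{"worker-0", "worker-1", "worker-2"}
	errs := launchAll(pods, func(name string) error {
		fmt.Println("created", name)
		return nil
	})
	failed := 0
	for _, e := range errs {
		if e != nil {
			failed++
		}
	}
	fmt.Println("failed:", failed)
}
```

Each launch gets its own slot in `errs`, so failures can be retried individually after the wave completes.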
The master will take on an increasing amount of data analysis work, such as fault diagnosis, where execution speed is crucial.
Design
In the first stage, the Golang master will only support the Pytorch allreduce job and incorporate elastic training service as well as elastic scheduling.
Elastic Training Service
Node Discovery: The node discovery mechanism gathers information about the live nodes on which dlrover-run has started, then assigns a node rank and the total number of nodes to each of them. It can also sort node ranks by the switch topology of the nodes to optimize communication traffic.
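A minimal sketch of the topology-aware rank assignment, under the assumption that each node reports the switch it is attached to (the `Node` struct and `assignRanks` function are illustrative, not the real implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical record for a live node reported by dlrover-run.
type Node struct {
	Name   string
	Switch string // ID of the network switch the node is attached to
	Rank   int
}

// assignRanks sorts nodes so that nodes under the same switch get adjacent
// ranks -- keeping ring/tree collectives mostly within one switch -- and
// then assigns node ranks in that order.
func assignRanks(nodes []Node) []Node {
	sort.Slice(nodes, func(i, j int) bool {
		if nodes[i].Switch != nodes[j].Switch {
			return nodes[i].Switch < nodes[j].Switch
		}
		return nodes[i].Name < nodes[j].Name
	})
	for i := range nodes {
		nodes[i].Rank = i
	}
	return nodes
}

func main() {
	nodes := []Node{
		{Name: "node-b", Switch: "sw-2"},
		{Name: "node-a", Switch: "sw-1"},
		{Name: "node-c", Switch: "sw-1"},
	}
	for _, n := range assignRanks(nodes) {
		fmt.Printf("%s on %s -> rank %d\n", n.Name, n.Switch, n.Rank)
	}
}
```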
Training Metrics Collector: This collector gathers the runtime metrics of each node. These metrics include CPU and GPU workload, TCP and RDMA traffic, and training profiling data obtained from the training frameworks.
Training Detector: The training detector checks whether any exceptions occur during training, such as hangs or breakdowns. If an exception does happen, it can identify the root cause of the issue, for example whether a node's GPU or network has broken down.
Elastic Scheduling
Auto Scaler: The auto scaler generates a scaling plan with the resource configuration of the nodes, including details like the number of nodes, GPU type, and so on. In case a node fails, the auto-scaler can create a new plan to remove the failed node and launch a new one within the cluster.
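A scaling plan that replaces a failed node might look like the following; `ScalePlan` and `planForFailure` are assumed names for illustration, not the real API.

```go
package main

import "fmt"

// ScalePlan is a hypothetical scaling plan: how many nodes of which GPU
// type to run, plus nodes to remove and replacements to create.
type ScalePlan struct {
	NodeCount int
	GPUType   string
	Remove    []string
	Create    []string
}

// planForFailure builds a plan that removes a failed node and launches a
// replacement, keeping the total node count unchanged.
func planForFailure(current ScalePlan, failed, replacement string) ScalePlan {
	current.Remove = append(current.Remove, failed)
	current.Create = append(current.Create, replacement)
	return current
}

func main() {
	plan := ScalePlan{NodeCount: 4, GPUType: "A100"}
	plan = planForFailure(plan, "worker-2", "worker-4")
	fmt.Printf("%+v\n", plan)
}
```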
Node Scheduler: The scheduler is responsible for launching or removing nodes in accordance with the scaling plan. Different schedulers can be implemented to support various CRDs. For example, we can implement an elastic scheduler that launches nodes one by one, similar to the DLRover Python scheduler at dlrover/dlrover/python/master/scaler/pod_scaler.py, lines 417 to 452 (commit 5d070f4).