Arch
Parameter server nodes are grouped into a server group and one or several worker groups.
A server node in the server group maintains a partition of the globally shared parameters. Server nodes communicate with each other to replicate and/or to migrate parameters for reliability and scaling.
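As an illustration of what "maintains a partition" can mean, here is a minimal sketch that assumes integer parameter keys split into contiguous ranges, one range per server. The partitioning scheme and the names `make_ranges` and `server_for_key` are hypothetical and chosen for this example only; the text above does not say how keys are actually assigned to servers.

```python
from bisect import bisect_right
from typing import List


def make_ranges(num_keys: int, num_servers: int) -> List[int]:
    """Return the first key owned by each server (assumed contiguous ranges)."""
    base, extra = divmod(num_keys, num_servers)
    starts = []
    start = 0
    for i in range(num_servers):
        starts.append(start)
        # The first `extra` servers take one additional key each.
        start += base + (1 if i < extra else 0)
    return starts


def server_for_key(key: int, starts: List[int]) -> int:
    """Find the index of the server whose range contains `key`."""
    return bisect_right(starts, key) - 1


# Ten keys over three servers: ranges start at keys 0, 4 and 7.
starts = make_ranges(10, 3)
owner_of_key_5 = server_for_key(5, starts)  # server index 1
```

Contiguous ranges keep the lookup to a single binary search. Migrating parameters for scaling would then amount to moving a range boundary, although that part is not shown here.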
Each worker group runs an application. A worker typically stores locally a portion of the training data to compute local statistics such as gradients. Workers communicate only with the server nodes (not among themselves), updating and retrieving the shared parameters via push and pull.
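The push and pull interaction described above can be sketched in a single process as follows. This is an assumption-laden toy, not the project's API: `ServerNode`, `Worker`, the learning rate, and the one-parameter least-squares model are all hypothetical, and a real deployment would send these calls over the network.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ServerNode:
    """Holds shared parameters; workers reach it only through push and pull."""

    def __init__(self, lr: float = 0.1) -> None:
        self.lr = lr  # assumed fixed step size
        self.params: Dict[int, float] = defaultdict(float)

    def push(self, grads: Dict[int, float]) -> None:
        # Apply a gradient step for every key the worker sent.
        for key, grad in grads.items():
            self.params[key] -= self.lr * grad

    def pull(self, keys: List[int]) -> Dict[int, float]:
        return {key: self.params[key] for key in keys}


class Worker:
    """Keeps a local data shard and never talks to other workers."""

    def __init__(self, server: ServerNode, data: List[Tuple[float, float]]) -> None:
        self.server = server
        self.data = data

    def step(self) -> None:
        w = self.server.pull([0])[0]  # retrieve the shared weight
        # Mean gradient of (w * x - y) ** 2 over the local shard.
        grad = sum(2 * (w * x - y) * x for x, y in self.data) / len(self.data)
        self.server.push({0: grad})  # send the local statistic


def train(rounds: int = 20) -> float:
    server = ServerNode()
    workers = [Worker(server, [(1.0, 2.0)]), Worker(server, [(2.0, 4.0)])]
    for _ in range(rounds):
        for worker in workers:
            worker.step()
    return server.pull([0])[0]
```

Both shards are drawn from y = 2x, so calling `train()` drives the shared weight toward 2 even though neither worker sees the other's data.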
There is a scheduler node for each worker group. It assigns tasks to workers and monitors their progress. If workers are added or removed, it reschedules unfinished tasks.
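The scheduler's behaviour when workers join or leave can be sketched as below. The `Scheduler` class, its method names, and the policy of one task per worker are assumptions made for illustration; the text only states that unfinished tasks are rescheduled.

```python
from collections import deque
from typing import Dict, Iterable


class Scheduler:
    """Assigns tasks to workers and requeues tasks of removed workers."""

    def __init__(self, tasks: Iterable[str], workers: Iterable[str]) -> None:
        self.pending = deque(tasks)
        self.workers = set(workers)
        self.running: Dict[str, str] = {}  # task -> worker
        self.done = set()

    def assign(self) -> Dict[str, str]:
        """Give each idle worker one pending task; return the new assignments."""
        busy = set(self.running.values())
        assigned = {}
        for worker in sorted(self.workers - busy):
            if not self.pending:
                break
            task = self.pending.popleft()
            self.running[task] = worker
            assigned[task] = worker
        return assigned

    def report_done(self, task: str) -> None:
        """Progress monitoring: a worker reports that a task has finished."""
        self.running.pop(task)
        self.done.add(task)

    def add_worker(self, worker: str) -> None:
        self.workers.add(worker)

    def remove_worker(self, worker: str) -> None:
        """Drop a worker and put its unfinished tasks back at the front."""
        self.workers.discard(worker)
        lost = [t for t, owner in self.running.items() if owner == worker]
        for task in lost:
            del self.running[task]
            self.pending.appendleft(task)
```

For example, with tasks `t0`, `t1`, `t2` and workers `w0`, `w1`, removing `w1` while it holds `t1` returns `t1` to the queue, and the next call to `assign()` hands it to whichever worker is idle.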