Two threads send heartbeats: one in the Monitor and one in the DiagnosisAgent. #1413

Open
workingloong opened this issue Dec 29, 2024 · 0 comments
Labels: question (Further information is requested)

Comments

Collaborator

workingloong commented Dec 29, 2024

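The elastic training agent only needs to start one thread to send heartbeats. Both the diagnosis agent reporter and the training agent reporter below run their own loop that sends a heartbeat every 15 seconds, so we need to remove the duplicated code.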

    def send_heartbeat(self):
        try:
            ts = int(time.time())
            action = self._client.report_heart_beat(ts)
            self._agent_context.enqueue_diagnosis_action(action)
        except Exception as e:
            logger.warning(f"fail to report a heartbeat: {e}")

    def _periodically_report(self):
        logger.info("Start diagnosis agent reporter.")
        while True:
            self.send_heartbeat()
            time.sleep(15)

    def send_heartbeat(self):
        try:
            ts = int(time.time())
            action = self._master_client.report_heart_beat(ts)
            if action:
                pass
        except Exception:
            logger.warning("Fail to report a heartbeat.")

    def _periodically_report(self):
        logger.info("Start training agent reporter.")
        while True:
            if self._group_rank == 0:
                self.report_resource_with_step()
            self.send_heartbeat()
            time.sleep(15)
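
One possible way to consolidate the two loops above is to give a single object ownership of the heartbeat thread. The following is only a sketch under assumptions, not the project's implementation: `HeartbeatReporter`, its `on_action` callback, and the `start`/`stop` methods are hypothetical names introduced for illustration. The only call taken from the snippets above is `report_heart_beat(ts)` on the client.

```python
import logging
import threading
import time
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


class HeartbeatReporter:
    """Hypothetical helper that owns the single heartbeat thread."""

    def __init__(
        self,
        client: Any,  # assumed to expose report_heart_beat(ts), as in the snippets above
        on_action: Optional[Callable[[Any], None]] = None,
        interval: float = 15.0,
    ):
        self._client = client
        self._on_action = on_action
        self._interval = interval
        self._stop_event = threading.Event()
        self._lock = threading.Lock()
        self._thread: Optional[threading.Thread] = None

    def send_heartbeat(self) -> None:
        try:
            ts = int(time.time())
            action = self._client.report_heart_beat(ts)
            # Hand the returned action to whoever registered for it,
            # e.g. a function that enqueues diagnosis actions.
            if self._on_action is not None:
                self._on_action(action)
        except Exception as e:
            logger.warning(f"Fail to report a heartbeat: {e}")

    def start(self) -> bool:
        """Start the thread; return False if it is already running."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False
            self._stop_event.clear()
            self._thread = threading.Thread(
                target=self._run, name="heartbeat-reporter", daemon=True
            )
            self._thread.start()
            return True

    def stop(self, timeout: float = 5.0) -> None:
        """Ask the loop to exit and wait for the thread to finish."""
        self._stop_event.set()
        thread = self._thread
        if thread is not None:
            thread.join(timeout=timeout)

    def _run(self) -> None:
        while not self._stop_event.is_set():
            self.send_heartbeat()
            # Event.wait returns early when stop() is called,
            # unlike time.sleep(15).
            self._stop_event.wait(self._interval)
```

With something like this, the diagnosis agent could pass `self._agent_context.enqueue_diagnosis_action` as `on_action`, and the training agent reporter loop would keep only `report_resource_with_step()` and drop its own `send_heartbeat()` call. The guard in `start()` is there so a second caller cannot spawn a duplicate thread.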

workingloong added the question label on Dec 29, 2024