[FedScale Core] Error handling when network package dropped #208

Open
continue-revolution opened this issue Feb 17, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@continue-revolution
Contributor

continue-revolution commented Feb 17, 2023

What happened + What you expected to happen

I've noticed some common problems when network packets are dropped in real deployment, and I have some proposals for handling them. I've discussed them with @fanlai0990, and I would like to hear from more contributors to figure out the best plan. @mosharaf @AmberLJC @ewenw @IKACE

  1. problem: the server->client UPDATE_MODEL packet is dropped, so the following server->client MODEL_TEST runs in error (stale model or no model)
    solution: ignore UPDATE_MODEL and send the model in the MODEL_TEST packet
  2. problem: the server->client CLIENT_TRAIN packet is dropped, and the server then sends DUMMY_EVENT to the client forever
    solution: keep the event in the queue until the client confirms that the event has completed
    pitfalls:
    • a multi-threaded executor may ping the same event more than once
    • UPDATE_MODEL has no confirmation, so there is no way to tell whether UPDATE_MODEL finished
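To make the second proposal concrete, here is a minimal sketch of a per-client queue that only drops an event once the client confirms it. It is not FedScale's actual code: `AckedEventQueue`, `push`, `peek`, `confirm`, and the integer event ids are all hypothetical names introduced for illustration.

```python
import threading
from collections import deque


class AckedEventQueue:
    """Hypothetical per-client queue: an event leaves only after it is confirmed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = deque()  # pending (event_id, name) pairs, oldest first
        self._next_id = 0

    def push(self, name):
        """Enqueue an event and return the id the client must confirm later."""
        with self._lock:
            event_id = self._next_id
            self._next_id += 1
            self._events.append((event_id, name))
            return event_id

    def peek(self):
        """Answer a client ping WITHOUT popping, so a dropped reply can be resent."""
        with self._lock:
            if self._events:
                return self._events[0]
            return (None, "DUMMY_EVENT")

    def confirm(self, event_id):
        """Pop the head only if the client confirms exactly that event."""
        with self._lock:
            if self._events and self._events[0][0] == event_id:
                self._events.popleft()
                return True
            return False  # duplicate or stale confirmation is ignored


queue = AckedEventQueue()
train_id = queue.push("CLIENT_TRAIN")
first_reply = queue.peek()    # imagine this reply is dropped on the network
second_reply = queue.peek()   # the next ping gets the same event again
queue.confirm(train_id)       # only now is CLIENT_TRAIN removed
```

Because the same event can be handed out more than once, the first pitfall still applies: under this sketch a multi-threaded executor would have to remember the event ids it has already started and skip repeats. The second pitfall also remains, since UPDATE_MODEL would need its own confirmation before it could be popped.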

Versions / Dependencies

fedscale-0.5
server: ubuntu 16
client: android 23

Reproduction script

Issue Severity

High: It blocks me from completing my task.

@continue-revolution continue-revolution added the bug Something isn't working label Feb 17, 2023
@fanlai0990
Member

In the future, we need to collect the piggyback information of each call before popping the event queue.

@IKACE
Contributor

IKACE commented Feb 21, 2023

Hi Chengsong,
Good observation on the packet drop!

  1. Sending the model in MODEL_TEST makes sense to me.
  2. Regarding the pitfall: if each executor is one client in a real deployment, then keeping the event in the queue should be okay? As for UPDATE_MODEL confirmation, isn't this line doing the confirmation?
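As a rough illustration of point 1, the sketch below shows a MODEL_TEST request that carries its own model, so the client never depends on an earlier UPDATE_MODEL having arrived. Everything here is an assumption made for the example: `TestRequest`, `build_test_request`, and `handle_model_test` are hypothetical names, and `pickle` stands in for whatever serialization FedScale really uses.

```python
import pickle
from dataclasses import dataclass


@dataclass
class TestRequest:
    """Hypothetical MODEL_TEST payload that carries its own model weights."""
    model_version: int
    model_bytes: bytes


def build_test_request(weights, version):
    """Server side: attach the current model to the test request."""
    return TestRequest(model_version=version, model_bytes=pickle.dumps(weights))


def handle_model_test(request, evaluate):
    """Client side: load the model from the request itself, then evaluate it."""
    weights = pickle.loads(request.model_bytes)
    return {"model_version": request.model_version, "result": evaluate(weights)}


# Toy usage: the "model" is a list of floats and evaluation is just a sum.
request = build_test_request([1.0, 2.0, 3.0], version=7)
report = handle_model_test(request, evaluate=sum)
```

Echoing `model_version` back in the result lets the server check which model was actually tested. The cost is a larger MODEL_TEST message, which may matter on mobile clients.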
