Scheduler leader election #841

Open
sharnoff opened this issue Mar 2, 2024 · 0 comments
Labels
a/reliability (Area: relates to reliability of the service)
c/autoscaling/scheduler (Component: autoscaling: k8s scheduler)
t/feature (Issue type: feature, for new features or requests)


sharnoff commented Mar 2, 2024

Problem description / Motivation

Similar to #762, we only run a single instance of the scheduler at a time, which means we're vulnerable to extended outages if a node goes down. A "simple" way to fix this is via leader election.

Currently, running the scheduler with leader election is unsound and unlikely to work correctly.

Feature idea(s) / DoD

Scheduler supports leader election, for high availability in case of single node failure.

The scheduler should probably also have pod anti-affinity with itself (not sure if that's already provided by the ReplicaSet / Deployment).

Implementation ideas

In addition to the changes to the deployment YAML, we should also adapt the scheduler plugin in some way so that its state is discarded when it's no longer the leader. Not sure how much work this is, or how we can get that signal.
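
A minimal sketch of the "discard state on leadership loss" idea, in plain Go with no Kubernetes dependencies. Everything here is hypothetical: `pluginState`, `leaderGate`, and the `rebuild` function are illustration-only names, not the plugin's actual types. The callback names mirror the `OnStartedLeading` / `OnStoppedLeading` hooks in client-go's `leaderelection.LeaderCallbacks`, but whether the plugin can actually receive that signal is exactly the open question above.

```go
package main

import (
	"fmt"
	"sync"
)

// pluginState is a hypothetical stand-in for the scheduler plugin's in-memory
// view of the cluster. The real plugin's state is not described in this issue.
type pluginState struct {
	reservedCPU map[string]int // node name -> reserved CPU, illustrative only
}

// leaderGate owns the state and only exposes it while this instance leads.
type leaderGate struct {
	mu      sync.Mutex
	leading bool
	state   *pluginState
	rebuild func() *pluginState // assumed to re-list pods/VMs/nodes from scratch
}

func newLeaderGate(rebuild func() *pluginState) *leaderGate {
	return &leaderGate{rebuild: rebuild}
}

// OnStartedLeading rebuilds the state before marking this instance as leader,
// so nothing from an earlier term as leader is reused.
func (g *leaderGate) OnStartedLeading() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.state = g.rebuild()
	g.leading = true
}

// OnStoppedLeading drops the state so it cannot be used while stale.
func (g *leaderGate) OnStoppedLeading() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.leading = false
	g.state = nil
}

// WithState runs f against the state only while leading.
// It reports whether f was run.
func (g *leaderGate) WithState(f func(*pluginState)) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.leading {
		return false
	}
	f(g.state)
	return true
}

func main() {
	gate := newLeaderGate(func() *pluginState {
		// A real rebuild would re-list pods, VMs and nodes; here it starts empty.
		return &pluginState{reservedCPU: map[string]int{}}
	})

	gate.OnStartedLeading()
	gate.WithState(func(s *pluginState) { s.reservedCPU["node-a"] = 4 })

	gate.OnStoppedLeading()
	ok := gate.WithState(func(s *pluginState) {})
	fmt.Println("state usable after losing leadership:", ok)
}
```

The trade-off in this shape is that every regained leadership pays the full rebuild cost, which is the expense the next paragraph is concerned about.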

Alternatively, if the pod/VM/node listing on startup is too expensive, we can modify the plugin so that having decisions made without its input is actually sound (within reason).

We also need to adapt the autoscaler-agent to be able to handle multiple scheduler instances, or expose a connection to the current leader via a k8s Service, or something. Not sure if that's possible.
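
One possible shape for the agent side, sketched in plain Go. Kubernetes leader election typically records the current holder in a Lease object's `spec.holderIdentity`; the sketch assumes (without confirmation from this issue) that the agent can read that value and that it can be mapped to a known scheduler instance. `schedulerInstance`, `pickLeader`, the identities, and the addresses are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// schedulerInstance is a hypothetical description of one scheduler replica as
// seen by the autoscaler-agent.
type schedulerInstance struct {
	Identity string // assumed to equal the identity recorded by leader election
	Addr     string // address the agent would connect to
}

var errNoLeader = errors.New("no known scheduler instance holds leadership")

// pickLeader returns the instance whose identity matches the current holder.
// An empty holder, or one that matches no known instance, yields errNoLeader
// so the caller can retry later instead of talking to a non-leader.
func pickLeader(instances []schedulerInstance, holder string) (schedulerInstance, error) {
	if holder == "" {
		return schedulerInstance{}, errNoLeader
	}
	for _, inst := range instances {
		if inst.Identity == holder {
			return inst, nil
		}
	}
	return schedulerInstance{}, errNoLeader
}

func main() {
	instances := []schedulerInstance{
		{Identity: "scheduler-0", Addr: "10.0.0.1:10299"}, // made-up values
		{Identity: "scheduler-1", Addr: "10.0.0.2:10299"},
	}

	leader, err := pickLeader(instances, "scheduler-1")
	if err != nil {
		fmt.Println("no leader yet:", err)
		return
	}
	fmt.Println("agent should connect to", leader.Addr)
}
```

The agent would still need to handle the window where leadership changes between reading the holder and connecting, so this only covers the selection step.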
