
KEP4664: Guarantee PodDisruptionBudget When Preemption Happens #4665

Open · wants to merge 1 commit into base: master
Conversation

AxeZhan
Member

@AxeZhan AxeZhan commented May 25, 2024

  • One-line PR description: Guarantee PodDisruptionBudget When Preemption Happens.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 25, 2024
@@ -369,15 +369,11 @@ when selecting victims.
- if the priority of the preemptor is greater than or equal to the value of `AllowDisruptionByPriorityGreaterThanOrEqual` in victim pod,
the implementation will remain consistent with the existing behavior, meaning that the scheduler will try to select victims whose PDBs
are not violated by preemption, but if no such victims are found, preemption will still happen and lower priority pods will be preempted
via the `/evictions` endpoint despite their PDBs being violated. If the `/eviction` endpoint returns a response `429 Too Many Requests`,
the scheduler will fall back to deletion as an alternative, using client-go's delete method.
Member Author


This part is confusing to me. Are we trying to introduce the /evictions endpoint in this KEP?

IIRC, we don't use eviction in preemption right now. https://github.com/kubernetes/kubernetes/blob/4a668bcf143c0652882d2e59051dcf0cd843c24c/pkg/scheduler/util/utils.go#L137

Besides, I don't think we should use the /evictions endpoint here.
During preemption, we separate the victims into two groups: nonViolatingCandidates and violatingCandidates.

So we already know in advance which pods' deletion would violate their PDBs:

  1. Using eviction on nonViolatingCandidates would be equivalent to deleting them directly(?)
  2. Using eviction on violatingCandidates will always return 429. And since we will have already deleted all nonViolatingCandidates on the same node before trying the violatingCandidates, none of the violatingCandidates can be deleted, so preemption will always fail whenever deleting the nonViolatingCandidates alone is not enough.

Besides, if we use the /evictions endpoint for all pods in preemption, then what's the point of this KEP? We would deny any preemptor whenever a PDB is violated.
Also, using the /evictions endpoint for all pods would be a breaking change and is not backward compatible.
@alculquicondor @Huang-Wei
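The two cases above can be sketched with a toy Go model. This is not the real scheduler code: `pod`, `classifyVictims`, `evict`, and the fixed-size disruption budget are all invented for illustration, standing in for the dry run's nonViolatingCandidates/violatingCandidates split and the /eviction endpoint's PDB check.

```go
package main

import "fmt"

// pod is a stand-in for a victim pod selected during the preemption dry run.
type pod struct {
	name string
}

// classifyVictims models the dry run's split: the first `disruptionsAllowed`
// victims fit inside the PDB budget (nonViolatingCandidates); the rest would
// violate it (violatingCandidates).
func classifyVictims(victims []pod, disruptionsAllowed int) (nonViolating, violating []pod) {
	for i, p := range victims {
		if i < disruptionsAllowed {
			nonViolating = append(nonViolating, p)
		} else {
			violating = append(violating, p)
		}
	}
	return
}

// evict models the /eviction endpoint: it succeeds (200) while the PDB budget
// has room and returns 429 Too Many Requests once a further disruption would
// violate the PDB.
func evict(p pod, budget *int) (status int) {
	if *budget > 0 {
		*budget--
		return 200
	}
	return 429
}

func main() {
	victims := []pod{{"pod1"}, {"pod2"}, {"pod3"}}
	budget := 1 // the PDB allows exactly one more disruption

	nonViolating, violating := classifyVictims(victims, budget)
	fmt.Println("nonViolating:", len(nonViolating), "violating:", len(violating))

	// Evicting the non-violating victims consumes the budget...
	for _, p := range nonViolating {
		fmt.Println(p.name, "->", evict(p, &budget))
	}
	// ...so every violating victim then gets 429. If eviction were the only
	// mechanism, preemption would stall whenever the non-violating victims
	// alone don't free enough room.
	for _, p := range violating {
		fmt.Println(p.name, "->", evict(p, &budget))
	}
}
```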

Member


Please take a look at the original issue to understand the motivation behind this KEP.

Also take a look at this comment to understand the reasoning behind the eviction-first policy.

Member Author


Thanks! I read the original issue and comment you left in #3280

The motivation behind the KEP and my understanding from reading it seem to differ. I thought the KEP was introducing a new field to make it harder for specific pods to be preempted during scheduling. However, it appears that the original intention is for the scheduler to respect PDBs in every preemption case.

In this case, I agree that we should use the eviction API throughout preemption.

@AxeZhan
Member Author

AxeZhan commented May 30, 2024

ping @alculquicondor @Huang-Wei

@wojtek-t wojtek-t self-assigned this Jun 3, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AxeZhan
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign huang-wei for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 5, 2024
Member

@atiratree atiratree left a comment


/hold

@@ -369,15 +369,11 @@ when selecting victims.
- if the priority of the preemptor is greater than or equal to the value of `AllowDisruptionByPriorityGreaterThanOrEqual` in victim pod,
the implementation will remain consistent with the existing behavior, meaning that the scheduler will try to select victims whose PDBs
are not violated by preemption, but if no such victims are found, preemption will still happen and lower priority pods will be preempted
via the `/evictions` endpoint despite their PDBs being violated. If the `/eviction` endpoint returns a response `429 Too Many Requests`,
the scheduler will fall back to deletion as an alternative, using client-go's delete method.
Member


Please take a look at the original issue to understand the motivation behind this KEP.

Also take a look at this comment to understand the reasoning behind the eviction-first policy.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2024
@AxeZhan
Member Author

AxeZhan commented Jun 6, 2024

I still have a question that needs clarifying:

I think the design details in the original KEP conflict significantly with the current implementation of preemption:
In the current implementation, we dry-run the preemption to get a list of candidates (each candidate is a struct with a nodeName and a group of victim pods belonging to that node).

So how can we achieve this:

If it responds with 429 Too Many Requests, the scheduler will output an error log and choose another victim among the candidate victims to preempt until it succeeds or there are no more candidates.

Suppose the dry run shows our best candidate is nodeA with Pod1, Pod2, and Pod3.

So we begin processing the preemption:

  1. We evict Pod1 and get a 200.
  2. We try to evict Pod2 but receive a 429. What now? The original KEP says we should choose another victim, but in this case, even if we successfully evict Pod3, we can't free up enough space for the preemptor on nodeA. Then what? Can we even revert the eviction of Pod1? I don't think so.

I'm not clear on how we should handle this situation:

  1. The priority of the preemptor is less than the value of AllowDisruptionByPriorityGreaterThanOrEqual in the victims.
  2. During the dry run, deleting pod2 would not violate its PDB.
  3. Something happens, and now deleting pod2 would violate its PDB.
  4. Preemption begins, and we receive a 429 when trying to evict pod2; because of (1), we cannot delete pod2 either.
  5. Then what?
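The steps above can be sketched with a hypothetical Go model. Only the AllowDisruptionByPriorityGreaterThanOrEqual gate comes from the KEP draft; `victim`, `preemptOne`, and the hard-coded eviction statuses are invented here purely to illustrate the stuck state being asked about.

```go
package main

import "fmt"

// victim models one pod selected during the dry run, together with what the
// /eviction endpoint would return for it *now* (the PDB may have changed
// since the dry run) and its hypothetical priority gate from the KEP draft.
type victim struct {
	name              string
	evictionStatus    int   // simulated /eviction response: 200 or 429
	priorityThreshold int32 // AllowDisruptionByPriorityGreaterThanOrEqual
}

// preemptOne returns the outcome for a single victim: evict if possible,
// otherwise fall back to delete only when the preemptor's priority clears
// the victim's threshold.
func preemptOne(v victim, preemptorPriority int32) string {
	if v.evictionStatus == 200 {
		return "evicted"
	}
	// Eviction was denied with 429 because the PDB is now violated.
	if preemptorPriority >= v.priorityThreshold {
		return "deleted (fallback)"
	}
	// This is the open question in step 5: eviction is denied and the
	// priority gate forbids the delete fallback, yet earlier evictions
	// on the same node (e.g. pod1) cannot be reverted.
	return "stuck: eviction denied and delete not allowed"
}

func main() {
	preemptorPriority := int32(100)
	victims := []victim{
		{"pod1", 200, 0},   // evicted before anything goes wrong
		{"pod2", 429, 200}, // PDB now violated, threshold above preemptor
		{"pod3", 429, 50},  // PDB violated, but fallback is permitted
	}
	for _, v := range victims {
		fmt.Println(v.name, "->", preemptOne(v, preemptorPriority))
	}
}
```

The model makes the asymmetry concrete: pod3 can still be removed via the delete fallback, but pod2 blocks the whole candidate even though pod1 was already evicted.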

Member

@alculquicondor alculquicondor left a comment


Did you make sure that the PRR is up to date?

authors:
- "@denkensk"
- "AxeZhan"
Member


Suggested change
- "AxeZhan"
- "@AxeZhan"

Comment on lines +376 to +377
via the `/evictions` endpoint despite their PDBs being violated. If the `/eviction` endpoint returns a response `429 Too Many Requests`,
the scheduler will fallback to deletion as an alternative.
Member


I can't remember why this was acceptable. Why not use DELETE directly, given that there is a high chance that the /evictions endpoint will just fail?

Member


Please take a look at #4665 (comment). I think we should properly describe the reasoning in the KEP to make it clear for anyone reading it.

@alculquicondor
Member

I'm going on vacation today. Perhaps @Huang-Wei can take over?

For the most part, I don't have a problem with the KEP. Last time, however, the implementation was different from what the KEP presented, intending to cover more use cases, so it was rejected.
Please make sure that the design present in the KEP matches your expectations and use cases.

@AxeZhan
Member Author

AxeZhan commented Jun 13, 2024

The API definition lgtm. However, I have a lot of questions about the original implementation.

For example, this one:
#4665 (comment)

I don't think we can change the victim midway through preemption.

I think we still need to discuss the implementation. And as the Enhancements Freeze is approaching, I don't think we can land this in 1.31. We should have more discussion and try to introduce it in 1.32 if possible.
