KEP4664: Guarantee PodDisruptionBudget When Preemption Happens #4665
base: master
@@ -369,15 +369,11 @@ when selecting victims.
- if the priority of the preemptor is greater than or equal to the value of `AllowDisruptionByPriorityGreaterThanOrEqual` in victim pod,
the implementation will remain consistent with the existing behavior, meaning that the scheduler will try to select victims whose PDBs
are not violated by preemption, but if no such victims are found, preemption will still happen and lower priority pods will be preempted
via the `/evictions` endpoint despite their PDBs being violated. If the `/eviction` endpoint returns a response `429 Too Many Requests`
and the scheduler will fallback to deletion as an alternative.
using the client-go's delete method.
This part is confusing to me. Are we trying to introduce the `/evictions` endpoint in this KEP?
IIRC, we don't use eviction in preemption right now: https://github.com/kubernetes/kubernetes/blob/4a668bcf143c0652882d2e59051dcf0cd843c24c/pkg/scheduler/util/utils.go#L137
Besides, I don't think we should use the `/evictions` endpoint here.
During preemption, we separate victims into two groups, nonViolatingCandidates and violatingCandidates, so we already know in advance which pods' deletion would violate a PDB.
- Using eviction on nonViolatingCandidates would be equivalent to deleting them directly(?)
- Using eviction on violatingCandidates will always return 429. And since we have already deleted all nonViolatingCandidates on the same node before trying to delete violatingCandidates, none of the violatingCandidates can be deleted, so preemption will always fail when deleting all nonViolatingCandidates is not enough.
Besides, if we use the `/evictions` endpoint for all pods in preemption, then what's the point of this KEP? We would deny any preemptor whenever a PDB is violated.
Also, using the `/evictions` endpoint for all pods would be a breaking change and is not backward compatible.
@alculquicondor @Huang-Wei
Please take a look at the original issue to understand the drive behind this KEP.
Also take a look at this comment to understand the reason behind the eviction first policy.
Thanks! I read the original issue and the comment you left in #3280.
The motivation behind the KEP differs from how I understood it while reading. I thought the KEP was introducing a new field to make it harder for specific pods to be preempted during scheduling. However, it appears that the original intention is for the scheduler to respect PDBs in every preemption case.
In that case, I agree that we should use the eviction API throughout preemption.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: AxeZhan The full list of commands accepted by this bot can be found here.
/hold
keps/sig-scheduling/4664-guarantee-pdb-when-preemption-happens/README.md
I still have a question that needs clarifying: I think the design details in the original KEP conflict significantly with the current implementation of preemption. So how can we achieve this:
Suppose the dry run shows that our best candidate is nodeA with Pod1, Pod2, and Pod3. So we begin to process preemption:
I'm not clear on how we should handle this situation:
Did you make sure that the PRR is up to date?
authors:
- "@denkensk"
- "AxeZhan"
- "AxeZhan"
- "@AxeZhan"
via the `/evictions` endpoint despite their PDBs being violated. If the `/eviction` endpoint returns a response `429 Too Many Requests`,
the scheduler will fallback to deletion as an alternative.
I can't remember why this was acceptable. Why not use DELETE directly, given that there is a high chance that the `/evictions` endpoint will just fail?
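For readers following the thread, the evict-first behavior that the hunk describes can be sketched roughly as below. This is an illustration under assumptions, not the proposed implementation: `podRemover`, `preemptVictim`, `fakeClient`, and `errTooManyRequests` are hypothetical names, and the sentinel error stands in for a `429 Too Many Requests` response (with client-go one would more likely check the error with `apierrors.IsTooManyRequests`).

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManyRequests is a hypothetical sentinel standing in for the
// 429 Too Many Requests response of the eviction endpoint.
var errTooManyRequests = errors.New("429 Too Many Requests")

// podRemover is a hypothetical abstraction over the two API calls.
type podRemover interface {
	Evict(name string) error
	Delete(name string) error
}

// preemptVictim tries eviction first and falls back to a plain delete
// only when the eviction was rejected because of a PDB.
func preemptVictim(c podRemover, name string) (string, error) {
	err := c.Evict(name)
	if err == nil {
		return "evicted", nil
	}
	if errors.Is(err, errTooManyRequests) {
		if err := c.Delete(name); err != nil {
			return "", fmt.Errorf("delete %s: %w", name, err)
		}
		return "deleted", nil
	}
	return "", fmt.Errorf("evict %s: %w", name, err)
}

// fakeClient rejects eviction for pods listed in blocked and records deletes.
type fakeClient struct {
	blocked map[string]bool
	deleted []string
}

func (f *fakeClient) Evict(name string) error {
	if f.blocked[name] {
		return errTooManyRequests
	}
	return nil
}

func (f *fakeClient) Delete(name string) error {
	f.deleted = append(f.deleted, name)
	return nil
}

func main() {
	c := &fakeClient{blocked: map[string]bool{"pod3": true}}
	for _, name := range []string{"pod2", "pod3"} {
		how, err := preemptVictim(c, name)
		fmt.Println(name, how, err)
	}
}
```

In this sketch the delete path is taken only for victims whose eviction was rejected; victims whose budget still allows a disruption go through the eviction path.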
Please take a look at #4665 (comment). I think we should properly describe the reasoning in the KEP to make it clear for anyone reading it.
I'm going on vacation today. Perhaps @Huang-Wei can take over? For the most part, I don't have a problem with the KEP. Last time, however, the implementation was different from what the KEP presented, intending to cover more use cases, so it was rejected.
The API definition lgtm. However, I have a lot of questions about the original implementation. For example, this one: I don't think we can change the victims midway through preemption. I think we still need to discuss the implementation. And since the Enhancements Freeze is approaching, I don't think we can bring this into 1.31. We should have more discussion and try to introduce it in 1.32 if possible.
This is the first PR for #4664 (Guarantee PodDisruptionBudget When Preemption Happens). Note that #4664 only takes over #3280, which has the same title. That's why this PR only contains a few updates to the design details (most of it has already been reviewed and approved).