
KEP4664: Guarantee PodDisruptionBudget When Preemption Happens #4665

Open · wants to merge 1 commit into base: master
Conversation

AxeZhan
Member

@AxeZhan AxeZhan commented May 25, 2024

  • One-line PR description: Guarantee PodDisruptionBudget When Preemption Happens.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 25, 2024
@@ -369,15 +369,11 @@ when selecting victims.
- if the priority of the preemptor is greater than or equal to the value of `AllowDisruptionByPriorityGreaterThanOrEqual` in victim pod,
the implementation will remain consistent with the existing behavior, meaning that the scheduler will try to select victims whose PDBs
are not violated by preemption, but if no such victims are found, preemption will still happen and lower priority pods will be preempted
via the `/evictions` endpoint despite their PDBs being violated. If the `/eviction` endpoint returns a response `429 Too Many Requests`,
the scheduler will fall back to deletion as an alternative, using client-go's delete method.
Member Author


This part is confusing to me. Are we trying to introduce the /evictions endpoint in this KEP?

IIRC, we don't use eviction in preemption right now. https://github.com/kubernetes/kubernetes/blob/4a668bcf143c0652882d2e59051dcf0cd843c24c/pkg/scheduler/util/utils.go#L137

Besides, I don't think we should use the /evictions endpoint here.
During preemption, we separate the victims into two groups: nonViolatingCandidates and violatingCandidates.

So we already know in advance which pods' deletion would violate their PDBs:

  1. Using eviction on nonViolatingCandidates would be equivalent to deleting them directly(?)
  2. Using eviction on violatingCandidates will always return 429. And since we will have already deleted all nonViolatingCandidates on the same node before trying the violatingCandidates, none of the violatingCandidates can be deleted, so preemption will always fail whenever deleting the nonViolatingCandidates alone is not enough.

Besides, if we use the /evictions endpoint for all pods in preemption, then what's the point of this KEP? We would deny any preemptor whenever a PDB is violated.
Also, using the /evictions endpoint for all pods would be a breaking change and is not backward compatible.
@alculquicondor @Huang-Wei
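The two cases above can be sketched with a toy Go model. This is not the real scheduler code: `pod`, `classifyVictims`, `evict`, and the fixed-size disruption budget are all invented for illustration, standing in for the dry run's nonViolatingCandidates/violatingCandidates split and the /eviction endpoint's PDB check.

```go
package main

import "fmt"

// pod is a stand-in for a victim pod selected during the preemption dry run.
type pod struct {
	name string
}

// classifyVictims models the dry run's split: the first `disruptionsAllowed`
// victims fit inside the PDB budget (nonViolatingCandidates); the rest would
// violate it (violatingCandidates).
func classifyVictims(victims []pod, disruptionsAllowed int) (nonViolating, violating []pod) {
	for i, p := range victims {
		if i < disruptionsAllowed {
			nonViolating = append(nonViolating, p)
		} else {
			violating = append(violating, p)
		}
	}
	return
}

// evict models the /eviction endpoint: it succeeds (200) while the PDB budget
// has room and returns 429 Too Many Requests once a further disruption would
// violate the PDB.
func evict(p pod, budget *int) (status int) {
	if *budget > 0 {
		*budget--
		return 200
	}
	return 429
}

func main() {
	victims := []pod{{"pod1"}, {"pod2"}, {"pod3"}}
	budget := 1 // the PDB allows exactly one more disruption

	nonViolating, violating := classifyVictims(victims, budget)
	fmt.Println("nonViolating:", len(nonViolating), "violating:", len(violating))

	// Evicting the non-violating victims consumes the budget...
	for _, p := range nonViolating {
		fmt.Println(p.name, "->", evict(p, &budget))
	}
	// ...so every violating victim then gets 429. If eviction were the only
	// mechanism, preemption would stall whenever the non-violating victims
	// alone don't free enough room.
	for _, p := range violating {
		fmt.Println(p.name, "->", evict(p, &budget))
	}
}
```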

Member


Please take a look at the original issue to understand the motivation behind this KEP.

Also take a look at this comment to understand the reasoning behind the eviction-first policy.

Member Author


Thanks! I read the original issue and comment you left in #3280

The motivation behind the KEP and my understanding from reading it seem to differ. I thought the KEP was introducing a new field to make it harder for specific pods to be preempted during scheduling. However, it appears that the original intention is for the scheduler to respect PDBs in every preemption case.

In this case, I agree that we should use the eviction API throughout preemption.

@AxeZhan
Member Author

AxeZhan commented May 30, 2024

ping @alculquicondor @Huang-Wei

@wojtek-t wojtek-t self-assigned this Jun 3, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AxeZhan
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign huang-wei for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 5, 2024
Member

@atiratree atiratree left a comment


/hold

@@ -369,15 +369,11 @@ when selecting victims.
- if the priority of the preemptor is greater than or equal to the value of `AllowDisruptionByPriorityGreaterThanOrEqual` in victim pod,
the implementation will remain consistent with the existing behavior, meaning that the scheduler will try to select victims whose PDBs
are not violated by preemption, but if no such victims are found, preemption will still happen and lower priority pods will be preempted
via the `/evictions` endpoint despite their PDBs being violated. If the `/eviction` endpoint returns a response `429 Too Many Requests`,
the scheduler will fall back to deletion as an alternative, using client-go's delete method.
Member


Please take a look at the original issue to understand the motivation behind this KEP.

Also take a look at this comment to understand the reasoning behind the eviction-first policy.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2024
@AxeZhan
Member Author

AxeZhan commented Jun 6, 2024

I still have a question that needs clarifying:

I think the design details in the original KEP conflict significantly with the current implementation of preemption:
In the current implementation, we dry-run the preemption to get a list of candidates (each candidate is a struct with a nodeName and a group of victim pods belonging to that node).

So how can we achieve this:

If it responds with 429 Too Many Requests, the scheduler will output an error log and choose another victim among the candidate victims to preempt until it succeeds or there are no more candidates.

Suppose the dry run shows our best candidate is nodeA with Pod1, Pod2, and Pod3.

So we begin processing the preemption:

  1. We evict Pod1 and get a 200.
  2. We try to evict Pod2 but receive a 429. What now? The original KEP says we should choose another victim, but in this case, even if we successfully evict Pod3, we can't free up enough space for the preemptor on nodeA. Then what? Can we even revert the eviction of Pod1? I don't think so.

I'm not clear on how we should handle this situation:

  1. The priority of the preemptor is less than the value of AllowDisruptionByPriorityGreaterThanOrEqual in the victims.
  2. During the dry run, deleting pod2 would not violate its PDB.
  3. Something happens, and now deleting pod2 would violate its PDB.
  4. Preemption begins, and we receive a 429 when trying to evict pod2; because of (1), we cannot delete pod2 either.
  5. Then what?
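The steps above can be sketched with a hypothetical Go model. Only the AllowDisruptionByPriorityGreaterThanOrEqual gate comes from the KEP draft; `victim`, `preemptOne`, and the hard-coded eviction statuses are invented here purely to illustrate the stuck state being asked about.

```go
package main

import "fmt"

// victim models one pod selected during the dry run, together with what the
// /eviction endpoint would return for it *now* (the PDB may have changed
// since the dry run) and its hypothetical priority gate from the KEP draft.
type victim struct {
	name              string
	evictionStatus    int   // simulated /eviction response: 200 or 429
	priorityThreshold int32 // AllowDisruptionByPriorityGreaterThanOrEqual
}

// preemptOne returns the outcome for a single victim: evict if possible,
// otherwise fall back to delete only when the preemptor's priority clears
// the victim's threshold.
func preemptOne(v victim, preemptorPriority int32) string {
	if v.evictionStatus == 200 {
		return "evicted"
	}
	// Eviction was denied with 429 because the PDB is now violated.
	if preemptorPriority >= v.priorityThreshold {
		return "deleted (fallback)"
	}
	// This is the open question in step 5: eviction is denied and the
	// priority gate forbids the delete fallback, yet earlier evictions
	// on the same node (e.g. pod1) cannot be reverted.
	return "stuck: eviction denied and delete not allowed"
}

func main() {
	preemptorPriority := int32(100)
	victims := []victim{
		{"pod1", 200, 0},   // evicted before anything goes wrong
		{"pod2", 429, 200}, // PDB now violated, threshold above preemptor
		{"pod3", 429, 50},  // PDB violated, but fallback is permitted
	}
	for _, v := range victims {
		fmt.Println(v.name, "->", preemptOne(v, preemptorPriority))
	}
}
```

The model makes the asymmetry concrete: pod3 can still be removed via the delete fallback, but pod2 blocks the whole candidate even though pod1 was already evicted.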

Member

@alculquicondor alculquicondor left a comment


Did you make sure that the PRR is up to date?

authors:
- "@denkensk"
- "AxeZhan"
Member


Suggested change
- "AxeZhan"
- "@AxeZhan"

Comment on lines +376 to +377
via the `/evictions` endpoint despite their PDBs being violated. If the `/eviction` endpoint returns a response `429 Too Many Requests`,
the scheduler will fallback to deletion as an alternative.
Member


I can't remember why this was acceptable. Why not use DELETE directly, given that there is a high chance that the /evictions endpoint will just fail?

Member


Please take a look at #4665 (comment). I think we should properly describe the reasoning in the KEP to make it clear for anyone reading it.

@alculquicondor
Member

I'm going on vacation today. Perhaps @Huang-Wei can take over?

For the most part, I don't have a problem with the KEP. Last time, however, the implementation was different from what the KEP presented, intending to cover more use cases, so it was rejected.
Please make sure that the design present in the KEP matches your expectations and use cases.

@AxeZhan
Member Author

AxeZhan commented Jun 13, 2024

The API definition lgtm. However, I have a lot of questions about the original implementation.

For example, this one:
#4665 (comment)

I don't think we can change the victim midway through preemption.

I think we still need to discuss the implementation. And as the Enhancements Freeze is approaching, I don't think we can land this in 1.31. We should have more discussion and try to introduce it in 1.32 if possible.
