Add requiredDuringSchedulingRequiredDuringExecution to ClusterResourcePlacement affinity #715
Comments
@ryanzhang-oss I'm starting to dig into this, so if you could please assign me to this issue I'd appreciate it!
@nojnhuh Even k8s does not support requiredDuringSchedulingRequiredDuringExecution, so I wonder why we want to support it. Also, what does "requiredDuringSchedulingRequiredDuringExecution" mean semantically?
This would mean the same thing as the placeholder definition of the same field for placing a Pod on a Node, but applied to scheduling workloads onto clusters: https://github.com/kubernetes/kubernetes/blob/634fc1b4836b3a500e0d715d71633ff67690526a/staging/src/k8s.io/api/core/v1/types.go#L3449-L3456
This would help with the use case I outlined above, where conditions on a member cluster change such that it's no longer suitable to run certain workloads. Fleet could then reschedule affected workloads without relying on a change to the ClusterResourcePlacement to trigger the reschedule.
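For concreteness, here is a minimal sketch of how such a field might sit next to the existing ignored-during-execution field on a cluster affinity type, mirroring the placeholder in the upstream NodeAffinity type linked above. The package, type, and field names here are assumptions for illustration and may not match Fleet's actual API.

```go
package v1beta1 // illustrative only; not Fleet's actual API package

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterSelector is assumed to be a list of label-based terms that are ORed.
type ClusterSelector struct {
	ClusterSelectorTerms []ClusterSelectorTerm `json:"clusterSelectorTerms"`
}

// ClusterSelectorTerm selects member clusters by their labels.
type ClusterSelectorTerm struct {
	LabelSelector metav1.LabelSelector `json:"labelSelector"`
}

// ClusterAffinity shows where the proposed field could live.
type ClusterAffinity struct {
	// Existing semantics: evaluated only at scheduling time; later changes
	// to a member cluster do not move workloads that are already placed.
	RequiredDuringSchedulingIgnoredDuringExecution *ClusterSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`

	// Proposed semantics: if the requirements stop being met while a
	// workload is already placed (e.g. a member cluster's labels change),
	// the scheduler may evict and reschedule the placement.
	RequiredDuringSchedulingRequiredDuringExecution *ClusterSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
}
```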
Thanks @nojnhuh. Just to clarify, there are two cases. Scheduling a workload to a cluster where GPUs have newly become available is actually a requiredDuringScheduling case, since the workload is simply not scheduled if there is no GPU cluster available; the workload will be scheduled to a cluster automatically when we detect that GPUs were added to it. This is already supported in fleet today.

On the flip side, when a workload is already running in a cluster, we don't evict it unless the cluster is deleted, which is the same behavior as k8s. I think there is a reason why k8s never implemented that feature: continuously trying to reschedule all workloads would add a huge load on our scheduler, which is the performance bottleneck. Since we haven't received any feature request from our customers to support this, we don't think the benefit outweighs the large performance hit. We can revisit this if strong use cases come from customers, and even then I suspect we would need to scope down the semantics to preserve performance.
There is an active KEP right now in upstream k8s to solve for this, and the intent to solve for it is longstanding. Additionally, the widely used descheduler project implements this as well, for folks who have needed this functionality prior to its landing in k/k.
The above is a true statement: we wouldn't want to continuously reschedule. Rather, we would want to continuously determine "do I need to reschedule?", which would look something like (1) ensuring that … I would like to be both a customer and an implementer of this in fleet, so it makes sense to me to keep the issue open as a reference for the resultant PR.
Thanks, Jack. I am keeping this issue open. However, I don't think there is a way to determine "do I need to reschedule?" without actually scheduling it, and just continuously "determining" is already a huge cost. IMO, the right way to solve this problem is to deploy a descheduler rather than doing it within the scheduler, and we are planning for a descheduler already. In any case, we would like to see a design first before moving forward with any code change.
This is a "descheduler" to me |
Thx for re-opening!
This is the way:
The multi-cluster actor does not need to actually schedule anything in order to determine whether a workload needs to be rescheduled. It simply needs to be aware of the delta between its desired goal state (this workload is operational on cluster XYZ) and the actual state (this workload is stuck Pending on cluster XYZ). When such a delta is observed, the entire E2E multi-cluster scheduling operation kicks in, with the new nuance that cluster XYZ is no longer considered as a target cluster for scheduling (we already know the workload doesn't run there).
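As a rough illustration of that delta check (not Fleet's actual code; every name below is made up), the decision is just a comparison of observed placement status against the desired "operational" state, with failing clusters removed from the candidate set before the normal scheduling pass re-runs:

```go
// A minimal, self-contained sketch of the "do I need to reschedule?" check
// described above. The types are illustrative only.
package main

import "fmt"

type placementStatus struct {
	cluster     string
	operational bool // false if the workload is stuck Pending, evicted, etc.
}

// clustersToReschedule returns the clusters whose placements are not in the
// desired (operational) state; these should be excluded as targets when the
// full scheduling pass is re-run.
func clustersToReschedule(observed []placementStatus) []string {
	var failing []string
	for _, p := range observed {
		if !p.operational {
			failing = append(failing, p.cluster)
		}
	}
	return failing
}

func main() {
	observed := []placementStatus{
		{cluster: "member-1", operational: true},
		{cluster: "member-2", operational: false}, // e.g. GPU nodes were removed
	}
	if failing := clustersToReschedule(observed); len(failing) > 0 {
		// In a real controller this would trigger the end-to-end scheduling
		// operation with the failing clusters removed from the candidate set.
		fmt.Println("reschedule needed, excluding:", failing)
	}
}
```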
So IIRC, … I wonder how you solve the second part? In addition, the second part is actually already covered by the advanced rollout feature, as we will provide options for customers when we detect that the placed resources are not in the goal state. Currently, we don't plan to offer a "reschedule" option, but that's not hard to add.
The current KEP-4329 has not been approved yet. I have the same question listed in https://github.com/kubernetes/enhancements/pull/4329/files#r1478023120: I'm not sure what the benefit is of adding this to node affinity instead of using the descheduler, and where the boundary between the two lies. Perhaps we can hold until the SIG comes to a conclusion?
Cool, are there PRs implementing "advanced rollout"?
https://github.com/Azure/fleet/pull/689/files is the one where we support checking the availability of native resources. More PRs are coming.
In ClusterResourcePlacement's affinity definitions, adding requiredDuringSchedulingRequiredDuringExecution would enable the scheduler to react to underlying changes to a member cluster over time that affect its ability to run certain workloads.

One concrete use case might be to ensure that workloads only run on clusters that contain GPU nodes. As nodes are added to and removed from a cluster, whether any GPU nodes exist in that cluster may change over time. As a cluster operator detects these changes and updates some label on the member clusters to indicate whether GPU nodes are available, Fleet would automatically reschedule workloads that require GPU nodes onto a different member cluster.
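Building on the hypothetical types sketched earlier in the thread (same assumed Go types and metav1 import), the GPU use case might be expressed roughly like this. The label key gpu.example.com/available and the overall API shape are assumptions for illustration, not Fleet's actual API.

```go
// Hypothetical usage only: selects member clusters labeled as having GPU
// nodes, and (under the proposed semantics) keeps enforcing that requirement
// after placement, so removing the label would trigger rescheduling.
affinity := ClusterAffinity{
	RequiredDuringSchedulingRequiredDuringExecution: &ClusterSelector{
		ClusterSelectorTerms: []ClusterSelectorTerm{
			{
				LabelSelector: metav1.LabelSelector{
					MatchLabels: map[string]string{
						// Maintained by the cluster operator as GPU nodes
						// are added to or removed from the member cluster.
						"gpu.example.com/available": "true",
					},
				},
			},
		},
	},
}
_ = affinity
```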