Support out of the box self-scaling for the VPA with updateMode=Recreate #7450
Comments
/area vertical-pod-autoscaler
This seems like a pretty niche problem. My vote is to run multiple replicas. Adding a feature to allow the VPA to manage itself seems a bit too complex.
Would a preStop hook help here?
Thanks for the response! For what it's worth, I've seen people actually use self-scaling in their production clusters, and I notice the constant eviction events every minute until the race condition resolves. So it does happen in practice. I think it could technically work already if we register a preStop hook with the sleep feature (either using that feature gate, or manually execing a sleep), but the tradeoff is that the pod will stay in the Terminating state for a fixed number of seconds. The solution I currently have terminates much faster than a potential sleep would, if the replacement pod has already been created. I can create a PR with what I have so people can see how complex it really is. If we still think it's too much for this problem, would it be enough for me to close this with a PR that enables a …
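For reference, a minimal sketch of the preStop sleep idea (assuming the `PodLifecycleSleepAction` feature gate for the native `sleep` handler, or a plain exec'd sleep without it; the Deployment and container names are assumed to match the default VPA install):

```yaml
# Fragment of the vpa-admission-controller Deployment's pod template (sketch)
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45   # must be longer than the preStop sleep
      containers:
        - name: admission-controller
          lifecycle:
            preStop:
              sleep:                      # native handler; needs PodLifecycleSleepAction
                seconds: 30
              # exec:                     # alternative without the feature gate:
              #   command: ["sleep", "30"]
```

The tradeoff is the one mentioned above: the pod always sits in Terminating for the full sleep, even when the replacement was admitted sooner.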
This still feels very niche to me, when there are other Kubernetes mechanisms in place to protect workloads, such as the ones I've described above. Additionally, running a single admission-controller pod is dangerous for other VPA workloads. If the node it was on were to be deleted (some bad event or regular Kubernetes upgrades, perhaps), then the admission-controller wouldn't be around to serve other Pod creation webhooks. Encouraging users to run the admission-controller in a non-HA fashion seems like a bad idea to me.
Fair enough, thanks for the insight. Is this more of a documentation issue then? I think the fact that people are doing this is enough to have some sort of guide or best-practices blurb for self-scaling the VPA.
Yup! I 100% agree on that.
I'll make a PR to update the documentation.
/triage accepted
Which component are you using?:
Vertical Pod Autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
When using the VPA on itself (i.e. creating a VPA object with `updateMode=Recreate` and `minReplicas=1` whose `targetRef` targets the `admission-controller` Deployment), there is a race between the webhook server admitting its replacement pod and the webhook server pod being terminated. As a result, the recommended requests/limits are sometimes mutated into the new pod definition and sometimes not.
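For concreteness, the self-targeting setup described above looks roughly like this (a sketch; the Deployment name and namespace are assumed to match the default VPA install):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-admission-controller
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-admission-controller   # the VPA manages its own webhook server
  updatePolicy:
    updateMode: Recreate
    minReplicas: 1
```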
Describe the solution you'd like.:
Add a `--self-scaling=true/false` flag to the admission webhook that enables a shutdown hook which checks for the existence of a VPA CR targeting itself. If one exists, the terminating pod waits up to 30 seconds (the terminationGracePeriod) for the replacement pod to be admitted through the webhook, and then terminates gracefully.
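To illustrate where this would live, the proposed flag (it does not exist today; this issue is proposing it) might be wired into the admission-controller Deployment like so:

```yaml
# Hypothetical fragment: --self-scaling is the flag proposed by this issue,
# not an existing admission-controller option.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: admission-controller
          args:
            - --self-scaling=true   # enable the shutdown hook described above
```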
Describe any alternative solutions you've considered.:
Another solution was to scale the admission-controller Deployment to 2 pods whenever self-scaling is enabled, but that wastes more resources than necessary.
Another workaround is to set `updateMode=Off` and apply requests/limits manually, but this eliminates the autoscaling nature entirely. And of course, this entire problem would be solved by in-place vertical scaling, since the pod would no longer have to be recreated; but that feature seems far off from GA, so I think this is a good enough solution in the meantime.
Additional context.:
I'd love to know if there are already solutions to this, but I have not found anything online nor any discussion about it.
Also, I already have code for this and tested that this solution works, so if people think this is a good idea or want to see what it looks like, I can submit a PR for it.