
Support out of the box self-scaling for the VPA with updateMode=Recreate #7450

Open
maxcao13 opened this issue Oct 31, 2024 · 9 comments
Labels: area/vertical-pod-autoscaler · kind/feature · triage/accepted

Comments

@maxcao13

Which component are you using?:

Vertical Pod Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

When using the VPA on itself (i.e. creating a VPA object with updateMode=Recreate and minReplicas=1 whose targetRef points at the admission-controller Deployment), there is a race between the webhook server admitting its replacement pod and the webhook server pod itself being terminated. As a result, the recommended requests/limits are sometimes not mutated into the new pod definition, and sometimes they are.
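For reference, a self-targeting VPA of the kind described might look like the following. This is a hypothetical manifest; the namespace and Deployment name assume a default kube-system install of the VPA components:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-admission-controller   # hypothetical name
  namespace: kube-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-admission-controller  # the webhook server itself
  updatePolicy:
    updateMode: Recreate
    minReplicas: 1
```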

Describe the solution you'd like.:

Add a --self-scaling=true/false flag to the admission controller that enables a shutdown hook which checks for the existence of a VPA CR targeting the controller itself. If one exists, the terminating pod waits up to 30 seconds (the terminationGracePeriod) for the replacement pod to be admitted through the webhook, and then terminates gracefully.

Describe any alternative solutions you've considered.:
Another solution is to scale the admission-controller Deployment to 2 pods when self-scaling is enabled, but that wastes more resources than necessary.

Another workaround is to set updateMode=Off and apply requests/limits manually, but this defeats the purpose of autoscaling.

And of course, this entire problem would be solved by in-place vertical scaling, since the pod would no longer need to be recreated, but that feature seems far from GA, so I think this is a good enough solution in the meantime.

Additional context.:

I'd love to know if there are already solutions to this, but I have not found anything online nor any discussion about it.

Also, I already have code for this and tested that this solution works, so if people think this is a good idea or want to see what it looks like, I can submit a PR for it.

@maxcao13 maxcao13 added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 31, 2024
@maxcao13
Author

/area vertical-pod-autoscaler

@adrianmoisey
Member

This seems like a pretty niche problem. My vote is to run multiple replicas. Adding a feature to allow the VPA to manage itself seems like it's a bit too complex.

@adrianmoisey
Member

Would a preStop hook help here?
Maybe even the upcoming feature: kubernetes/enhancements#3960

@maxcao13
Author

Thanks for the response!

For what it's worth, I've seen people use self-scaling in their production clusters, and I've noticed the constant eviction events every minute until the race condition resolves. So it does happen in practice.

I think it could technically work already by registering a preStop hook with the sleep feature (either via that feature gate, or by manually exec'ing a sleep), but the tradeoff is that the pod stays in the Terminating state for a fixed number of seconds. The solution I currently have terminates much faster than a fixed sleep would, as soon as the replacement pod has been created.
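For reference, the preStop sleep approach might look like this on the admission-controller Deployment. This is a hypothetical snippet: the sleep action requires the PodLifecycleSleepAction feature gate from kubernetes/enhancements#3960, and on clusters without it an exec of sleep works only if the container image ships a sleep binary.

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45  # must exceed the preStop sleep
      containers:
      - name: admission-controller
        lifecycle:
          preStop:
            sleep:
              seconds: 30  # fixed wait; pod stays Terminating the whole time
```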

I can create a PR with what I have so people can see how complex it really is. If we still think it's too much for this problem, would it be enough to close this with a PR that adds a self-scaling flag which programmatically registers a shutdown hook in the admission controller's Go code that sleeps for 30 seconds?

@adrianmoisey
Member

> I can create a PR with what I have so people can see how complex it really is. If we still think it's too much for this problem, would it be enough for me to close this with a PR that enables a self-scaling flag that just programmatically registers a shutdown hook in the admission go code that just sleeps for 30 seconds?

This still feels very niche to me, when there are other Kubernetes mechanisms in place to protect workloads, such as the ones I've described above.

Additionally, running a single admission-controller pod is dangerous for other VPA workloads. If the node it was on were to be deleted (some bad event, or regular Kubernetes upgrades, perhaps), then the admission-controller wouldn't be around to serve other Pod creation webhooks.

Encouraging users to run the admission-controller in a non-HA fashion seems like a bad idea to me.

@maxcao13
Author

Fair enough, thanks for the insight.

Is this more of a documentation issue then? I think the fact that people are doing this is enough to have some sort of guide or best practices blurb for self-scaling the VPA.

@adrianmoisey
Member

Yup! I 100% agree on that.
I don't think it's documented that you should have multiple admission-controllers running, and a single (active) updater and recommender.
Those are lessons that took me a while to learn.

@adrianmoisey
Member

I'll make a PR to update the documentation.
/assign

@Shubham82
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Nov 5, 2024