[VPA] Does not respond to OOM for workloads with non-uniform resource utilization #6420
Comments
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale
Thanks for taking a look at this, @voelzmo. I reran the stress test with VPA 1.0.0 to be sure: there are no
Hi guys! I saw that this was merged into main, but I think it didn't make it into a release. Is there any build/image with these changes yet? Or would I need to build a custom image from main?
Which component are you using?:
vertical-pod-autoscaler
What version of the component are you using?:
Component version: v1.0.0
What k8s version are you using (`kubectl version`)?:

What environment is this in?:
EKS 1.25
What did you expect to happen?:
Expected that VPA would respond to OOM by adjusting the memory recommendation according to the documented formula:
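The formula itself was not captured above. As a sketch of the documented OOM response (the 1.2 bump ratio and 100MB minimum bump are my reading of the VPA recommender's behavior, not verified against this exact version):

```python
# Sketch of the documented OOM bump: after an OOM kill, the recommender
# records an inflated memory sample rather than the raw usage at OOM.
# The constants below are assumptions based on the VPA docs/source.
OOM_BUMP_UP_RATIO = 1.2
OOM_MIN_BUMP_UP = 100 * 1024 * 1024  # 100MB minimum bump, in bytes

def oom_bumped_sample(memory_used: int) -> int:
    """Memory sample (bytes) recorded when a container OOMs at `memory_used`."""
    return max(memory_used + OOM_MIN_BUMP_UP,
               int(memory_used * OOM_BUMP_UP_RATIO))

# A container OOMing at 650Mi should yield a sample of roughly 650 * 1.2 = 780Mi.
print(oom_bumped_sample(650 * 1024 * 1024) // (1024 * 1024))
```

Under this model, an OOM should always push a sample well above the usage at the time of the kill; the question in this issue is whether that sample ever moves the recommendation.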
What happened instead?:
Containers OOM continuously; VPA's recommendation is never adjusted.
How to reproduce it (as minimally and precisely as possible):
The problem was originally observed with the `datadog-agent` DaemonSet. Its resource needs can vary by node depending on how many pods are running there and the amount of metrics emitted. Sometimes the gap is quite significant, like by a factor of 6. The issue is reproducible with a stress test.

Create a Deployment whose pods allocate a random amount of memory such that ~10% of them should OOM:
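The original manifest was not captured. A sketch of such a Deployment (the image, names, replica count, and sizes here are illustrative assumptions, not the original):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-stress
spec:
  replicas: 20
  selector:
    matchLabels:
      app: vpa-stress
  template:
    metadata:
      labels:
        app: vpa-stress
    spec:
      containers:
      - name: stress
        image: polinux/stress  # assumed stress image
        command: ["/bin/sh", "-c"]
        args:
        - |
          # ~10% of pods pick the oversized allocation and should OOM;
          # the rest stay in the 100-160Mi band. awk's srand() seeds from
          # wall-clock seconds, so pods started in the same second roll the
          # same value -- hence "replace some pods to roll the dice again".
          MB=$(awk 'BEGIN { srand(); if (rand() < 0.1) print 650; else print int(100 + rand() * 60) }')
          exec stress --vm 1 --vm-bytes "${MB}M" --vm-hang 0
        resources:
          requests:
            memory: 200Mi
          limits:
            memory: 200Mi
```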
Most `vpa-stress` pods will allocate between 100-160Mi of memory. The remaining 10% will allocate 650Mi of memory, causing OOM. If no OOM occurs after a couple of minutes, replace some pods to roll the dice again.

Turn up verbosity on the VPA updater:
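The exact command was not captured; assuming the updater exposes the standard klog `--v` verbosity flag (component and namespace names here are per a default VPA install), one way is to add it to the `vpa-updater` Deployment's container args:

```yaml
# kubectl -n kube-system edit deployment vpa-updater
spec:
  template:
    spec:
      containers:
      - name: updater
        args:
        - --v=4
```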
Create a VPA for the Deployment:
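The VPA manifest was also not captured; a minimal one targeting the Deployment (assuming the `autoscaling.k8s.io/v1` API and `Auto` update mode so the updater can evict pods) might look like:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-stress
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-stress
  updatePolicy:
    updateMode: "Auto"
```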
Expect logs like this from the VPA updater:
Anything else we need to know?:
VPA responds to OOM as expected when resource utilization across pods is uniform. As far as I can tell, the issue arises from the fact that only a small percentage of pods actually experience OOM. The target memory percentile of 90% is not affected by relatively infrequent OOM samples. This means that VPA never recommends a memory increase.
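The percentile argument can be illustrated with a toy model (a plain nearest-rank percentile over raw samples; VPA actually uses a decaying weighted histogram, so this is only directional):

```python
import math
import random

random.seed(42)

# Toy model of the stress test: 90 pods allocate 100-160Mi,
# 10 pods allocate 650Mi and OOM.
samples = [random.uniform(100, 160) for _ in range(90)] + [650.0] * 10
samples.sort()

# Nearest-rank 90th percentile: the smallest value >= 90% of samples.
p90 = samples[math.ceil(0.9 * len(samples)) - 1]
print(round(p90))  # lands inside the 100-160Mi band, nowhere near 650Mi
```

Even with 10% of samples at the elevated level, the 90th-percentile target sits inside the low band, so the recommendation never rises to cover the OOMing pods.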
In applying VPA to a DaemonSet such as the `datadog-agent` that may have non-uniform memory usage, I do not expect to reduce resource waste (all pods are subject to the same recommendation), but rather to reduce the toil of manually adjusting the DaemonSet's memory requests and limits as its requirements change with the workloads running on the cluster.