
[VPA] Does not respond to OOM for workloads with non-uniform resource utilization #6420

Closed
emla9 opened this issue Jan 4, 2024 · 6 comments · Fixed by #6660

Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@emla9
Contributor

emla9 commented Jan 4, 2024

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

Component version: v1.0.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.25.11
Kustomize Version: v4.5.7
Server Version: v1.25.16-eks-8cb36c9

What environment is this in?:

EKS 1.25

What did you expect to happen?:

Expected that VPA would respond to OOM by adjusting the memory recommendation according to the documented formula:

recommendation = memory-usage-in-oomkill-event + max(oom-min-bump-up-bytes, memory-usage-in-oomkill-event * oom-bump-up-ratio)
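
For reference, a quick arithmetic sketch of the formula as quoted, assuming the default flag values (oom-bump-up-ratio=1.2, oom-min-bump-up-bytes=100MiB) and the ~650Mi usage-at-OOM case from the repro below:

# Sketch of the quoted formula with assumed defaults (oom-bump-up-ratio=1.2,
# oom-min-bump-up-bytes=100MiB) and a 650Mi usage-at-OOM sample.
usage=$((650 * 1024 * 1024))        # memory-usage-in-oomkill-event
min_bump=$((100 * 1024 * 1024))     # oom-min-bump-up-bytes
ratio_bump=$((usage * 12 / 10))     # memory-usage-in-oomkill-event * oom-bump-up-ratio
bump=$(( ratio_bump > min_bump ? ratio_bump : min_bump ))
echo "$(( (usage + bump) / 1024 / 1024 ))Mi"   # prints 1430Mi per the formula above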

What happened instead?:

Containers OOM continuously; VPA's recommendation is never adjusted.

How to reproduce it (as minimally and precisely as possible):

The problem was originally observed with the datadog-agent DaemonSet. Its resource needs can vary by node, depending on how many pods are running there and how many metrics they emit. Sometimes the gap is significant, as much as a factor of 6. The issue is reproducible with a stress test.

Create a Deployment whose pods allocate a random amount of memory such that ~10% of them should OOM:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-stress
spec:
  replicas: 10
  selector:
    matchLabels:
      app: vpa-stress
  template:
    metadata:
      labels:
        app: vpa-stress
    spec:
      containers:
        - name: vpa-stress
          image: docker.io/elizabethla/vpa-stress:latest
          resources:
            limits:
              cpu: '1'
              memory: 200Mi
            requests:
              cpu: 50m
              memory: 200Mi

About 90% of the vpa-stress pods will allocate between 100 and 160Mi of memory. The remaining ~10% will allocate 650Mi of memory, causing OOM. If no OOM occurs after a couple of minutes, replace some pods to roll the dice again.
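
One way to re-roll, assuming the Deployment above:

kubectl get pods -l app=vpa-stress -o name | head -n 3 | xargs kubectl delete   # replace a few pods
kubectl rollout restart deployment/vpa-stress                                   # or replace them all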

Turn up verbosity on the VPA updater:

containers:
  - name: updater
    args:
      - --v=4
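
Assuming the default install (a vpa-updater Deployment in the kube-system namespace; adjust names to your setup), one way to make this change and then follow the logs:

kubectl -n kube-system edit deployment vpa-updater                # add --v=4 to the updater container's args
kubectl -n kube-system logs deploy/vpa-updater -f | grep -i oom   # watch for the OOM-related messages below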

Create a VPA for the Deployment:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-stress
spec:
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        controlledResources:
          - cpu
          - memory
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-stress
  updatePolicy:
    minReplicas: 1
    updateMode: Auto
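
Apply it and watch the result (the file name is illustrative):

kubectl apply -f vpa-stress-vpa.yaml         # the VPA manifest above
kubectl describe vpa vpa-stress              # Status shows the current recommendation
kubectl get pods -l app=vpa-stress --watch   # watch for OOMKilled restarts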

You should then see log lines like the following from the VPA updater, showing that the quick OOM is detected but no update is applied:

I1228 16:14:47.820536       1 update_priority_calculator.go:114] quick OOM detected in pod vpa-stress-dev/vpa-stress-56b94449d8-w7r2h, container vpa-stress
I1228 16:14:47.820550       1 update_priority_calculator.go:140] not updating pod vpa-stress-dev/vpa-stress-56b94449d8-w7r2h because resource would not change

Anything else we need to know?:

VPA responds to OOM as expected when resource utilization across pods is uniform. As far as I can tell, the issue arises from the fact that only a small percentage of pods actually experience OOM. The target memory percentile of 90% is not affected by relatively infrequent OOM samples. This means that VPA never recommends a memory increase.
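
A toy illustration of this point (not VPA's actual decaying histogram, just a plain percentile over raw samples): with 90 samples around 150Mi and 10 OOM-level samples at 780Mi, the 90th percentile is still 150Mi.

# 90 "normal" samples at 150Mi plus 10 OOM-level samples at 780Mi:
# the 90th percentile stays at 150Mi, so infrequent OOMs never raise the target.
{ for i in $(seq 90); do echo 150; done; for i in $(seq 10); do echo 780; done; } \
  | sort -n | awk '{ v[NR] = $1 } END { print "p90 =", v[int(NR * 0.9)], "Mi" }'   # prints: p90 = 150 Mi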

In applying VPA to a DaemonSet such as the datadog-agent that may have non-uniform memory usage, I do not expect to reduce resource waste (all pods are subject to the same recommendation), but rather to reduce the toil of manually adjusting the DaemonSet's memory requests and limits as its requirements change with the workloads running on the cluster.

@emla9 emla9 added the kind/bug Categorizes issue or PR as related to a bug. label Jan 4, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2024
@emla9
Contributor Author

emla9 commented Apr 3, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2024
@voelzmo
Contributor

voelzmo commented Apr 30, 2024

This is interesting and possibly related to #6705
@emla9: Are you observing the same messages in the recommender logs as described in that issue? We might close this one and move the conversation to #6705 instead, as it has a bit more context already.

@emla9
Contributor Author

emla9 commented May 1, 2024

Thanks for taking a look at this, @voelzmo. I reran the stress test with VPA 1.0.0 to be sure: there are no KeyErrors in the recommender logs about vpa-stress pods. When we initially noticed this issue with the datadog-agent, we did see some KeyErrors, but only in 6 out of 62 total OOMs observed over the course of an hour. #6705 possibly contributes to the issue here when OOMs are very fast, but the problem exists even without any KeyErrors.

@voelzmo
Contributor

voelzmo commented May 6, 2024

Hey @emla9 thanks for checking for those KeyErrors. As mentioned in #6660, I think I understand how this would solve the issue you're describing here.

@felipewnp

Hi guys!

I saw that this was merged into main, but I think it didn't make it into vpa-recommender:1.1.2, since that image throws an error: unknown flag: --target-memory-percentile.

Is there any build/image with these changes yet?

Or would I need to build a custom image from main?
