
[VPA] Does not respond to OOM for workloads with non-uniform resource utilization #6420

Closed
emla9 opened this issue Jan 4, 2024 · 6 comments · Fixed by #6660

Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@emla9
Contributor

emla9 commented Jan 4, 2024

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

Component version: v1.0.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.25.11
Kustomize Version: v4.5.7
Server Version: v1.25.16-eks-8cb36c9

What environment is this in?:

EKS 1.25

What did you expect to happen?:

Expected that VPA would respond to OOM by adjusting the memory recommendation according to the documented formula:

recommendation = memory-usage-in-oomkill-event + max(oom-min-bump-up-bytes, memory-usage-in-oomkill-event * oom-bump-up-ratio)
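
For reference, a quick arithmetic sketch of the formula as quoted, assuming the default flag values (oom-bump-up-ratio=1.2, oom-min-bump-up-bytes=100MiB) and the ~650Mi usage-at-OOM case from the repro below:

# Sketch of the quoted formula with assumed defaults (oom-bump-up-ratio=1.2,
# oom-min-bump-up-bytes=100MiB) and a 650Mi usage-at-OOM sample.
usage=$((650 * 1024 * 1024))        # memory-usage-in-oomkill-event
min_bump=$((100 * 1024 * 1024))     # oom-min-bump-up-bytes
ratio_bump=$((usage * 12 / 10))     # memory-usage-in-oomkill-event * oom-bump-up-ratio
bump=$(( ratio_bump > min_bump ? ratio_bump : min_bump ))
echo "$(( (usage + bump) / 1024 / 1024 ))Mi"   # prints 1430Mi per the formula above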

What happened instead?:

Containers OOM continuously; VPA's recommendation is never adjusted.

How to reproduce it (as minimally and precisely as possible):

The problem was originally observed with the datadog-agent DaemonSet. Its resource needs can vary by node, depending on how many pods are running there and how many metrics they emit. Sometimes the gap is significant, as much as a factor of 6. The issue is reproducible with a stress test.

Create a Deployment whose pods allocate a random amount of memory such that ~10% of them should OOM:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-stress
spec:
  replicas: 10
  selector:
    matchLabels:
      app: vpa-stress
  template:
    metadata:
      labels:
        app: vpa-stress
    spec:
      containers:
        - name: vpa-stress
          image: docker.io/elizabethla/vpa-stress:latest
          resources:
            limits:
              cpu: '1'
              memory: 200Mi
            requests:
              cpu: 50m
              memory: 200Mi

About 90% of the vpa-stress pods will allocate between 100 and 160Mi of memory. The remaining ~10% will allocate 650Mi of memory, causing OOM. If no OOM occurs after a couple of minutes, replace some pods to roll the dice again.
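
One way to re-roll, assuming the Deployment above:

kubectl get pods -l app=vpa-stress -o name | head -n 3 | xargs kubectl delete   # replace a few pods
kubectl rollout restart deployment/vpa-stress                                   # or replace them all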

Turn up verbosity on the VPA updater:

containers:
  - name: updater
    args:
      - --v=4
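
Assuming the default install (a vpa-updater Deployment in the kube-system namespace; adjust names to your setup), one way to make this change and then follow the logs:

kubectl -n kube-system edit deployment vpa-updater                # add --v=4 to the updater container's args
kubectl -n kube-system logs deploy/vpa-updater -f | grep -i oom   # watch for the OOM-related messages below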

Create a VPA for the Deployment:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-stress
spec:
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        controlledResources:
          - cpu
          - memory
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-stress
  updatePolicy:
    minReplicas: 1
    updateMode: Auto
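
Apply it and watch the result (the file name is illustrative):

kubectl apply -f vpa-stress-vpa.yaml         # the VPA manifest above
kubectl describe vpa vpa-stress              # Status shows the current recommendation
kubectl get pods -l app=vpa-stress --watch   # watch for OOMKilled restarts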

You should then see log lines like the following from the VPA updater, showing that the quick OOM is detected but no update is applied:

I1228 16:14:47.820536       1 update_priority_calculator.go:114] quick OOM detected in pod vpa-stress-dev/vpa-stress-56b94449d8-w7r2h, container vpa-stress
I1228 16:14:47.820550       1 update_priority_calculator.go:140] not updating pod vpa-stress-dev/vpa-stress-56b94449d8-w7r2h because resource would not change

Anything else we need to know?:

VPA responds to OOM as expected when resource utilization across pods is uniform. As far as I can tell, the issue arises from the fact that only a small percentage of pods actually experience OOM. The target memory percentile of 90% is not affected by relatively infrequent OOM samples. This means that VPA never recommends a memory increase.
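
A toy illustration of this point (not VPA's actual decaying histogram, just a plain percentile over raw samples): with 90 samples around 150Mi and 10 OOM-level samples at 780Mi, the 90th percentile is still 150Mi.

# 90 "normal" samples at 150Mi plus 10 OOM-level samples at 780Mi:
# the 90th percentile stays at 150Mi, so infrequent OOMs never raise the target.
{ for i in $(seq 90); do echo 150; done; for i in $(seq 10); do echo 780; done; } \
  | sort -n | awk '{ v[NR] = $1 } END { print "p90 =", v[int(NR * 0.9)], "Mi" }'   # prints: p90 = 150 Mi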

In applying VPA to a DaemonSet such as the datadog-agent that may have non-uniform memory usage, I do not expect to reduce resource waste (all pods are subject to the same recommendation), but rather to reduce the toil of manually adjusting the DaemonSet's memory requests and limits as its requirements change with the workloads running on the cluster.

@emla9 emla9 added the kind/bug Categorizes issue or PR as related to a bug. label Jan 4, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2024
@emla9
Contributor Author

emla9 commented Apr 3, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2024
@voelzmo
Contributor

voelzmo commented Apr 30, 2024

This is interesting and possibly related to #6705
@emla9: Are you observing the same messages in the recommender logs as described in that issue? We might close this one and move the conversation to #6705 instead, as it has a bit more context already.

@emla9
Contributor Author

emla9 commented May 1, 2024

Thanks for taking a look at this, @voelzmo. I reran the stress test with VPA 1.0.0 to be sure: there are no KeyErrors in the recommender logs about vpa-stress pods. When we initially noticed this issue with the datadog-agent, we did see some KeyErrors, but only in 6 out of 62 total OOMs observed over the course of an hour. #6705 possibly contributes to the issue here when OOMs are very fast, but the problem exists even without any KeyErrors.

@voelzmo
Contributor

voelzmo commented May 6, 2024

Hey @emla9 thanks for checking for those KeyErrors. As mentioned in #6660, I think I understand how this would solve the issue you're describing here.

@felipewnp

Hi guys!

I saw that this was merged into main, but I think it didn't make it into vpa-recommender:1.1.2, since that image throws an error: unknown flag: --target-memory-percentile.

Is there any build/image with these changes yet?

Or would I need to build a custom image from main?
