ImagePolicy based deployments with PodDisruptionBudget when up against a resource constraint #4997
Unanswered
eli-persona asked this question in Q&A
Replies: 0 comments
Consider the following stripped down example:
Below is the minimal spec of a GPU-accelerated inference service deployment, with 2 replicas.
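Roughly like this (placeholder names; assuming each replica requests one `nvidia.com/gpu` and the image tag is kept current by Flux image automation via an ImagePolicy marker):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference                # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
        - name: server
          # tag assumed to be rewritten automatically via an ImagePolicy marker
          image: registry.example.com/inference:v1.0.0 # {"$imagepolicy": "flux-system:inference"}
          resources:
            limits:
              nvidia.com/gpu: 1  # each replica pins exactly one GPU
```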
Let's also say I have exactly 2 GPUs in my cluster.
What currently happens: the rollout gets stuck. Both GPUs are already held by the running replicas, so the pod for the new image cannot be scheduled and sits in Pending, and the old pods are never taken down to free a GPU.
What I'd like to happen: the pod running the newest image to take precedence, preempting one of the old-version pods so it can claim that GPU, while the PodDisruptionBudget keeps at least one replica serving throughout (zero downtime).
Does anyone have advice on how to get this unstuck while ensuring a zero-downtime deployment? I'm having trouble constructing a setup that allows for preemption alongside a PodDisruptionBudget while making sure that the newest version has the highest priority, since under the hood there is only a single Deployment configuration (shown here).
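For illustration, the pieces I have been trying to combine look roughly like this (placeholder names). Because `priorityClassName` lives in the Deployment's pod template, the old and new ReplicaSets end up with the same priority, so the scheduler never preempts an old-version pod in favour of the new one:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-critical       # placeholder name
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: GPU inference pods
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb            # placeholder name
spec:
  minAvailable: 1                # always keep at least one replica serving
  selector:
    matchLabels:
      app: inference
---
# Referenced from the Deployment's pod template, so every ReplicaSet
# (old and new alike) gets the same priority:
#   spec:
#     priorityClassName: inference-critical
```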