Use kubectl rollout restart on single-replica Deployments with PDB #928
Thanks for opening this issue.
Yeah, this would practically only help when you have single-replica Deployments. Single-replica StatefulSets would not really benefit from this, and in that case a PDB makes even less sense, as you can't configure a StatefulSet to "surge" a temporarily increased number of pods. Maybe an edge case would be if you used the OpenKruise CloneSet with max surge, which is essentially an extended Deployment that also supports things like a PVC per pod.

Which brings up a big downside of this approach: it's not very portable. Should kured only have a special case for Deployment resources? Is there a way to design it so it becomes agnostic? Should kured implement special cases for CustomResources like the OpenKruise CloneSet I mentioned earlier?

In my tunnel-visioned opinion, only having a special case for Deployment is fine :) Partly because that's the solution that would help me, but as this seems to be a common need, maybe that's acceptable? There are some Kubernetes proposals on improving this whole situation, such as kubernetes/enhancements#4213, which further supports my view that a "hardcoded fix"/"special behavior for Deployments" might be sufficient for now.

Wdyt? Are you up for the idea? And if so, would you accept a PR for it?
This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).
Still relevant
This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).
Still relevant
Do you really want to grant kured cluster-wide access to patch Deployments, StatefulSets, and DaemonSets, instead of simply raising the minimum replica count to 2? That does not sound like a good trajectory to me. We should aim for least privilege.

I can however think of two ways this could be achieved: it could be done with limited rights with the help of another operator, or the code could be modified to drain in a different way, introducing this feature at the expense of security.

Overall, with the current state of kured, I am not really fond of the approach -- all these changes and decreased security because someone with a microservice doesn't want to deploy 2 pods doesn't sound acceptable. If we had a mechanism to alter the ordering of the actions based on user input, that might be easier, but it's not part of the code as of today.
If you have other ideas, I am all ears :)
Yeah, the idea doesn't sit completely well with me either, especially since it's not a generic solution: it only applies to Deployments and covers neither StatefulSets nor DaemonSets. It could be considered "too domain specific".
I feel the need to defend this a little, especially so that my motivation is understood. The use case is scaling down resources where we have many identical tenants/environments that we want to keep as small as possible, so that we can fit as many as possible on the same nodes. And scaling down CPU/memory resources vs. scaling down replicas do not behave the same (unless you have an uncommonly steady load with no spikes whatsoever). The argument I see often is "well, if you only have 1 replica then you're already stating that you're OK with downtime" -- well, not exactly, because planned downtime and unplanned downtime are two very different things and happen at two very different intervals. I don't want downtime during planned downtime, but I'm OK with unplanned downtime.

Another adjacent use case is when a (micro)service is not designed for multiple replicas, i.e. when the job/task can't run in parallel. You then need to figure out some kind of distributed lock via something like Redis, but then you've added Redis to the stack, and the whole idea was to scale down. Some concrete examples of this are a "garbage collector"/cleanup type of service (e.g. "scan the database for old records and edit or delete them"), or a service that scans something (e.g. rows in a database table) and then triggers alerts/webhooks based on it (you wouldn't want duplicate alerts just because you have 2 replicas). More generally, it's when 2 replicas would result in either race conditions or duplicate work.

But back on topic. Your idea of having a separate operator for this might fit well, such as an operator that listens for admission webhooks for when a pod of a Deployment is being deleted/evicted and then performs this "rollout restart" approach instead. Such a tool would not need to be related to kured at all, and could then more easily be installed in a select few namespaces.

At our workplace we had a very small selection of services where we used a single replica with a PDB and where we felt we needed a solution for this. We were running "unattended upgrades + kured" on a monthly schedule, which meant our nodes were restarted once a month, so this issue affected us often. However, we've since reconsidered the impact and either increased the replica count to 2 or removed the PDB (allowing the brief monthly downtime) on a per-service basis.

Some of the alternative scaling-down solutions we've discussed at our workplace involve making these services more "multi-tenant capable": instead of having the same service duplicated across a lot of tenants/environments, we'd run it only once. Then even if you run 2 (or even 3) replicas of that new multi-tenant service, it's still fewer resources than 20 single-replica, tenant/environment-specific (i.e. non-multi-tenant, 1 per tenant) instances.

From my end I'm OK with you closing this issue now as "Not planned". If the need comes back to us in the future, we might consider writing the hypothetical operator we talked about in this thread. But I wouldn't hold my breath.
Good day! First off, really nice tool ❤️ Works exactly as advertised.

We've hit an issue where we want to have Deployments with `.spec.replicas: 1`, but still make use of a PodDisruptionBudget (PDB). Most Kubernetes discussions and documentation basically summarize this as a "bad idea", usually with the argument "if you only have 1 replica, then you have already agreed to not having it HA".

When kured evicts a pod (or when `kubectl drain` evicts one), it immediately terminates the existing pod before the replacement is up and running, leading to unwanted downtime/disruptions. Adding a PDB is supposed to solve this, but when you only have 1 replica the PDB will basically always report `.status.disruptionsAllowed: 0`, leading to dead-locks in the whole node drain and reboot procedure.

At our workplace we have our microservices deployed for two different use cases:

1. `replicas: 2` or more, and then no issue with PDB and evictions.
2. `replicas: 1`, to reduce the resource consumption of having a lot of these smaller installations.

This issue is only about use case 2.
Proposed solution: use `kubectl rollout restart`

The `kubectl rollout restart` command basically just adds an annotation to the pod template of the Deployment or StatefulSet resource, forcing it to roll out a new revision. This article explains a way around the eviction block by making use of this rollout-restart approach: https://www.artur-rodrigues.com/tech/2023/03/30/impossible-kubectl-drains.html
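For reference, here is a minimal client-go sketch of what `kubectl rollout restart` effectively does for a Deployment. The `kubectl.kubernetes.io/restartedAt` annotation is the one kubectl itself sets; the package and function names are purely illustrative:

```go
package rolloutrestart

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// restartDeployment mimics `kubectl rollout restart deployment/<name>`: it
// patches the pod template with a timestamp annotation, which changes the
// template and makes the Deployment controller roll out replacement pods
// (respecting maxSurge/maxUnavailable) instead of killing the old pod first.
func restartDeployment(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"metadata": map[string]interface{}{
					"annotations": map[string]string{
						"kubectl.kubernetes.io/restartedAt": time.Now().Format(time.RFC3339),
					},
				},
			},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return fmt.Errorf("marshal restart patch: %w", err)
	}
	_, err = cs.AppsV1().Deployments(namespace).Patch(ctx, name,
		types.StrategicMergePatchType, data, metav1.PatchOptions{})
	return err
}
```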
What I envision kured could do is:

- If a pod has a PDB that reports `.status.disruptionsAllowed: 0` and is owned by a Deployment with `.spec.replicas: 1`, then issue a rollout restart instead of an eviction.
- If a pod is annotated with `kured.dev/strategy: rollout-restart` and is owned by a Deployment, then always use the rollout-restart approach on it.

The first bullet point would be a "heuristic-based approach" that should solve a lot of people's issues without being too intrusive. Using the rollout-restart approach restarts all of the pods in the Deployment, so it's not a given that the user wants this when replicas > 1; for those cases the annotation provides opt-in behavior.
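To make those two rules concrete, here is a rough sketch of the decision logic, assuming kured has already hit a PDB with `.status.disruptionsAllowed: 0` for the pod it wants to evict. The function names and wiring are illustrative, not kured's actual drain code, and it would reuse a restart helper like the one sketched above:

```go
package rolloutrestart

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ownerDeployment walks Pod -> ReplicaSet -> Deployment owner references and
// returns the owning Deployment's name, if any.
func ownerDeployment(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) (string, bool, error) {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind != "ReplicaSet" {
			continue
		}
		rs, err := cs.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, ref.Name, metav1.GetOptions{})
		if err != nil {
			return "", false, err
		}
		for _, rsRef := range rs.OwnerReferences {
			if rsRef.Kind == "Deployment" {
				return rsRef.Name, true, nil
			}
		}
	}
	return "", false, nil
}

// shouldRolloutRestart implements the two proposed rules for a pod whose
// eviction is blocked: opt-in via the kured.dev/strategy annotation, or the
// heuristic "owned by a Deployment with exactly one replica".
func shouldRolloutRestart(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) (string, bool, error) {
	deployName, owned, err := ownerDeployment(ctx, cs, pod)
	if err != nil || !owned {
		return "", false, err
	}
	// Explicit opt-in: always recycle this pod via rollout restart.
	if pod.Annotations["kured.dev/strategy"] == "rollout-restart" {
		return deployName, true, nil
	}
	// Heuristic: a single-replica Deployment can never satisfy its PDB during
	// an eviction, so fall back to a rollout restart.
	deploy, err := cs.AppsV1().Deployments(pod.Namespace).Get(ctx, deployName, metav1.GetOptions{})
	if err != nil {
		return "", false, err
	}
	if deploy.Spec.Replicas != nil && *deploy.Spec.Replicas == 1 {
		return deployName, true, nil
	}
	return "", false, nil
}
```

The drain loop would then call the restart helper for pods where this returns true, and wait for the replacement pod to become Ready before continuing with the node drain and reboot.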