delete pdb, if all pods are not in a running state #27

szuecs · 2019-09-23T09:17:43Z

We observed an issue where a Prometheus StatefulSet with 2 replicas was not in a running state; its pods were crashing all the time.
In a discussion it turned out that there is probably a 5-minute timeout before the PDB is deleted.
The argument is: if all pods matched by a PDB are crashing, then you can safely delete the PDB to help with faster recovery.

mikkeloscar · 2019-09-24T19:56:45Z

We have a 5-minute TTL defined here: https://github.com/zalando-incubator/kubernetes-on-aws/blob/89b380939fd34dcbc9af347a55c2f70e36755c70/cluster/manifests/prometheus/statefulset.yaml#L5. However, because of a bug (fixed in #28), this TTL was never actually effective.

With this bug fixed, I suggest we try the 5-minute TTL and see how effective it is. We could also lower it a bit, but we may not want to remove it completely: we determine whether a PDB should be removed by looking at the pod ready state, and pods with a slow startup can take a while to become ready. We could of course also look at a more specific signal like CrashLoopBackOff, but I would rather stay with the simple, generic signal of PodReady state plus a TTL unless we really need a very specific check.
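To make the PodReady-plus-TTL idea concrete, here is a minimal sketch of the decision only. It is not the controller's actual code: the `Pod` type, the `should_delete_pdb` function, and the `not_ready_since` timestamp are hypothetical names introduced for illustration, and the 5-minute default simply mirrors the TTL mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Pod:
    """Hypothetical, simplified pod: only its name and PodReady condition."""
    name: str
    ready: bool


def should_delete_pdb(
    pods: List[Pod],
    not_ready_since: Optional[datetime],
    now: datetime,
    ttl: timedelta = timedelta(minutes=5),  # assumed default, mirrors the TTL above
) -> bool:
    """Return True when no matched pod is ready and the TTL has elapsed.

    not_ready_since is an assumed bookkeeping value: the time at which the
    matched pods were first observed with none of them ready.
    """
    if any(pod.ready for pod in pods):
        # At least one pod is ready, so the PDB still protects something.
        return False
    if not_ready_since is None:
        # Non-ready state has not been recorded yet, so the TTL cannot have elapsed.
        return False
    return now - not_ready_since >= ttl


# Usage: two crashing replicas, checked before and after the TTL.
start = datetime(2019, 9, 23, 9, 0, 0)
crashing = [Pod("prometheus-0", ready=False), Pod("prometheus-1", ready=False)]
early = should_delete_pdb(crashing, start, start + timedelta(minutes=2))
late = should_delete_pdb(crashing, start, start + timedelta(minutes=6))
```

In this sketch the TTL is what gives slow-starting pods time to become ready before the PDB is removed; a CrashLoopBackOff check would replace the `ready` flag with a more specific signal.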

WDYT?

szuecs · 2019-09-24T20:04:45Z

@mikkeloscar fine for me.
