Replies: 7 comments
-
Communication with what times out? Kubernetes, or your web app?
-
Well, the web apps stop working the moment I shut down that master, and each kubectl operation starts taking a couple of seconds to complete. Edit: as it turns out, the pods do get rescheduled onto another master, but only after 5 minutes. So when the node that holds CoreDNS, Traefik, and such dies, everything is unavailable for 5 minutes, which is way too long a downtime.
-
I think the most "proper" solution for that would be to increase the number of replicas for critical components like these - perhaps with some anti-affinity rules set to be safe (so they don't get placed on the same node). Alternatively, you can try changing some controller-manager flags.
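(Not from the reply itself, just a minimal sketch of what such an anti-affinity rule could look like, merged into a Deployment's pod template alongside `spec.replicas: 2`. The `k8s-app: kube-dns` label is what the k3s-bundled CoreDNS uses; adjust for other components.)

```yaml
# Sketch: merge into spec.template.spec of the Deployment.
# Assumes the k3s-bundled CoreDNS labels.
affinity:
  podAntiAffinity:
    # "preferred" rather than "required", so both replicas can still run
    # if only a single node is left, instead of one staying Pending
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            k8s-app: kube-dns
        topologyKey: kubernetes.io/hostname
```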
-
The pod replacement stuff is all core Kubernetes behavior, and can be tuned with args to the scheduler or controller-manager. For minimum downtime, you're probably better off ensuring that there are more replicas running on other nodes - given its eventual-consistency model, Kubernetes is never going to offer instantaneous replacement of pods lost when nodes unexpectedly go offline.
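(A minimal sketch of that per-workload tuning, not from the reply itself: the 5-minute delay observed above matches the default 300-second NoExecute tolerations Kubernetes attaches to every pod for unreachable/not-ready nodes, and those can be shortened. The 30s value is illustrative.)

```yaml
# Merge into spec.template.spec of the workload: evict its pods from an
# unreachable or not-ready node after 30s instead of the default 300s.
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
```

On the k3s side, controller-manager flags can be passed through the server process, e.g. `--kube-controller-manager-arg "node-monitor-grace-period=20s"` (an upstream kube-controller-manager flag; the 20s value is again illustrative).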
-
Okay, but is it safe to scale up things like CoreDNS or the Traefik ingress? I pretty much run everything the way it was installed by k3s.
-
Both CoreDNS and Traefik are stateless, so it's safe.
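For example, assuming the default k3s names in `kube-system`: `kubectl -n kube-system scale deployment coredns --replicas=2`, and likewise for the `traefik` deployment. Combined with the anti-affinity sketch above, the replicas should land on different nodes.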
-
But things are installed via Helm, and Rancher gives me a warning that things will get overwritten. Is there a way to make that setting persist after updates?
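(One k3s-native option, offered as a sketch: k3s's `HelmChartConfig` resource overrides the values of its packaged charts, and a file dropped into the server's manifests directory survives restarts and upgrades. The exact values key depends on the chart version k3s ships.)

```yaml
# Sketch: save as e.g. /var/lib/rancher/k3s/server/manifests/traefik-config.yaml
# on a server node. helm.cattle.io/v1 HelmChartConfig is k3s's mechanism for
# customizing its packaged charts; "replicas" assumes the Traefik 1.x chart
# bundled with k3s v1.20 - check your chart's values.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    replicas: 2
```

Helm-managed releases reconcile back to the chart's values, which is why an ad-hoc `kubectl scale` gets overwritten; pushing the override through the chart's own values avoids that.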
-
Environmental Info:
K3s Version:
k3s version v1.20.5+k3s1 (355fff3)
go version go1.15.10
Node(s) CPU architecture, OS, and Version:
2 vCPU, 4 GB RAM, x64, Ubuntu 20.04
Cluster Configuration:
Describe the bug:
The cluster becomes unavailable when the master-nbg1-1 node is taken offline.
This node is the one that initialized the cluster.
Steps To Reproduce:
Internally, all the nodes share a private network.
Expected behavior:
Cluster and deployed things like websites stay available
Actual behavior:
All communication times out.
I have no clue how to fix that; my best bet is that the Traefik ingress is not getting rescheduled onto another master (or maybe I didn't wait long enough, as I just rebooted the machine).
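(For checking this, `kubectl get pods -n kube-system -o wide --watch` shows where the bundled pods are scheduled and whether replacements come up, and `kubectl get nodes` shows whether the downed master is actually marked NotReady.)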