Replies: 7 comments
-
Communication with what times out? Kubernetes, or your web app?
-
Well, the web apps stop working the moment I shut down that master, and each kubectl operation starts taking a couple of seconds to complete. Edit: as it turns out, the pods do get rescheduled onto another master, but only after 5 minutes. So when the node that holds CoreDNS, Traefik, and such dies, everything is unavailable for 5 minutes, which is way too long a downtime.
-
I think the most "proper" solution for that would be to increase the number of replicas for critical components like these - perhaps with some anti-affinity rules set to be safe (so they don't get placed on the same node). Alternatively, you can try changing some controller-manager flags.
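(Not from the reply itself, just a minimal sketch of what such an anti-affinity rule could look like, merged into a Deployment's pod template alongside `spec.replicas: 2`. The `k8s-app: kube-dns` label is what the k3s-bundled CoreDNS uses; adjust for other components.)

```yaml
# Sketch: merge into spec.template.spec of the Deployment.
# Assumes the k3s-bundled CoreDNS labels.
affinity:
  podAntiAffinity:
    # "preferred" rather than "required", so both replicas can still run
    # if only a single node is left, instead of one staying Pending
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            k8s-app: kube-dns
        topologyKey: kubernetes.io/hostname
```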
-
The pod replacement stuff is all core Kubernetes behavior, and can be tuned with args to the scheduler or controller-manager. For minimum downtime, you're probably better off ensuring that there are more replicas running on other nodes - given its eventual-consistency model, Kubernetes is never going to offer instantaneous replacement of pods lost when nodes unexpectedly go offline.
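(A minimal sketch of that per-workload tuning, not from the reply itself: the 5-minute delay observed above matches the default 300-second NoExecute tolerations Kubernetes attaches to every pod for unreachable/not-ready nodes, and those can be shortened. The 30s value is illustrative.)

```yaml
# Merge into spec.template.spec of the workload: evict its pods from an
# unreachable or not-ready node after 30s instead of the default 300s.
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
```

On the k3s side, controller-manager flags can be passed through the server process, e.g. `--kube-controller-manager-arg "node-monitor-grace-period=20s"` (an upstream kube-controller-manager flag; the 20s value is again illustrative).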
-
Okay, but is it safe to scale up things like CoreDNS or the Traefik ingress? I pretty much run everything the way it was installed by k3s.
-
Both CoreDNS and Traefik are stateless, so it's safe.
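For example, assuming the default k3s names in `kube-system`: `kubectl -n kube-system scale deployment coredns --replicas=2`, and likewise for the `traefik` deployment. Combined with the anti-affinity sketch above, the replicas should land on different nodes.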
-
But things are installed via Helm, and Rancher gives me a warning that things will get overwritten. Is there a way to make that setting persist after updates?
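(One k3s-native option, offered as a sketch: k3s's `HelmChartConfig` resource overrides the values of its packaged charts, and a file dropped into the server's manifests directory survives restarts and upgrades. The exact values key depends on the chart version k3s ships.)

```yaml
# Sketch: save as e.g. /var/lib/rancher/k3s/server/manifests/traefik-config.yaml
# on a server node. helm.cattle.io/v1 HelmChartConfig is k3s's mechanism for
# customizing its packaged charts; "replicas" assumes the Traefik 1.x chart
# bundled with k3s v1.20 - check your chart's values.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    replicas: 2
```

Helm-managed releases reconcile back to the chart's values, which is why an ad-hoc `kubectl scale` gets overwritten; pushing the override through the chart's own values avoids that.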
-
Environmental Info:
K3s Version:
k3s version v1.20.5+k3s1 (355fff3)
go version go1.15.10
Node(s) CPU architecture, OS, and Version:
2 vCPU, 4 GB RAM, x64, Ubuntu 20.04
Cluster Configuration:
Describe the bug:
The cluster becomes unavailable when the master-nbg1-1 node is taken offline.
This node is the one that initialized the cluster.
Steps To Reproduce:
Internally, all the nodes share a private network.
Expected behavior:
Cluster and deployed things like websites stay available
Actual behavior:
All communication times out.
I have no clue how to fix that; my best bet is that the Traefik ingress is not getting rescheduled onto another master (or maybe I didn't wait long enough, as I just rebooted the machine).
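(For checking this, `kubectl get pods -n kube-system -o wide --watch` shows where the bundled pods are scheduled and whether replacements come up, and `kubectl get nodes` shows whether the downed master is actually marked NotReady.)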