Having trouble installing/accessing Rancher after deploying a cluster with this playbook #227
-
A little while ago (12 commits ago, if that's a good measure of time), I used this playbook to successfully deploy a cluster (3 masters, 2 nodes) and install Rancher on it. This is all on Ubuntu 22.04 cloud-init virtual machines hosted on ESXi 7.0.3. Keep in mind, I am an absolute newbie when it comes to Kubernetes -- I'm setting this up for the purpose of learning. Today though, for some reason, I can't get any further than deploying the cluster. I'm deploying it on 3 master nodes (2 CPUs, 6 GB memory) and 3 agent nodes (4 CPUs, 8 GB memory); they all have sufficient disk space. The playbook does not report any failures, and I have no trouble applying manifests (such as the test Nginx deployment shown in the guide) or otherwise interacting with the cluster. However, each time I've tried to install Rancher with cert-manager (I must have tried nearly a dozen times today), the installation "succeeds", but I can't access Rancher via my external Nginx load balancer, which is configured exactly as documented in Tim's guide for HA Rancher. The troubleshooting guide hasn't helped me resolve my problem, unfortunately, although that mostly makes sense since I don't think this is a problem with the playbook itself.

My inventory looks like this:

[master]
192.168.110.211
192.168.110.212
192.168.110.213
[node]
192.168.110.221
192.168.110.222
192.168.110.223
[k3s_cluster:children]
master
node

Group vars:

---
k3s_version: v1.24.10+k3s1
# this is the user that has ssh access to these machines
ansible_user: svc_conman
systemd_dir: /etc/systemd/system
# Set your timezone
system_timezone: "Etc/UTC"
# interface which will be used for flannel
flannel_iface: "ens192"
# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.110.210"
# k3s_token is required so that masters can talk together securely
# this token should be alpha numeric only
k3s_token: "YeYfdnsEm7sHMq88dNZsWwVTFs9QFEGH9r9XLBEisimPnhmX"
# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'
# Disable the taint manually by setting: k3s_master_taint = false
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"
# these arguments are recommended for servers as well as agents:
extra_args: >-
--flannel-iface={{ flannel_iface }}
--node-ip={{ k3s_node_ip }}
# change these to your liking, the only required are: --disable servicelb, --tls-san {{ apiserver_endpoint }}
extra_server_args: >-
{{ extra_args }}
{{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
--tls-san {{ apiserver_endpoint }}
--disable servicelb
--disable traefik
extra_agent_args: >-
{{ extra_args }}
# image tag for kube-vip
kube_vip_tag_version: "v0.5.7"
# metallb type frr or native
metal_lb_type: "native"
# metallb mode layer2 or bgp
metal_lb_mode: "layer2"
# image tag for metal lb
metal_lb_frr_tag_version: "v7.5.1"
metal_lb_speaker_tag_version: "v0.13.7"
metal_lb_controller_tag_version: "v0.13.7"
# metallb ip range for load balancer
metal_lb_ip_range: "192.168.110.80-192.168.110.100"
proxmox_lxc_configure: false

My external load balancer's IP is 192.168.110.133 -- its configuration follows Tim's HA Rancher guide exactly.
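For context, these are the kinds of quick checks that can confirm what is reachable from the load balancer host -- just a sketch of my own, with the IPs taken from the inventory above:

# Is the kube-vip virtual IP answering on the API server port?
curl -ks https://192.168.110.210:6443/version

# Is anything listening on 80/443 on an agent node (what my Nginx upstreams point at)?
nc -zv 192.168.110.221 80
nc -zv 192.168.110.221 443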
Note: I've tried pointing the upstream servers at my master nodes (rather than the agent nodes), but this hasn't solved my problem. That doesn't exactly surprise me since, as I understand it, the master nodes are set up with taints to prevent them from being assigned any "non-system" pods, and to my understanding, Rancher pods are not "system" pods. Please correct me if I'm wrong.

After I deploy the cluster with the playbook, I install cert-manager:

kubectl create namespace cattle-system
kubectl create namespace cert-manager
kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/v1.7.1/cert-manager.crds.yaml
helm install cert-manager jetstack/cert-manager --namespace cert-manager --version v1.7.1

Best I can tell, this succeeds -- pods are running:

$ kubectl -n cert-manager get pods
NAME READY STATUS RESTARTS AGE
cert-manager-646c67487-bjcps 1/1 Running 0 61s
cert-manager-cainjector-7cb8669d6b-jz5bj 1/1 Running 0 61s
cert-manager-webhook-696c5db7ff-8fdtr 1/1 Running 0 61s

So I move on to installing Rancher (I replace rancher.example.com with my actual hostname):

helm install rancher rancher-stable/rancher --namespace cattle-system --set hostname=rancher.example.com

I wait for the rollout to finish -- this takes about 10 minutes, which honestly feels about twice as long as I remember it taking the first time I succeeded in doing this (the aforementioned 12 commits ago). Not to mention, it takes about 7 of those 10 minutes for the first replica to become available; the other two follow relatively quickly. Nevertheless, the output indicates the rollout is successful:

$ kubectl -n cattle-system rollout status deploy/rancher
Waiting for deployment "rancher" rollout to finish: 0 of 3 updated replicas are available...
Waiting for deployment "rancher" rollout to finish: 1 of 3 updated replicas are available...
Waiting for deployment "rancher" rollout to finish: 2 of 3 updated replicas are available...
deployment "rancher" successfully rolled out However, once I navigate to
Which also seem to be a bit numerous: $ kubectl -n cattle-system get pods
NAME READY STATUS RESTARTS AGE
helm-operation-dt4jr 1/2 Error 0 10m
helm-operation-fqj6p 1/2 Error 0 11m
helm-operation-hrkvp 0/2 Completed 0 10m
helm-operation-j4j5b 0/2 Completed 0 10m
helm-operation-j6btt 0/2 Completed 0 15m
helm-operation-jq8z8 1/2 Error 0 10m
helm-operation-ljwlh 0/2 Completed 0 13m
helm-operation-tf49r 0/2 Completed 0 11m
helm-operation-tsqrp 1/2 Error 0 12m
helm-operation-x4vtw 0/2 Completed 0 10m
rancher-6757f6b675-r2zp9 1/1 Running 0 22m
rancher-6757f6b675-t7kn9 1/1 Running 0 22m
rancher-6757f6b675-xc7qv 1/1 Running 0 22m
rancher-webhook-577b778f8f-dqfsw 1/1 Running 0 10m

The certificate is getting issued properly:

$ kubectl -n cattle-system describe certificate
<snip>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Issuing 19m cert-manager Issuing certificate as Secret does not exist
Normal Generated 19m cert-manager Stored new private key in temporary Secret resource "tls-rancher-ingress-kcdqp"
Normal Requested 19m cert-manager Created new CertificateRequest resource "tls-rancher-ingress-ktlkh"
Normal Issuing 9m49s cert-manager The certificate has been successfully issued

But the issuer might not be as happy:

$ kubectl -n cattle-system describe issuer
<snip>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ErrGetKeyPair 15m (x7 over 21m) cert-manager Error getting keypair for CA issuer: secret "tls-rancher" not found
Warning ErrInitIssuer 15m (x7 over 21m) cert-manager Error initializing issuer: secret "tls-rancher" not found
Normal KeyPairVerified 10m (x2 over 10m) cert-manager Signing CA verified

What's interesting to me is that there is no external IP on the rancher service:

$ kubectl get svc rancher -n cattle-system -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
rancher ClusterIP 10.43.254.134 <none> 80/TCP,443/TCP 25m app=rancher

Is that normal, even though I'm trying to use an external load balancer to expose Rancher? I'd appreciate any help troubleshooting this problem -- if you need any more info, please let me know. Thanks in advance!
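For completeness, here are a couple of further checks I can run if the output would help -- just sketches, in case the issue is on the ingress side rather than with the service itself:

# Does Rancher's ingress resource exist?
kubectl -n cattle-system get ingress

# Is any ingress controller (Traefik, ingress-nginx, ...) running at all?
kubectl get pods -A | grep -iE 'traefik|ingress'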
-
I figured you might want to see the pod logs (see below). Curiously they all seem to be vastly different.
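For reference, this is roughly how I'm pulling them -- a sketch; the label selector comes from the rancher service shown above:

# List the Rancher replicas, then tail one of them
kubectl -n cattle-system get pods -l app=rancher
kubectl -n cattle-system logs rancher-6757f6b675-r2zp9 --tail=100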
-
I believe this is because Rancher is not yet compatible with the newest versions of Kubernetes.
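If that's the suspicion, one way to check is to compare the version the nodes are actually running (the cluster here was deployed with k3s_version v1.24.10+k3s1 per the group vars) against Rancher's support matrix -- a quick sketch:

# Shows the kubelet/k3s version on each node
kubectl get nodes -o wide

# Shows the Rancher image tag actually deployed
kubectl -n cattle-system get deploy rancher -o jsonpath='{.spec.template.spec.containers[0].image}'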
-
After doing some more testing, it appears my problem might stem from the lack of an ingress controller. This playbook explicitly disables the built-in Traefik ingress controller that normally ships with k3s (via --disable traefik). It has always behaved this way as far as I can tell, which is confusing to me because, as I mentioned in my OP, the last time I followed the steps to install the cluster and then install Rancher on it, I never had to set up my own ingress controller -- I was under the impression that the Nginx load balancer Tim sets up in his HA Rancher on Kubernetes guide (which I have set up the same way) would essentially perform that function. However, Rancher isn't bound to or listening on any ports on the nodes themselves, so of course it's not working; Nginx can't communicate with the nodes on port 80 or 443, as evidenced by these errors in my Nginx logs:
As a test, after installing Rancher, I manually modified the service manifest to use …

Does anyone have any ideas? Would @timothystewart6 be able to weigh in? I hope you don't mind the direct mention.
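For anyone following along, this is roughly what installing a standalone ingress controller could look like if that really is the missing piece -- a sketch using the ingress-nginx Helm chart, which is my own choice and not something from the playbook or Tim's guide:

# Add the ingress-nginx chart repo and install the controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# With servicelb disabled, MetalLB (installed by the playbook) should hand
# the controller's LoadBalancer service an IP from 192.168.110.80-100
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

# Then check which external IP the controller received
kubectl -n ingress-nginx get svc ingress-nginx-controller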
You should not use an external load balancer; that's what MetalLB is for. You can choose to use either Traefik or MetalLB. #227 (reply in thread)
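Following that suggestion, a minimal sketch of what exposing Rancher through MetalLB could look like -- this patches the existing ClusterIP service to type LoadBalancer so MetalLB assigns it an address from the pool; it's an illustration of the idea, not a step from the playbook:

# Ask MetalLB for an IP from the 192.168.110.80-100 pool
kubectl -n cattle-system patch svc rancher \
  -p '{"spec":{"type":"LoadBalancer"}}'

# The EXTERNAL-IP column should now show an address instead of <none>
kubectl -n cattle-system get svc rancher -o wide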