metrics-server is unable to scrape node when using custom bind-address #10749

Closed · 2tefan opened this issue Aug 23, 2024 · 15 comments

2tefan commented Aug 23, 2024

Environmental Info:
K3s Version:

It happens on the currently newest available version:

k3s version v1.30.4+k3s1 (98262b5d)
go version go1.22.5

But this issue first occurred in this version:

k3s-1.29.6 version v1.29.6+k3s1 (83ae095a)
go version go1.21.11

This is the last version it worked with:

k3s version v1.29.5+k3s1 (4e53a323)
go version go1.21.9

I think this behaviour was introduced by #10019

Node(s) CPU architecture, OS, and Version:

❯ uname -a
Linux linux-spisjak 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug  2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
❯ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 24.04 LTS
Release:	24.04
Codename:	noble

Cluster Configuration:

Single-node k3s cluster with the following config:

❯ cat /etc/rancher/k3s/config.yaml
advertise-address: "100.64.0.1"
bind-address: "100.64.0.1"
data-dir: "/tmp/test-k3s"

Describe the bug:

Starting with k3s v1.29.6+k3s1, when a custom bind-address is specified, the metrics-server is unable to connect to the kubelet metrics endpoint.

Logs of metrics server before `v1.29.6+k3s1` with this config:
I0823 06:59:41.729162       1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0823 06:59:42.063000       1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0823 06:59:42.169382       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0823 06:59:42.169407       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0823 06:59:42.169404       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0823 06:59:42.169404       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0823 06:59:42.169424       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0823 06:59:42.169427       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0823 06:59:42.169993       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0823 06:59:42.170906       1 secure_serving.go:213] Serving securely on [::]:10250
I0823 06:59:42.170960       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0823 06:59:42.269809       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0823 06:59:42.269819       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0823 06:59:42.269897       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0823 06:59:42.830673       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 06:59:43.817880       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 06:59:43.831413       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 06:59:45.816043       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 06:59:47.814920       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 06:59:49.818610       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 06:59:51.813792       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 06:59:53.815532       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 06:59:55.816636       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

Kubelet metrics are available at *:10250 before v1.29.6+k3s1 with this config:

❯ sudo ss -tulpn | grep 10250
tcp   LISTEN 0      4096                                           *:10250            *:*    users:(("k3s-server",pid=1623615,fd=178))

On v1.29.6+k3s1 and newer, the socket binds to <bind-address>:10250 instead:

❯ sudo ss -tulpn | grep 10250
tcp   LISTEN 0      4096                                  100.64.0.1:10250      0.0.0.0:*    users:(("k3s-server",pid=1774664,fd=214))
This leaves the metrics-server unable to scrape the node, as it tries to connect via the node IP:
I0823 07:09:45.054638       1 serving.go:374] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0823 07:09:45.432437       1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
E0823 07:09:45.536256       1 scraper.go:149] "Failed to scrape node" err="Get \"https://172.20.32.116:10250/metrics/resource\": dial tcp 172.20.32.116:10250: connect: connection refused" node="linux-spisjak"
I0823 07:09:45.537788       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0823 07:09:45.537799       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0823 07:09:45.537815       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0823 07:09:45.537811       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0823 07:09:45.537830       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0823 07:09:45.537803       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0823 07:09:45.538364       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0823 07:09:45.539489       1 secure_serving.go:213] Serving securely on [::]:10250
I0823 07:09:45.539544       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0823 07:09:45.612030       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:45.638045       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0823 07:09:45.638050       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0823 07:09:45.638093       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0823 07:09:45.972414       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:46.615794       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:47.972185       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:49.973066       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:51.974140       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:53.972596       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:55.972641       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:57.972040       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:09:59.973211       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0823 07:10:00.539378       1 scraper.go:149] "Failed to scrape node" err="Get \"https://172.20.32.116:10250/metrics/resource\": dial tcp 172.20.32.116:10250: connect: connection refused" node="linux-spisjak"
I0823 07:10:01.971316       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:10:03.971801       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:10:05.972014       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:10:07.970799       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:10:09.972003       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:10:11.972176       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0823 07:10:13.970864       1 server.go:191] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0823 07:10:15.537916       1 scraper.go:149] "Failed to scrape node" err="Get \"https://172.20.32.116:10250/metrics/resource\": dial tcp 172.20.32.116:10250: connect: connection refused" node="linux-spisjak"

Steps To Reproduce:

Download k3s v1.29.6+k3s1 or newer:

mkdir test-k3s
cd test-k3s
wget https://github.com/k3s-io/k3s/releases/download/v1.29.6%2Bk3s1/k3s
chmod u+x k3s

Use this config (/etc/rancher/k3s/config.yaml):

advertise-address: "100.64.0.1"
bind-address: "100.64.0.1"
data-dir: "/tmp/test-k3s"

Add IP to loopback interface

sudo ip addr add 100.64.0.1 dev lo

Verify it:

❯ ip -br a | grep lo
lo               UNKNOWN        127.0.0.1/8 100.64.0.1/32 ::1/128

Start k3s:

sudo ./k3s server

Check that it binds to the bind-address:

ss -tulpn | grep 10250

View logs of metrics-server:

❯ kubectl -n kube-system logs deployments/metrics-server
[...]
E0823 07:15:00.538805       1 scraper.go:149] "Failed to scrape node" err="Get \"https://172.20.32.116:10250/metrics/resource\": dial tcp 172.20.32.116:10250: connect: connection refused" node="linux-spisjak"
[...]

Expected behavior:

I would expect the metrics-server to use the bind-address rather than the node IP (in this case 172.20.32.116).

Actual behavior:

The metrics-server uses the node IP.

brandond (Member) commented Aug 29, 2024

  • advertise-address only applies to the apiserver, not the kubelet.
  • bind-address now reliably configures the bind address for managed components, in the absence of a configured node-ip.
  • The metrics-server scrapes the node IP, using the value that shows up in kubectl get node -o wide. If you're overriding the bind-address, you should set node-ip to the same value (see the sketch below).
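
A sketch of what that could look like in the reporter's /etc/rancher/k3s/config.yaml, reusing the addresses from the example above (an illustration, not an official recommendation):

advertise-address: "100.64.0.1"
bind-address: "100.64.0.1"
node-ip: "100.64.0.1"      # matches bind-address, so the node registers (and is scraped at) the same address
data-dir: "/tmp/test-k3s"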

fmunteanu commented Dec 11, 2024

@brandond, I'm looking at the server documentation and I cannot see a node-ip option. Is it inherited automatically from the agent node-ip (in other words, can I define it in both server and agent config.yaml), or do I have to specifically define a bind-address on servers?

brandond (Member) commented Dec 11, 2024

Second sentence of the section you linked:

Note that servers also run an agent, so all of the configuration options listed in the k3s agent documentation are also supported on servers.

Don't try to run both agent and server on the same host. As that says, servers also include agent functionality.

fmunteanu commented Dec 11, 2024

Thank you @brandond, I'm running the server on 3 control-planes and the agent on 5 workers. I'll set node-ip: specific-node-ip-address-here in each node's /etc/rancher/k3s/config.yaml.

brandond (Member):

Yes

fmunteanu commented Dec 11, 2024

I just tried to set node-ip in the server config.yaml and it complains; I cannot start the service. It looks like I have to set bind-address on the servers and node-ip on the agent hosts.

Edit: furthermore, bind-address only sets the supervisor API bind address. To fix the controller and scheduler connection failures I see in VictoriaMetrics, I believe I also need to add the following to the server config.yaml:

kube-controller-manager-arg: bind-address=IP_ADDRESS
kube-scheduler-arg: bind-address=IP_ADDRESS
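
A minimal sketch of how that could be laid out in the server's /etc/rancher/k3s/config.yaml, where IP_ADDRESS stands for the node's own address (my assumption of the intended form, not a verified fix):

kube-controller-manager-arg:
  - "bind-address=IP_ADDRESS"   # controller-manager metrics endpoint
kube-scheduler-arg:
  - "bind-address=IP_ADDRESS"   # scheduler metrics endpoint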

Sample warnings I get in the vmagent logs, with cannot read data from https://IP_ADDRESS:10250/metrics, which directed me to this ticket:

2024-12-11T04:34:33.572Z warn VictoriaMetrics/lib/promscrape/scrapework.go:383 cannot scrape target "https://192.168.4.3:10250/metrics" ({beta_kubernetes_io_arch="arm64",beta_kubernetes_io_os="linux",instance="boreas",job="kubelet",kubernetes_io_arch="arm64",kubernetes_io_hostname="boreas",kubernetes_io_os="linux",metrics_path="/metrics",node="boreas",node_role_kubernetes_io_control_plane="true",node_role_kubernetes_io_etcd="true",node_role_kubernetes_io_master="true"}) 1 out of 4 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: cannot perform request to "https://192.168.4.3:10250/metrics": Get "https://192.168.4.3:10250/metrics": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2024-12-11T04:34:53.730Z warn VictoriaMetrics/lib/promscrape/scrapework.go:383 cannot scrape target "https://192.168.4.5:10250/metrics/cadvisor" ({beta_kubernetes_io_arch="arm64",beta_kubernetes_io_os="linux",instance="chaos",job="kubelet",kubernetes_io_arch="arm64",kubernetes_io_hostname="chaos",kubernetes_io_os="linux",metrics_path="/metrics/cadvisor",node="chaos"}) 1 out of 3 times during -promscrape.suppressScrapeErrorsDelay=0s; the last error: cannot read data from https://192.168.4.5:10250/metrics/cadvisor: context deadline exceeded (Client.Timeout or context cancellation while reading body)

Also, I see the supervisor-metrics: true option has not been added to the documentation yet; #9951 is already merged and we run 1.31.3+k3s1 as the stable version.

brandond (Member) commented Dec 11, 2024

Why are you setting all these things? What are you trying to do? If you just leave node-ip and bind-address unset, it should do the correct thing. The supervisor metrics are not even related to the kubelet metrics you appear to be trying to scrape.

Make sure port 10250 is open between all your nodes, and that they can reach each other at their listed addresses.
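
A quick way to check reachability (a sketch, using one of the addresses from the vmagent warnings above; a 401 still proves the port is open, since the kubelet rejects unauthenticated requests):

curl -sk -o /dev/null -w '%{http_code}\n' https://192.168.4.3:10250/metrics/resource
# prints 401 when the port is reachable (auth required); 000 when the connection fails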

fmunteanu commented Dec 11, 2024

@brandond I need to set the bind-address because I'm running HAProxy on two of the control-planes.

# kubectl get nodes -o wide
NAME     STATUS   ROLES                       AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
apollo   Ready    control-plane,etcd,master   40m   v1.31.3+k3s1   192.168.4.2   <none>        Ubuntu 24.04.1 LTS   6.8.0-1016-raspi   containerd://1.7.23-k3s2
boreas   Ready    control-plane,etcd,master   38m   v1.31.3+k3s1   192.168.4.3   <none>        Ubuntu 24.04.1 LTS   6.8.0-1016-raspi   containerd://1.7.23-k3s2
cerus    Ready    control-plane,etcd,master   37m   v1.31.3+k3s1   192.168.4.4   <none>        Ubuntu 24.04.1 LTS   6.8.0-1016-raspi   containerd://1.7.23-k3s2
chaos    Ready    <none>                      36m   v1.31.3+k3s1   192.168.4.5   <none>        Ubuntu 24.04.1 LTS   6.8.0-1015-raspi   containerd://1.7.23-k3s2
crios    Ready    <none>                      36m   v1.31.3+k3s1   192.168.4.6   <none>        Ubuntu 24.04.1 LTS   6.8.0-1016-raspi   containerd://1.7.23-k3s2
helios   Ready    <none>                      36m   v1.31.3+k3s1   192.168.4.7   <none>        Ubuntu 24.04.1 LTS   6.8.0-1016-raspi   containerd://1.7.23-k3s2
hermes   Ready    <none>                      36m   v1.31.3+k3s1   192.168.4.8   <none>        Ubuntu 24.04.1 LTS   6.8.0-1015-raspi   containerd://1.7.23-k3s2
hypnos   Ready    <none>                      36m   v1.31.3+k3s1   192.168.4.9   <none>        Ubuntu 24.04.1 LTS   6.8.0-1015-raspi   containerd://1.7.23-k3s2

I think I know where the issue is; port 10250 is properly bound on all server nodes:

# ss -tulpn | grep 10250
tcp   LISTEN 0      4096        192.168.4.2:10250      0.0.0.0:*    users:(("k3s-server",pid=5576,fd=203))

On agent nodes it is not bound to a specific IP address; VictoriaMetrics is trying to connect to 192.168.4.5:10250 (I presume *:10250 means connections are accepted on any IP at port 10250):

# ss -tulpn | grep 10250
tcp   LISTEN 0      4096                  *:10250            *:*    users:(("k3s-agent",pid=2759,fd=40))

I was wondering what the correct way is to bind port 10250 to a specific IP on the agent nodes. Here is a config.yaml example, created on each agent:

# cat /etc/rancher/k3s/config.yaml
node-ip: 192.168.4.5
node-taint:
  - node.cilium.io/agent-not-ready:NoExecute
server: https://192.168.4.10:6443   # HAProxy load balancer
token: redacted

Looking at metrics-server logs, I see several of these errors:

E1211 17:25:41.592599 1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
E1211 17:25:41.592225 1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
E1211 17:25:41.593836 1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
E1211 17:25:41.593941 1 timeout.go:142] post-timeout activity - time-elapsed: 4.029774ms, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E1211 17:25:41.593834 1 timeout.go:142] post-timeout activity - time-elapsed: 7.653958ms, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E1211 17:25:41.593852 1 timeout.go:142] post-timeout activity - time-elapsed: 9.056043ms, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E1211 17:25:42.416837 1 timeout.go:142] post-timeout activity - time-elapsed: 6.445µs, GET "/apis" result: <nil>
E1211 17:25:43.788140 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.8:10250/metrics/resource\": read tcp 10.42.0.16:35522->192.168.4.8:10250: read: connection reset by peer" node="hermes"
E1211 17:25:43.798460 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.2:10250/metrics/resource\": read tcp 10.42.0.16:53888->192.168.4.2:10250: read: connection reset by peer" node="apollo"
E1211 17:25:43.807344 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.6:10250/metrics/resource\": read tcp 10.42.0.16:52604->192.168.4.6:10250: read: connection reset by peer" node="crios"
E1211 17:25:43.808061 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.7:10250/metrics/resource\": read tcp 10.42.0.16:50914->192.168.4.7:10250: read: connection reset by peer" node="helios"
E1211 17:25:43.811865 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.9:10250/metrics/resource\": read tcp 10.42.0.16:32952->192.168.4.9:10250: read: connection reset by peer" node="hypnos"

BTW, thank you for taking the time to provide support to everyone for many k3s related issues, it is really appreciated.

brandond (Member):

It looks like it's listening on the correct ports; however, something is rejecting connections to that port on the nodes from whatever node that pod is running on.

If you have local firewalls on the nodes (ufw/firewalld) or security groups, make sure that port 10250 is open.

fmunteanu commented Dec 11, 2024

The firewall is disabled; I'm running Ubuntu 24.04 LTS on all nodes. I'm going to set an address via kubelet-arg and see if anything changes.
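
Presumably something along these lines in the agent's /etc/rancher/k3s/config.yaml (my guess at what "set an address via kubelet-arg" means; the address shown is just this node's IP from the output above):

kubelet-arg:
  - "address=192.168.4.5"   # IP for the kubelet to listen on (kubelet default is 0.0.0.0)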

fmunteanu commented Dec 11, 2024

@brandond this actually breaks the service, but it gives some clarity on where the issue might be:

Error getting the current node from lister" err="node \"chaos\" not found"

Ubuntu is resolving the hostname to 127.0.1.1:

# host chaos
chaos.local has address 127.0.1.1
# cat /etc/hostname
chaos
# cat /etc/hosts
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.debian.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
#     /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 chaos.local chaos
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

How would you deal with the 127.0.1.1 mapping? Can I make the "current node from lister" lookup resolve to localhost?

Systemd service output:

# journalctl -xeu k3s-agent.service
Dec 11 15:37:09 chaos k3s[3189]: I1211 15:37:09.501506    3189 server.go:163] "Starting to listen" address="192.168.4.5" port=10250
Dec 11 15:37:09 chaos k3s[3189]: I1211 15:37:09.502197    3189 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
Dec 11 15:37:09 chaos k3s[3189]: E1211 15:37:09.502477    3189 server.go:886] "Failed to start healthz server" err="listen tcp 127.0.0.1:10248: bind: address already in use"
Dec 11 15:37:09 chaos k3s[3189]: I1211 15:37:09.503076    3189 server.go:236] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
Dec 11 15:37:09 chaos k3s[3189]: I1211 15:37:09.505196    3189 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Dec 11 15:37:09 chaos k3s[3189]: I1211 15:37:09.505952    3189 dynamic_serving_content.go:135] "Starting controller" name="kubelet-server-cert-files::/var/lib/rancher/k3s/agent/serving-kubelet.crt::/var/lib/rancher/k3s/agent/serving-kubelet.key"
Dec 11 15:37:09 chaos k3s[3189]: I1211 15:37:09.507847    3189 server.go:460] "Adding debug handlers to kubelet server"
Dec 11 15:37:09 chaos k3s[3189]: E1211 15:37:09.509097    3189 kubelet_node_status.go:453] "Error getting the current node from lister" err="node \"chaos\" not found"
Dec 11 15:37:09 chaos k3s[3189]: I1211 15:37:09.509284    3189 volume_manager.go:289] "Starting Kubelet Volume Manager"
Dec 11 15:37:09 chaos k3s[3189]: I1211 15:37:09.510191    3189 desired_state_of_world_populator.go:146] "Desired state populator starts to run"
Dec 11 15:37:09 chaos k3s[3189]: E1211 15:37:09.510379    3189 server.go:185] "Failed to listen and serve" err="listen tcp 192.168.4.5:10250: bind: address already in use"

fmunteanu commented Dec 11, 2024

Removing the node-ip from agent configuration fixes the issue:

Dec 11 16:48:22 chaos k3s[16812]: time="2024-12-11T16:48:22-05:00" level=info msg="Starting k3s agent v1.31.3+k3s1 (6e6af988)"
Dec 11 16:48:22 chaos k3s[16812]: time="2024-12-11T16:48:22-05:00" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 192.168.4.10:6443"
Dec 11 16:48:22 chaos k3s[16812]: time="2024-12-11T16:48:22-05:00" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 192.168.4.3:6443"
Dec 11 16:48:22 chaos k3s[16812]: time="2024-12-11T16:48:22-05:00" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 192.168.4.4:6443"
Dec 11 16:48:22 chaos k3s[16812]: time="2024-12-11T16:48:22-05:00" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 192.168.4.2:6443"
Dec 11 16:48:22 chaos k3s[16812]: time="2024-12-11T16:48:22-05:00" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 192.168.4.10:6443"
Dec 11 16:48:22 chaos k3s[16812]: time="2024-12-11T16:48:22-05:00" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [192.168.4.2:6443 192.168.4.3:6443 192.168.4.4:6443] [default: 192.168.4.10:6443]"

The metrics-server logs are still complaining:

E1211 21:35:23.707860 1 timeout.go:142] post-timeout activity - time-elapsed: 3.330204ms, GET "/openapi/v2" result: <nil>
E1211 21:35:24.499544 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.4:10250/metrics/resource\": read tcp 10.42.0.28:45144->192.168.4.4:10250: read: connection reset by peer" node="cerus"
E1211 21:35:24.499874 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.2:10250/metrics/resource\": read tcp 10.42.0.28:44572->192.168.4.2:10250: read: connection reset by peer" node="apollo"
E1211 21:35:24.510387 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.3:10250/metrics/resource\": read tcp 10.42.0.28:37614->192.168.4.3:10250: read: connection reset by peer" node="boreas"
E1211 21:35:24.522427 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.7:10250/metrics/resource\": read tcp 10.42.0.28:41410->192.168.4.7:10250: read: connection reset by peer" node="helios"
E1211 21:35:24.524486 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.9:10250/metrics/resource\": read tcp 10.42.0.28:38166->192.168.4.9:10250: read: connection reset by peer" node="hypnos"
E1211 21:35:24.531330 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.8:10250/metrics/resource\": read tcp 10.42.0.28:45956->192.168.4.8:10250: read: connection reset by peer" node="hermes"
E1211 21:35:24.531466 1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.4.6:10250/metrics/resource\": read tcp 10.42.0.28:48042->192.168.4.6:10250: read: connection reset by peer" node="crios"
E1211 21:35:29.854647 1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
E1211 21:35:29.854735 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E1211 21:35:29.855961 1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
E1211 21:35:29.856040 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E1211 21:35:29.856587 1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
E1211 21:35:29.856685 1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
E1211 21:35:29.857047 1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
E1211 21:35:29.857155 1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
E1211 21:35:29.858052 1 timeout.go:142] post-timeout activity - time-elapsed: 3.649203ms, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>
E1211 21:35:29.858345 1 writers.go:122] apiserver was unable to write a JSON response: http: Handler timeout
E1211 21:35:29.858525 1 timeout.go:142] post-timeout activity - time-elapsed: 2.443561ms, GET "/apis/metrics.k8s.io/v1beta1" result: <nil>

I get a 401:

$ curl -Ik https://192.168.4.5:10250/metrics/resource
HTTP/2 401
content-type: text/plain; charset=utf-8
content-length: 12
date: Wed, 11 Dec 2024 21:55:01 GMT
$ curl -k https://192.168.4.5:10250/metrics/resource
Unauthorized

$ curl -sX GET http://127.0.0.1:8001/api/v1/nodes/chaos/proxy/configz | jq .
{
  "kubeletconfig": {
    "enableServer": true,
    "staticPodPath": "/var/lib/rancher/k3s/agent/pod-manifests",
    "podLogsDir": "/var/log/pods",
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCertFile": "/var/lib/rancher/k3s/agent/serving-kubelet.crt",
    "tlsPrivateKeyFile": "/var/lib/rancher/k3s/agent/serving-kubelet.key",
    "authentication": {
      "x509": {
        "clientCAFile": "/var/lib/rancher/k3s/agent/client-ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 50,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "10.43.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageMaximumGCAge": "0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "promiscuous-bridge",
    "maxPods": 110,
    "podPidsLimit": -1,
    "resolvConf": "/run/systemd/resolve/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": false,
    "evictionHard": {
      "imagefs.available": "15%",
      "imagefs.inodesFree": "5%",
      "memory.available": "500Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionSoft": {
      "imagefs.available": "30%",
      "imagefs.inodesFree": "10%",
      "memory.available": "1Gi",
      "nodefs.available": "30%",
      "nodefs.inodesFree": "10%"
    },
    "evictionSoftGracePeriod": {
      "imagefs.available": "2m",
      "imagefs.inodesFree": "2m",
      "memory.available": "1m30s",
      "nodefs.available": "2m",
      "nodefs.inodesFree": "2m"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "evictionMaxPodGracePeriod": 60,
    "evictionMinimumReclaim": {
      "imagefs.available": "15%",
      "imagefs.inodesFree": "5%",
      "memory.available": "500Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "enableControllerAttachDetach": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "failSwapOn": false,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "containerLogMaxWorkers": 1,
    "containerLogMonitorInterval": "10s",
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "systemReserved": {
      "cpu": "1",
      "ephemeral-storage": "5Gi",
      "memory": "1Gi"
    },
    "enforceNodeAllocatable": [
      "pods"
    ],
    "allowedUnsafeSysctls": [
      "net.ipv4.ip_forward",
      "net.ipv6.conf.all.forwarding"
    ],
    "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 0,
      "options": {
        "text": {
          "infoBufferSize": "0"
        },
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerWithTaints": [
      {
        "key": "node.cilium.io/agent-not-ready",
        "effect": "NoExecute"
      }
    ],
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/k3s/containerd/containerd.sock",
    "failCgroupV1": false
  }
}

fmunteanu commented Dec 13, 2024

@brandond I figured out where the issue is: K3s disables kubelet anonymous authentication. Also, I decided to disable the bundled metrics-server and use a standalone deployment, which fixes all the Failed to scrape node errors; the metrics-server logs are clean now.
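
For reference, the bundled metrics-server can be turned off from the server's /etc/rancher/k3s/config.yaml (a sketch; the standalone deployment is installed separately):

disable:
  - metrics-server   # skip deploying the packaged metrics-server manifest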

I still have an issue scraping with VictoriaMetrics, only on the servers: I get 403s instead of 401s when I enable anonymous-auth via kubelet-arg. This is temporary, until I work out the details with the VictoriaMetrics people. I posted all the details in this comment. I'm going to open a new K3s issue if that is OK with you. Hopefully your ninja skills will open my eyes to where the 403-related problem is.

brandond (Member):

Don't turn on anonymous auth. That is the wrong way to solve the problem. Metrics-server and other scrapers should use service accounts to authenticate.
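
As an illustration only (not something from this thread), a Prometheus-compatible kubelet scrape job that authenticates with the pod's service account token rather than anonymous auth could look like this; the service account still needs RBAC access to the kubelet API (for example via the system:kubelet-api-admin ClusterRole or an equivalent):

scrape_configs:
  - job_name: kubelet              # example job name
    scheme: https
    kubernetes_sd_configs:
      - role: node                 # one target per node, on the kubelet port
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true   # kubelet serving certs may not chain to this CA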

fmunteanu:

Yes, I definitely don't want to do that. I simply wanted to see if there is an additional connectivity problem that anonymous auth might mask. I'll open a new issue and post all technical details in there.
