
Grafana dashboard API server availability (30d) is showing more than 100% #2349

Open
ssharma2089 opened this issue Feb 13, 2024 · 1 comment
I have upgraded kube-prometheus from v0.12 to v0.13, and API server availability is now showing more than 100%. In the previous version it was shown correctly.

Attaching an image of the dashboard:

[Screenshot: grafana_apiserver]
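
If it helps with triage: as far as I understand, the 30d availability panel is backed by the apiserver_request:availability30d recording rule from the kubernetes-mixin (the rule name is my assumption, not verified against this exact release), so the raw value behind the panel can be checked directly in Prometheus:

# Raw ratio behind the panel; any value above 1 means the recording rules are over-counting requests.
apiserver_request:availability30d{verb="all"}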

Environment

  • Kubernetes version information:
    Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.7+k3s1", GitCommit:"8432d7f239676dfe8f748c0c2a3fabf8cf40a826", GitTreeState:"clean", BuildDate:"2022-02-24T23:03:47Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.8+k3s2", GitCommit:"02fcbd1f57f0bc0ca1dc68f98cfa0e7d3b008225", GitTreeState:"clean", BuildDate:"2023-12-07T02:48:20Z", GoVersion:"go1.20.11", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

k3s

  • Prometheus Logs:

ts=2024-02-13T11:59:38.233Z caller=main.go:585 level=info msg="Starting Prometheus Server" mode=server version="(version=2.46.0, branch=HEAD, revision=cbb69e51423565ec40f46e74f4ff2dbb3b7fb4f0)"
ts=2024-02-13T11:59:38.233Z caller=main.go:590 level=info build_context="(go=go1.20.6, platform=linux/amd64, user=root@42454fc0f41e, date=20230725-12:31:24, tags=netgo,builtinassets,stringlabels)"
ts=2024-02-13T11:59:38.233Z caller=main.go:591 level=info host_details="(Linux 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 prometheus-k8s-0 (none))"
ts=2024-02-13T11:59:38.233Z caller=main.go:592 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2024-02-13T11:59:38.233Z caller=main.go:593 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2024-02-13T11:59:38.235Z caller=web.go:563 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2024-02-13T11:59:38.236Z caller=main.go:1026 level=info msg="Starting TSDB ..."
ts=2024-02-13T11:59:38.237Z caller=tls_config.go:274 level=info component=web msg="Listening on" address=[::]:9090
ts=2024-02-13T11:59:38.238Z caller=head.go:595 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2024-02-13T11:59:38.260Z caller=head.go:676 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=4.959µs
ts=2024-02-13T11:59:38.260Z caller=head.go:684 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2024-02-13T11:59:38.260Z caller=tls_config.go:313 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2024-02-13T11:59:38.261Z caller=head.go:755 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
ts=2024-02-13T11:59:38.261Z caller=head.go:792 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=64.923µs wal_replay_duration=591.788µs wbl_replay_duration=142ns total_replay_duration=702.782µs
ts=2024-02-13T11:59:38.261Z caller=main.go:1047 level=info fs_type=XFS_SUPER_MAGIC
ts=2024-02-13T11:59:38.261Z caller=main.go:1050 level=info msg="TSDB started"
ts=2024-02-13T11:59:38.261Z caller=main.go:1231 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2024-02-13T11:59:38.290Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-cadvisor msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.291Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-pods msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.291Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/alertmanager-main/1 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.291Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/coredns/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.292Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kafka-service-monitor/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.292Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kube-apiserver/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.292Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-services msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.292Z caller=kubernetes.go:329 level=info component="discovery manager notify" discovery=kubernetes config=config-0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:38.432Z caller=main.go:1268 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=170.302838ms db_storage=1.527µs remote_storage=1.349µs web_handler=561ns query_engine=792ns scrape=316.7µs scrape_sd=2.107591ms notify=25.468µs notify_sd=176.505µs rules=139.444513ms tracing=10.148µs
ts=2024-02-13T11:59:38.432Z caller=main.go:1011 level=info msg="Server is ready to receive web requests."
ts=2024-02-13T11:59:38.432Z caller=manager.go:1009 level=info component="rule manager" msg="Starting rule manager..."
ts=2024-02-13T11:59:42.826Z caller=main.go:1231 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2024-02-13T11:59:42.849Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kube-state-metrics/1 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.851Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/coredns/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.852Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kube-apiserver/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.852Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-pods msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.853Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=serviceMonitor/monitoring/kafka-service-monitor/0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.854Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-cadvisor msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.854Z caller=kubernetes.go:329 level=info component="discovery manager scrape" discovery=kubernetes config=kubernetes-services msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:42.854Z caller=kubernetes.go:329 level=info component="discovery manager notify" discovery=kubernetes config=config-0 msg="Using pod service account via in-cluster config"
ts=2024-02-13T11:59:43.006Z caller=main.go:1268 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=179.767429ms db_storage=2.059µs remote_storage=1.432µs web_handler=667ns query_engine=978ns scrape=65.891µs scrape_sd=5.288807ms notify=16.27µs notify_sd=171.972µs rules=150.995181ms tracing=6.049µs


shun095 commented Sep 8, 2024

I ran into a similar problem in a k3s home-lab environment.
As in the following issue, adding {job="apiserver"} in the section below solves the problem:
#2465

(The other terms in the formula are filtered by {job="apiserver"}, but apiserver_request_sli_duration_seconds_bucket is the only one that is not.)

https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml

$ git diff | cat
diff --git a/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml b/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml
index 27399b19..e978af06 100644
--- a/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml
+++ b/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml
+++ b/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/kube-apiserver-availability.rules.yaml
@@ -82,7 +82,7 @@ spec:
           {{- toYaml . | nindent 8 }}
         {{- end }}
       {{- end }}
-    - expr: sum by ({{ range $.Values.defaultRules.additionalAggregationLabels }}{{ . }},{{ end }}cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket[1h]))
+    - expr: sum by ({{ range $.Values.defaultRules.additionalAggregationLabels }}{{ . }},{{ end }}cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h]))
       record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
       {{- if or .Values.defaultRules.additionalRuleLabels .Values.defaultRules.additionalRuleGroupLabels.kubeApiserverAvailability }}
       labels:

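For reference, with no additionalAggregationLabels configured, the patched rule would render to roughly the following (my reading of the template above, not a verified render of the chart):

# record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
sum by (cluster, verb, scope, le) (increase(apiserver_request_sli_duration_seconds_bucket{job="apiserver"}[1h]))
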
K3s runs several Kubernetes components in a single process, so the metrics of those components are duplicated across different Prometheus jobs. That is why this filter was important in my case.
k3s-io/k3s#2262

In my k3s cluster, the apiserver_request_sli_duration_seconds_bucket metric is collected by the following five jobs (and availability was reported at around 500% :) ); the query after the list confirms this:

  • kube-proxy
  • kube-controller-manager
  • kube-scheduler
  • apiserver
  • kubelet
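
A simple way to check whether a cluster is affected (an ad-hoc query of mine, not part of the shipped rules):

# Series count per scrape job; more than one job exposing this metric means the unfiltered increase() double-counts requests.
count by (job) (apiserver_request_sli_duration_seconds_bucket)

On my k3s cluster this returns the five jobs above; on a kubeadm cluster it should presumably return only job="apiserver".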

I guess the current latest code, which omits the job="apiserver" filter, probably doesn't cause problems in a normal kubeadm cluster, which may be why this issue hasn't received much attention.
(I haven't verified this, because I don't have a kubeadm-built cluster to test with.)
