Table of Contents
[TOC]
Alertmanager is getting errors trying to send alerts. Alerts will be lost.
Check the Alertmanager logs to find out why it could not send alerts.
In the `gitlab-ops` project of Google Cloud, open the Workloads section under the Kubernetes Engine section of the web console. Select the Alertmanager workload, named `alertmanager-gitlab-monitoring-promethe-alertmanager`. Here you can see details for the Alertmanager pods and select Container logs to review the logs.
The Alertmanager pod logs very little apart from errors, so it should be quickly obvious if it could not contact a service.
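If you prefer the command line, roughly the same logs can be pulled with `kubectl`. This is a sketch only: it assumes your context already points at the cluster in `gitlab-ops`, and the namespace and container name are guesses; adjust them to match the actual deployment.

```
# Sketch: the namespace and container name below are assumptions, adjust as needed.
kubectl --namespace monitoring logs \
  statefulset/alertmanager-gitlab-monitoring-promethe-alertmanager \
  --container alertmanager --tail=200
```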
Note the "integration" label on the alert. If it's only one integration it's probably a problem with the setup of that integration.
For example if it's slack you can get the API key by looking for "infra-automation app Slack API Token" in 1password.
You can test it with curl:

```
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"This is a test."}' \
  https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
```
If you get a 404 response, the channel does not exist. See the Slack docs for other possible error codes.
For more information, see https://api.slack.com/incoming-webhooks
- In Prometheus, run this query: `rate(alertmanager_notifications_failed_total[10m])`.
- This will give you a breakdown of which integration is failing, and from which server (see the aggregation sketch after this list).
- For the slackline, you can view the `alertManagerBridge` cloud function, its logs, and its code (see the `gcloud` sketch after this list).
- Keep in mind that, if nothing has changed, the problem is likely to be on the remote side, for example a Slack or PagerDuty issue.
- Open the Alertmanager UI: https://alerts.gitlab.net/
- Review each alert to check whether its notification has failed and whether further action is required.
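As mentioned in the Prometheus bullet above, the failure rate can be aggregated to make the breakdown easier to read. The query below is a sketch of one possible grouping, assuming the standard `integration` and `instance` labels are present on the metric:

```
# Failed notifications per second, grouped by integration and Alertmanager instance
sum by (integration, instance) (
  rate(alertmanager_notifications_failed_total[10m])
)
```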
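For the slackline bullet above, the `alertManagerBridge` cloud function logs can also be read from the command line. This is a sketch only: it assumes the function is deployed in the `gitlab-ops` project and that you have permission to read its logs; depending on where it is deployed, a `--region` flag may also be required.

```
# Sketch: the project (and possibly --region) are assumptions, adjust as needed.
gcloud functions logs read alertManagerBridge --project=gitlab-ops --limit=50
```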