Table of Contents
[TOC]
Our metrics and notification systems are all configurable to help engineers be aware of the status of our environments. When you are creating, removing, or changing notification parameters, keep these questions in mind:
- Who needs to see this notification?
- What actions are expected of the recipient of this notification?
- What level of immediacy is required to this notification?
- Is this notification a strong indicator of a problem, or just a likely indicator that something may be wrong?
- A video walkthrough of this runbook entry
- Tuning Camoproxy's Loadbalancer SLI Demo
- An impatient SRE's guide to deleting alerts
- Apdex alerts troubleshooting
- Alerting Manual
The service catalog contains team definitions and service definitions.
The metrics catalog is where services and their service level idicators can be changed.
There are many reasons to alter the existing configuration for notifications: - Too many false positive notifications - Un-actionable notifications - The notifcation is not a real problem
These configurations should be reviewed often and updated when neccessary. The following sections mostly describe the values you can look to for tuning notifications derived from the metrics catalog. The metrics catalog README has a good breakdown of the structure of a service definition for these parameters below.
SLI components can have a specific severity defined. Sometimes the alert is important enough to go to Slack (Sev 3 or 4), but not important enough to require notifying an on-call in via page (Sev 1 or 2). Below is a snippet of an SLI that is set to appear as a Slack notification, but not page.
serviceLevelIndicators: {
sentry_events: {
severity: 's3',
userImpacting: false,
Examples
- Downgrade Sentry SLO service alerts to not page
- Adjusting the Severity for archive replicas
- Turning blackbox notifications into slack only
SLI component selectors can allow a metric to exclude or include metric labels.
For example, this selector definition for the frontend service excludes canary, websockets, and api_rate_limit backends from the apdex.
selector='type="frontend", backend_name!~"canary_.*|api_rate_limit|websockets"'
Examples
For an Apdex, the tolerated and satisfied thresholds can be changed to better match the expected latency of service requests.
Examples
Modifying the monitoringThresholds apdexScore value will alter the Apdex threshold for the service as a whole.
Examples
This is similar to the apdexScore but is a value for how many errors are tolerated for the service as a whole.
Examples
It is also quite reasonable to consider removing metrics, stopping notifications, or lowering their severity.
Examples