How should SLO alerts give details about the failing functions? #51
-
Context

As we are dogfooding the implementation of autometrics in our own API for Service Level Objective alerts, we noticed that the alerts generated by the current rules in our generated rules file are hard to act on. The main pain point is that the alerts use "global" thresholds that accumulate latency and success rate data across all the functions in the same group, so when an alert is triggered, it is hard to tell which functions are contributing the most to it. For context, when one of these alerts triggers, the notification we receive (on Slack) never mentions any function names in its metadata.
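For illustration, here is a greatly simplified, hypothetical sketch of the shape such a generated success-rate alert takes (the actual generated rules are more elaborate, using multi-window, multi-burn-rate expressions). The metric and label names below (`function_calls_total`, `result`, `objective_name`, `objective_percentile`) follow the autometrics spec and may differ from the version you are running; the objective name "api", the alert name, and the thresholds are made up. The point is only that the aggregation has no `by (function)` clause, so the firing alert carries no function names:

```yaml
# Hypothetical, heavily simplified success-rate alert for a 99% objective
# named "api". Both sides of the division aggregate over *all* functions in
# the objective, so the resulting alert has no per-function labels.
- alert: HighErrorRateSLO
  expr: |
    (
      sum(rate(function_calls_total{objective_name="api", objective_percentile="99", result="error"}[1h]))
      /
      sum(rate(function_calls_total{objective_name="api", objective_percentile="99"}[1h]))
    )
    > (14.4 * (1 - 0.99))
  for: 5m
  labels:
    severity: page
```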
Ideas

We are discussing internally what could be changed to produce better alert messages that help kickstart the debugging process whenever an alert is triggered, so that autometrics also provides value in incident response; but we welcome any suggestion, and for that purpose we will continue the conversation here. Here are a few ideas that have been thrown around.

More granular alerting rules

We thought about changing the generated alerting rule so that it sums the latency/success rate separately for each function in the service, instead of observing the metrics for the complete set of functions. This way we would get the function names "for free" in the metadata of the alert. The main reason we do not like this idea is that it breaks the principle that "SLO alerts trigger when the service as a whole struggles": alerts would not trigger if each function in the service individually ate 99.9999...% of the error budget of the SLO (even though, as soon as you have 10 functions in the service, the service as a whole would be 899.99999...% over budget).

Add a recording rule for the "top offenders" query

The query for the top offenders of an objective in the service is as hard for users to write as it is for Prometheus to serve (it has been responsible for a couple of crashes on our dev instance of Prometheus); this is discussed a bit in autometrics-dev/autometrics-rs#15.
Maybe with some effort we could automatically generate such a recording rule, with a human-friendly name, as part of the generated rules file (see the sketch below)? All input welcome!
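For the sake of discussion, here is a hypothetical sketch of what such a recording rule could look like. The group name, rule name, and window are made up, and the metric and label names (`function_calls_total` with `function`, `module`, `result`, `objective_name`) again follow the autometrics spec, so adjust them to whatever your version actually emits:

```yaml
groups:
  - name: autometrics-slo-top-offenders   # hypothetical group name
    rules:
      # Precompute the per-function error rate for every function that is part
      # of a success-rate objective, so dashboards and alert templates can read
      # a cheap recorded series instead of evaluating the heavy query on demand.
      - record: slo:function_error_rate:rate5m
        expr: |
          sum by (function, module, objective_name) (
            rate(function_calls_total{objective_name!="", result="error"}[5m])
          )
          /
          sum by (function, module, objective_name) (
            rate(function_calls_total{objective_name!=""}[5m])
          )
```

An alert annotation or a dashboard panel could then do something like `topk(5, slo:function_error_rate:rate5m)` to surface the worst offenders by name, without Prometheus re-evaluating the expensive raw query every time.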
-
I wonder whether having a query (which is currently included in the yet-to-be-released Grafana dashboard we're working on) that shows the latencies for all of the functions in a particular SLO would make the alerts easy enough to debug. This is somewhat similar to the "top offenders query" mentioned above, but instead of looking at all of the functions, it looks only at those that are part of the SLO.
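As a rough illustration (not necessarily the exact query used in the dashboard), such a per-function latency view could look something like this, assuming autometrics-style metric and label names (`function_calls_duration_seconds_bucket`, `objective_name`) and a hypothetical latency objective named "api":

```promql
# p99 latency per function, restricted to the functions that belong to the
# latency objective named "api". One series per function, so the slowest
# functions stand out immediately when the alert fires.
histogram_quantile(
  0.99,
  sum by (le, function, module) (
    rate(function_calls_duration_seconds_bucket{objective_name="api"}[5m])
  )
)
```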
Looking through the functions PromQL provides, it doesn't seem like there is a way to merge the function metrics into a single SLO-level series while also keeping the specific functions' names as labels on the final time series 😕
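To make the tension concrete, here is a hypothetical pair of queries (same assumed metric/label names and "api" objective as above). Aggregating into a single SLO series necessarily drops the `function` label, while grouping by `function` keeps the label but no longer yields a single value to compare against the objective:

```promql
# One series for the whole objective: sum() drops the function label, so the
# SLO value carries no function names.
sum(rate(function_calls_total{objective_name="api", result="error"}[5m]))
  /
sum(rate(function_calls_total{objective_name="api"}[5m]))

# Per-function error rates (a separate query): the function label survives,
# but now there is one series per function instead of a single SLO-level value.
sum by (function, module) (rate(function_calls_total{objective_name="api", result="error"}[5m]))
  /
sum by (function, module) (rate(function_calls_total{objective_name="api"}[5m]))
```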
-
I agree with @emschwartz.

Going more granular seems like an SLO anti-pattern where you end up with an SLO for just about everything. This feels similar to traditional alerting models, which are what SLOs are trying to move away from.

Top offenders doesn't necessarily show you what you actually want to know, which is what has changed the most to cause this SLO breach. It's possible to have a function with a high error rate (maybe due to a high number of external dependencies) that is factored into the SLO. If your SLO breaches and you get an alert, you want to know what has changed behaviour in terms of errors, not what has errored the most. This is a subtle difference, but it can easily lead people down rabbit holes when investigating an issue.

Providing the ability to quickly see a graph showing all functions that contribute to the SLO lets you see each one in the context of the others, and plot over varying time lengths. This means you can identify what has changed behaviour the most, which is a better starting point.
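One hypothetical way to put that "what has changed" angle on a graph (again assuming autometrics-style names and an "api" objective) is to plot each contributing function's current error rate against its own error rate from an earlier window, so functions whose behaviour changed float to the top rather than functions that simply error the most:

```promql
# Change in per-function error rate versus one hour earlier, for the functions
# in the "api" objective. A large positive value means the function's error
# behaviour recently changed, which is usually a better place to start digging.
(
  sum by (function, module) (rate(function_calls_total{objective_name="api", result="error"}[30m]))
    /
  sum by (function, module) (rate(function_calls_total{objective_name="api"}[30m]))
)
-
(
  sum by (function, module) (rate(function_calls_total{objective_name="api", result="error"}[30m] offset 1h))
    /
  sum by (function, module) (rate(function_calls_total{objective_name="api"}[30m] offset 1h))
)
```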