How should SLO alerts give details about the failing functions? #51
-
Context

As we are dogfooding the implementation of autometrics in our own API for Service Level Objective alerts, we noticed that the alerts generated by the current rules in our generated rules file are hard to act on. The main pain point is that the alerts use "global" thresholds that accumulate latency and success rate data across all the functions in the same group, so when an alert is triggered, it is hard to tell which functions are contributing the most to it. For context, when one of these alerts triggers, the notification we receive (on Slack) never mentions any function names in its metadata.
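For illustration, here is a greatly simplified, hypothetical sketch of the shape such a generated success-rate alert takes (the actual generated rules are more elaborate, using multi-window, multi-burn-rate expressions). The metric and label names below (`function_calls_total`, `result`, `objective_name`, `objective_percentile`) follow the autometrics spec and may differ from the version you are running; the objective name "api", the alert name, and the thresholds are made up. The point is only that the aggregation has no `by (function)` clause, so the firing alert carries no function names:

```yaml
# Hypothetical, heavily simplified success-rate alert for a 99% objective
# named "api". Both sides of the division aggregate over *all* functions in
# the objective, so the resulting alert has no per-function labels.
- alert: HighErrorRateSLO
  expr: |
    (
      sum(rate(function_calls_total{objective_name="api", objective_percentile="99", result="error"}[1h]))
      /
      sum(rate(function_calls_total{objective_name="api", objective_percentile="99"}[1h]))
    )
    > (14.4 * (1 - 0.99))
  for: 5m
  labels:
    severity: page
```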
Ideas

We are discussing internally what could be changed to produce better alert messages that help kickstart the debugging process whenever an alert is triggered, so that autometrics also provides value in incident response; but we welcome any suggestion, and for that purpose we will continue the conversation here. Here are a few ideas that have been thrown around.

More granular alerting rules

We thought about changing the generated alerting rule so that it sums the latency/success rate separately for each function in the service, instead of observing the metrics for the complete set of functions. This way we would get the function names "for free" in the metadata of the alert. The main reason we do not like this idea is that it breaks the principle that "SLO alerts trigger when the service as a whole struggles": alerts would not trigger if each function in the service individually ate 99.9999...% of the error budget of the SLO (even though, as soon as you have 10 functions in the service, the service as a whole would be 899.99999...% over budget).

Add a recording rule for the "top offenders" query

The query for the top offenders of an objective in the service is as hard for users to write as it is for Prometheus to serve (it has been responsible for a couple of crashes on our dev instance of Prometheus); this is discussed a bit in autometrics-dev/autometrics-rs#15.
Maybe with some effort we could automatically generate such a recording rule, with a human-friendly name, as part of the generated rules file (see the sketch below)? All input welcome!
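For the sake of discussion, here is a hypothetical sketch of what such a recording rule could look like. The group name, rule name, and window are made up, and the metric and label names (`function_calls_total` with `function`, `module`, `result`, `objective_name`) again follow the autometrics spec, so adjust them to whatever your version actually emits:

```yaml
groups:
  - name: autometrics-slo-top-offenders   # hypothetical group name
    rules:
      # Precompute the per-function error rate for every function that is part
      # of a success-rate objective, so dashboards and alert templates can read
      # a cheap recorded series instead of evaluating the heavy query on demand.
      - record: slo:function_error_rate:rate5m
        expr: |
          sum by (function, module, objective_name) (
            rate(function_calls_total{objective_name!="", result="error"}[5m])
          )
          /
          sum by (function, module, objective_name) (
            rate(function_calls_total{objective_name!=""}[5m])
          )
```

An alert annotation or a dashboard panel could then do something like `topk(5, slo:function_error_rate:rate5m)` to surface the worst offenders by name, without Prometheus re-evaluating the expensive raw query every time.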
-
I wonder whether having a query (which is currently included in the yet-to-be-released Grafana dashboard we're working on) that shows the latencies for all of the functions in a particular SLO would make the alerts easy enough to debug. This is somewhat similar to the "top offenders query" mentioned above, but instead of looking at all of the functions, it looks only at those that are part of the SLO.
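As a rough illustration (not necessarily the exact query used in the dashboard), such a per-function latency view could look something like this, assuming autometrics-style metric and label names (`function_calls_duration_seconds_bucket`, `objective_name`) and a hypothetical latency objective named "api":

```promql
# p99 latency per function, restricted to the functions that belong to the
# latency objective named "api". One series per function, so the slowest
# functions stand out immediately when the alert fires.
histogram_quantile(
  0.99,
  sum by (le, function, module) (
    rate(function_calls_duration_seconds_bucket{objective_name="api"}[5m])
  )
)
```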
Looking through the functions PromQL provides, it doesn't seem like there is a way to merge the function metrics into a single SLO-level series while also keeping the specific functions' names as labels on the final time series 😕
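To make the tension concrete, here is a hypothetical pair of queries (same assumed metric/label names and "api" objective as above). Aggregating into a single SLO series necessarily drops the `function` label, while grouping by `function` keeps the label but no longer yields a single value to compare against the objective:

```promql
# One series for the whole objective: sum() drops the function label, so the
# SLO value carries no function names.
sum(rate(function_calls_total{objective_name="api", result="error"}[5m]))
  /
sum(rate(function_calls_total{objective_name="api"}[5m]))

# Per-function error rates (a separate query): the function label survives,
# but now there is one series per function instead of a single SLO-level value.
sum by (function, module) (rate(function_calls_total{objective_name="api", result="error"}[5m]))
  /
sum by (function, module) (rate(function_calls_total{objective_name="api"}[5m]))
```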
-
I agree with @emschwartz.

Going more granular seems like an SLO anti-pattern where you end up with an SLO for just about everything. This feels similar to traditional alerting models, which are what SLOs are trying to move away from.

Top offenders doesn't necessarily show you what you actually want to know, which is what has changed the most to cause this SLO breach. It's possible to have a function with a high error rate (maybe due to a high number of external dependencies) that is factored into the SLO. If your SLO breaches and you get an alert, you want to know what has changed behaviour in terms of errors, not what has errored the most. This is a subtle difference, but it can easily lead people down rabbit holes when investigating an issue.

Providing the ability to quickly see a graph showing all functions that contribute to the SLO lets you see each one in the context of the others, and plot over varying time lengths. This means you can identify what has changed behaviour the most, which is a better starting point.
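One hypothetical way to put that "what has changed" angle on a graph (again assuming autometrics-style names and an "api" objective) is to plot each contributing function's current error rate against its own error rate from an earlier window, so functions whose behaviour changed float to the top rather than functions that simply error the most:

```promql
# Change in per-function error rate versus one hour earlier, for the functions
# in the "api" objective. A large positive value means the function's error
# behaviour recently changed, which is usually a better place to start digging.
(
  sum by (function, module) (rate(function_calls_total{objective_name="api", result="error"}[30m]))
    /
  sum by (function, module) (rate(function_calls_total{objective_name="api"}[30m]))
)
-
(
  sum by (function, module) (rate(function_calls_total{objective_name="api", result="error"}[30m] offset 1h))
    /
  sum by (function, module) (rate(function_calls_total{objective_name="api"}[30m] offset 1h))
)
```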