Testing alerts and SLOs #44

emschwartz · 2023-05-01T15:55:11Z

@mies was asking whether we can make it easy to test your SLOs. That raises an interesting question: what would you actually want to test?

One idea would be to think separately about what "unit tests" and "integration tests" would mean for your SLOs. Unit tests would probably use some functionality exposed by the libraries and wouldn't connect to Prometheus or anything else external to your code. Integration tests would generate traffic, have Prometheus scrape it, and ideally end with some example alerts showing up in Slack or wherever you send them.

On the integration testing side, one thought was that the libraries could expose functionality to run a mock metrics server that produces metrics as if it were your application. Rather than mocking out your functions and calling them, it would just produce metrics labeled with the names of your functions. It would be nice if this mock server also let you fiddle with the metrics, for example by bumping the error rate for a specific function.
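To make the mock server idea a bit more concrete, here is a rough sketch in plain Python (standard library only). Nothing here is an existing autometrics API: `MockFunctionMetrics`, `set_error_rate`, `tick`, `serve`, and port 9464 are all made up for illustration, and the metric name `function_calls_count` with `function` and `result` labels is my assumption about what the libraries emit. The `caller` label and latency histograms are left out.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockFunctionMetrics:
    """Fake call counters for functions that are never actually executed."""

    def __init__(self, functions, seed=None):
        self._rng = random.Random(seed)
        # Per-function error probability; 0.0 means every call succeeds.
        self.error_rates = {name: 0.0 for name in functions}
        self.counts = {
            (name, result): 0 for name in functions for result in ("ok", "error")
        }

    def set_error_rate(self, function, rate):
        """Fiddle with the metrics: make `function` fail `rate` of the time."""
        if function not in self.error_rates:
            raise KeyError(function)
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.error_rates[function] = rate

    def tick(self, calls_per_function=100):
        """Simulate one batch of traffic by bumping the counters."""
        for name, rate in self.error_rates.items():
            errors = sum(
                self._rng.random() < rate for _ in range(calls_per_function)
            )
            self.counts[(name, "error")] += errors
            self.counts[(name, "ok")] += calls_per_function - errors

    def render(self):
        """Render the counters in the Prometheus text exposition format."""
        # Metric and label names are assumptions, not taken from a spec.
        lines = ["# TYPE function_calls_count counter"]
        for (name, result), value in sorted(self.counts.items()):
            lines.append(
                f'function_calls_count{{function="{name}",result="{result}"}} {value}'
            )
        return "\n".join(lines) + "\n"


def make_handler(metrics):
    """Build a request handler that serves `metrics` on /metrics."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            metrics.tick()  # every scrape sees a fresh batch of fake calls
            body = metrics.render().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(metrics, port=9464):
    """Block forever, serving the fake metrics for Prometheus to scrape."""
    HTTPServer(("127.0.0.1", port), make_handler(metrics)).serve_forever()
```

With something like this you could create `MockFunctionMetrics(["create_user", "get_user"])`, call `set_error_rate("create_user", 0.3)`, start `serve(...)`, and point a Prometheus scrape job at it to see whether the alert for that function's SLO fires.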

Another fun idea would be to give the mock metrics server an "I'm feeling unlucky" mode that works as a small chaos experiment: it randomly adds some bad metrics for your app, and you then have to run through the exercise of debugging the issue. (Making the caller label work with this would be somewhat tricky because we generally only know it from the code actually running, not from static analysis.)
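The "I'm feeling unlucky" part could be as small as picking a random culprit and a random error rate. This is only a sketch under my own assumptions: `feeling_unlucky` and its parameters are hypothetical names, the 5% to 50% range is arbitrary, and it does nothing about the caller label problem mentioned above.

```python
import random


def feeling_unlucky(functions, rng=None, min_rate=0.05, max_rate=0.5):
    """Pick one function at random and give it an elevated error rate.

    Returns (error_rates, culprit). The culprit should stay hidden from
    whoever is doing the debugging exercise until they have found it.
    """
    functions = list(functions)
    if not functions:
        raise ValueError("need at least one function name")
    if rng is None:
        rng = random.Random()
    culprit = rng.choice(functions)
    # Every function is healthy except the randomly chosen culprit.
    error_rates = {name: 0.0 for name in functions}
    error_rates[culprit] = rng.uniform(min_rate, max_rate)
    return error_rates, culprit
```

The returned `error_rates` mapping would then drive whatever produces the fake metrics, and the exercise is finished when someone names the culprit using only the dashboards and alerts.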

What do folks think? What kind of testing would be useful for alerts and SLOs?

IvanMerrill · 2023-05-08T19:36:34Z

I really like the chaos idea, that sounds really cool! I think it would need the caller label though. Otherwise it doesn't really let you have a go in a realistic way: you're missing a crucial bit of information that, to me, is one of the superpowers of autometrics.

For testing the alerts to be meaningful, it would need to be done on the system that will be used in production. I see value in testing the alert ('we've tested this and the alert fires when it's supposed to'), and I have had many bad experiences with alerts not triggering when they should.
