Testing alerts and SLOs #44
emschwartz
started this conversation in
Ideas
Replies: 0 comments 1 reply
-
I really like the chaos idea, that sounds really cool! For testing the alerts to be meaningful, I feel it would need to be done on the system that will be used in production. I see value in testing the alert ('we've tested this and the alert fires when it's supposed to'), and I have had many bad experiences with alerts not triggering when they should.
-
@mies was asking about whether we can make it easy to test your SLOs. That brings up an interesting question about what you would actually want to test.
One idea would be to separately think about what it means to add "unit tests" and "integration tests" for your SLOs. Unit tests would probably use some functionality exposed by the libraries and wouldn't connect to Prometheus or anything external to your code. Integration tests would run through generating traffic, having Prometheus scrape it, and you'd probably want to see some example alerts show up in Slack or wherever you are sending them.
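To make the "unit test" side a bit more concrete, here is a rough Python sketch of checking an objective against call counts, with no Prometheus involved. The `Objective` class and both helper functions are hypothetical names made up for illustration. They are not part of any existing library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """Hypothetical SLO definition: a name plus a target success ratio."""
    name: str
    success_target: float  # e.g. 0.99 means 99% of calls must succeed


def success_rate(ok: int, error: int) -> float:
    """Fraction of calls that succeeded; no traffic counts as fully successful."""
    total = ok + error
    return 1.0 if total == 0 else ok / total


def meets_objective(objective: Objective, ok: int, error: int) -> bool:
    """True when the observed success rate is at or above the target."""
    return success_rate(ok, error) >= objective.success_target
```

A test could then assert that, say, 995 successes and 5 errors satisfy a 99% objective while 98 successes and 2 errors do not, all without touching a metrics backend.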
On the integration testing idea, one thought was that we could have the libraries expose functionality to run a mock metrics server that produces metrics as if it were your application. This is in lieu of actually mocking out your functions and calling the functions. Instead, we can just produce the metrics with the names of your functions. It would be nice if this mock server thing also allowed you to fiddle with the metrics, for example bumping the error rate for a specific function.
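A minimal sketch of what such a mock metrics server might look like, using only the Python standard library. Everything here is an assumption for illustration: the `MockMetrics` class, the `function_calls_count` metric name, the `function` and `result` labels, the port, and the choice to simulate 100 calls per function on every scrape. Call counts are derived deterministically from the configured error rate so that tests stay reproducible.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockMetrics:
    """Fake per-function call counters; no application code is run."""

    def __init__(self, functions):
        self.error_rate = {name: 0.0 for name in functions}
        self.counts = {name: {"ok": 0, "error": 0} for name in functions}

    def set_error_rate(self, function, rate):
        """Fiddle with the metrics: make `function` fail at the given rate."""
        self.error_rate[function] = rate

    def simulate(self, calls_per_function):
        """Advance every counter as if each function had just been called."""
        for name, rate in self.error_rate.items():
            errors = round(calls_per_function * rate)
            self.counts[name]["error"] += errors
            self.counts[name]["ok"] += calls_per_function - errors

    def render(self):
        """Render the counters in the Prometheus text exposition format."""
        lines = ["# TYPE function_calls_count counter"]
        for name in sorted(self.counts):
            for result in ("ok", "error"):
                lines.append(
                    f'function_calls_count{{function="{name}",result="{result}"}} '
                    f'{self.counts[name][result]}'
                )
        return "\n".join(lines) + "\n"


def make_handler(metrics):
    """Build a request handler that serves the mock metrics on every GET."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            metrics.simulate(100)  # pretend 100 more calls happened per scrape
            body = metrics.render().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(metrics, port=9464):
    """Block forever, serving the mock metrics for Prometheus to scrape."""
    HTTPServer(("127.0.0.1", port), make_handler(metrics)).serve_forever()
```

An integration test could call `set_error_rate("save_user", 0.2)`, point Prometheus at the server started by `serve`, and then wait for the corresponding alert to show up.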
Another fun idea would be to give the mock metrics server an "I'm feeling unlucky" mode that provides a kind of small chaos experiment where it randomly adds in some bad metrics for your app and then you need to run through the exercise of debugging the issue. (Making the `caller` label work with this would be somewhat tricky because we generally only know that from the code actually running as opposed to static analysis.)

What do folks think? What kind of testing would be useful for alerts and SLOs?
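As a rough illustration of the "I'm feeling unlucky" mode, the core could be as small as picking one function at random and raising its error rate. This is only a sketch: the `feel_unlucky` name, the `bad_rate` default, and the idea of representing the app as a mapping from function names to error rates are all assumptions, not an existing API.

```python
import random


def feel_unlucky(error_rates, rng=None, bad_rate=0.5):
    """Pick one function at random and give it a bad error rate.

    Returns the chosen function name and a new mapping; the input mapping
    is left untouched so the healthy baseline can be restored afterwards.
    """
    rng = rng or random.Random()
    victim = rng.choice(sorted(error_rates))
    chaotic = dict(error_rates)
    chaotic[victim] = bad_rate
    return victim, chaotic
```

Passing a seeded `random.Random` makes an experiment repeatable, while the person doing the debugging exercise would only see the resulting metrics and not the returned `victim`.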