
Crons: Detect ingestion outages during clock ticks #79328

Open · 10 of 12 tasks
evanpurkhiser opened this issue Oct 17, 2024 · 1 comment

evanpurkhiser commented Oct 17, 2024

There is a failure scenario that can be very disruptive for our customers.

If we have an outage in our ingestion of cron check-ins, specifically one where we are dropping check-ins, then we may incorrectly mark customers' cron monitors as having missed check-ins. This only happens when we drop check-ins; if check-ins are merely delayed, the clock ticks which drive missed and time-out detection will slow down to match the consumption of check-ins in our topic.

This is highly problematic, as it means customers are unable to trust that cron alerts are accurate. It is, however, a difficult problem: if check-ins never make it into the check-ins topic, how can we differentiate between a customer's job failing to send a check-in and us failing to ingest their check-in?

In most of our ingestion failure scenarios we have seen a significant drop in check-ins. That typically looks something like this:

Image

Improved behavior

If we were able to detect this extreme drop in volume, we could produce clock ticks that are marked as "unknown" ticks, meaning we are still moving the clock forward, but strongly suspect that we may have lost many check-ins. When this happens, instead of creating missed and timed-out check-ins that trigger alerts, we can create missed check-ins with an "unknown" status and mark in-progress check-ins that are past their latest check-in time as "unknown", again without alerting customers. Once we are certain that we have recovered from our incident, the clock will resume producing regular ticks that are not marked as unknown.
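
To make this concrete, here is a minimal sketch, not Sentry's actual implementation, of how an "unknown" tick could change which synthetic check-ins get created and whether they alert. All names (TickResult, SyntheticCheckIn, make_missed) are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class TickResult(Enum):
    NORMAL = "normal"
    UNKNOWN = "unknown"  # volume drop detected; we may have lost check-ins


class CheckInStatus(Enum):
    MISSED = "missed"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"


@dataclass
class SyntheticCheckIn:
    monitor_env_id: int
    status: CheckInStatus
    notify: bool  # whether this synthetic check-in should trigger an alert


def make_missed(monitor_env_id: int, tick_result: TickResult) -> SyntheticCheckIn:
    """Create the synthetic check-in for a monitor that did not check in this minute."""
    if tick_result is TickResult.UNKNOWN:
        # We can't distinguish "the customer's job never ran" from "we dropped
        # the check-in", so record an unknown check-in and do not alert.
        return SyntheticCheckIn(monitor_env_id, CheckInStatus.UNKNOWN, notify=False)
    return SyntheticCheckIn(monitor_env_id, CheckInStatus.MISSED, notify=True)
```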

Detecting ingestion volume drops

The tricky part here is deciding if we are in an incident state. Ideally we do not rely on an external service telling us that we may be in an incident, since that service itself may be part of the incident (e.g., if we relied on Relay to report that it was having problems, there's no guarantee that while it is having problems it would still be able to report them to us).

My proposed detection solution is rather simple. As we consume check-ins, we keep a bucket for each minute's worth of check-ins; that bucket is a counter of how many check-ins were consumed during that minute. We will keep these buckets for 7 days' worth of data, which is 10080 buckets.

Each time the clock ticks across a minute, we will look at the last 7 days of that particular minute we ticked over, take some type of average of those 7 counts, and compare that with the count of the minute we just ticked past. If we find that this count deviates by some percentage from the previous 7 days of that minute, we will produce our clock tick with an "unknown" marker, meaning we are unsure whether we collected enough data for this minute and are likely in an incident. In that case we will create misses and time-outs as "unknown".
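
A minimal sketch of this detection, assuming the per-minute counters live in an in-memory mapping (the real system would presumably keep them in Redis) and using an illustrative 50% cutoff rather than a tuned threshold:

```python
from datetime import datetime, timedelta
from statistics import mean

RETENTION_DAYS = 7              # 7 days of one-minute buckets -> 10080 buckets
PCT_DEVIATION_THRESHOLD = 0.5   # assumed cutoff: 50% below the historic mean

# minute (truncated to zero seconds) -> number of check-ins consumed that minute
volume_buckets: dict[datetime, int] = {}


def record_check_in(ts: datetime) -> None:
    """Increment the counter for the minute this check-in landed in."""
    minute = ts.replace(second=0, microsecond=0)
    volume_buckets[minute] = volume_buckets.get(minute, 0) + 1


def tick_is_unknown(tick: datetime) -> bool:
    """Compare the minute we just ticked past to the same minute over the last 7 days."""
    minute = tick.replace(second=0, microsecond=0)
    history = [
        volume_buckets[minute - timedelta(days=d)]
        for d in range(1, RETENTION_DAYS + 1)
        if (minute - timedelta(days=d)) in volume_buckets
    ]
    if not history:
        return False  # "not enough data" case; see the open question below
    expected = mean(history)
    observed = volume_buckets.get(minute, 0)
    # Mark the tick "unknown" when volume dropped far below the historic mean.
    return expected > 0 and observed < expected * (1 - PCT_DEVIATION_THRESHOLD)
```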

Ignoring previous incidents

When a minute is detected as having an abnormally low volume, we should reset its count to some sentinel value like -1 so that when we pick up this minute over the next 7 days, we know to ignore the data, since it will not be an accurate representation of typical volume.
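
Extending the sketch above, the sentinel could be handled by flagging the bucket and filtering it out of the history before averaging (the -1 value and helper names are again just assumptions):

```python
SENTINEL = -1  # assumed marker for minutes recorded during a detected incident


def invalidate_minute(minute: datetime) -> None:
    """Flag a minute whose count should not contribute to future averages."""
    volume_buckets[minute] = SENTINEL


def usable_history(counts: list[int]) -> list[int]:
    """Drop flagged minutes before computing the mean in tick_is_unknown()."""
    return [c for c in counts if c != SENTINEL]
```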

Not enough data

Warning

What should we do if we don't have enough data to determine if the past minute is within the expected volume?

Implementation

We should start by implementing this as a simple metric that we track, so we can understand what our typical difference looks like each day. It's possible some days may have many more check-ins, such as Mondays at midnight, so we may need a different way to evaluate anomalies.


evanpurkhiser added a commit that referenced this issue Oct 22, 2024
Part of GH-79328

---------

Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 23, 2024
This will be used to inform the clock tick tasks that the tick detected
an abnormal amount of check-in volume for the previous minute.

Part of getsentry/sentry#79328
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 23, 2024
This task will be triggered when we detect an anomaly in check-in
volume during a clock tick. When this happens we are unable to know that
all check-ins before that tick have been received and will need to mark
all in-progress monitors as resulting in a 'unknown' state.

Part of getsentry/sentry#79328
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 24, 2024
When a clock tick is marked as having an abnormal volume we may have
lost check-ins that should have been processed during this minute. In
this scenario we do not want to notify on misses, and instead should
create them as unknown misses.

Part of getsentry/sentry#79328
evanpurkhiser added a commit that referenced this issue Oct 24, 2024
This adds a function `_evaluate_tick_decision` which looks back at the
last MONITOR_VOLUME_RETENTION days worth of history and compares the
minute we just ticked past to that data.

We record 3 metrics from this comparison

- `z_value`: This is measured as a ratio of standard deviations from the
mean value
- `pct_deviation`: This is the percentage we've deviated from the mean
- `count`: This is the number of historic data points we're considering

The z_value and pct_deviation will be most helpful in making our
decision as to whether we've entered an "incident" state or not.

Part of #79328
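
For context, the metrics this commit describes could be computed roughly as follows; this is a sketch, not the actual `_evaluate_tick_decision` code, and it only reports the values without making a decision:

```python
from statistics import mean, pstdev


def tick_metrics(history: list[int], observed: int) -> dict[str, float]:
    """Rough illustration of the z_value / pct_deviation / count metrics."""
    if not history:
        return {"z_value": 0.0, "pct_deviation": 0.0, "count": 0.0}
    mu = mean(history)
    sigma = pstdev(history)
    return {
        # ratio of standard deviations from the historic mean
        "z_value": (observed - mu) / sigma if sigma else 0.0,
        # percentage deviation from the historic mean
        "pct_deviation": abs(observed - mu) / mu * 100 if mu else 0.0,
        # number of historic data points considered
        "count": float(len(history)),
    }
```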
evanpurkhiser added a commit that referenced this issue Oct 24, 2024
This will be used to pass around the result of anomaly detection during
clock ticks.

Part of #79328
evanpurkhiser added a commit that referenced this issue Nov 6, 2024
This will start with always reporting ticks as having a "normal" volume.
Later we will use the volume anomaly detection to inform this value.

Part of GH-79328.

Do not merge this until the previous changes have already rolled out.
evanpurkhiser added a commit that referenced this issue Nov 6, 2024
This task will be used when the clock ticks with the
volume_anomaly_result set to `abnormal`. In this scenario we must mark
ALL in-progress check-ins as "unknown", since we cannot be sure that the
completing check-in was not sent during the lost data that caused the
volume drop of check-ins.

Part of GH-79328

---------

Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
Co-authored-by: Josh Ferge <[email protected]>
evanpurkhiser added a commit that referenced this issue Nov 6, 2024
This causes clock ticks with the "volume_anomaly_result" set as abnormal
to dispatch a mark_unknown task instead of a check_timeouts task.

This is needed since when we detect an anomalous tick we're unable to
know if we lost customer data during that tick, meaning all in-progress
check-ins that were waiting for closing check-ins must now be
invalidated with a status of unknown.

Part of GH-79328
Requires #79735
evanpurkhiser added a commit that referenced this issue Nov 7, 2024
Previously this was intended to be used when no status was sent, but it was never actually used for that, since our SDKs and API endpoints have always required that a status be sent.

We're going to use this status for the scenario where we detect
anomalous clock ticks.

Part of GH-79328
evanpurkhiser added a commit that referenced this issue Nov 7, 2024
…#80352)

When marking a miss we will need to know the TickVolumeAnomolyResult to
determine if a miss is created as `MISS` or `UNKNOWN`.

Part of GH-79328
evanpurkhiser commented Nov 8, 2024

We've realized there is a problem with the approach described in this issue.

As it turns out, it's unlikely we can determine an incident just by comparing the volume of a specific clock tick to the volume from previous clock ticks. We did have an incident while building this feature, after we had already started collecting metrics for our historic volume. Looking at the % deviation from the mean during that incident, we have the following graph (the second graph being the number of check-ins we processed):

Image

Note that there's an uptick in % mean deviation, followed by a small downtick in % deviation. This downtick is not visible in the actual check-ins-processed graph and is likely just due to how check-in volume varies from minute to minute. So while we did consume fewer check-ins at that minute, its deviation is not as great compared to the previous minute's deviation. This does not mean the incident stopped during that downtick; if we were to just use some percentage cutoff to determine whether a tick is abnormal, we would have processed this tick as normal, likely losing data and creating missed check-ins.

This is something we need to account for when determining if we're in an incident state.

Given what we've implemented so far, where we make a decision during a tick, create unknown check-ins instead of missed ones, and mark in-progress check-ins as unknown, you might intuitively think we can simply delay the clock tick, look at future volume deviations during the delayed clock tick, use that to determine if we are definitely in an incident state, and then make the decision for that tick to create misses as unknown, etc. But there is a problem with this.

#63166 describes a problem where, during a single-partition backlog, monitors may fail to correctly mark missed check-ins due to the fact that the clock is delayed while we are still consuming check-ins from relay. During a clock delay like this, it's possible an OK check-in will come in before we're able to mark missed check-ins for a monitor. By inducing a clock delay for the purpose of detecting incidents, we'll also be inducing this behavior, causing real misses to be lost.

Ideally we would just fix this problem, but unfortunately we can't. In fact, we are not able to introduce any kind of clock delay without introducing this problem. Even if we were to "fix" it using the solution proposed in that ticket of back-filling missed check-ins, it does not work: doing so would mean that during an incident, when we are trying to avoid creating misses and instead create them as unknown, we would still create misses via this backfill. There's no way for them to be backfilled as unknowns, since the fact that the clock is delayed (causing this problem in the first place) is because we do not yet know if we're in an incident. So really we would need to delay creating those misses, but that would mean delaying processing of the OK check-in, which we cannot do.

So if we're unable to delay the clock whatsoever, what can we do?

The problem we're trying to solve here is threefold:

  1. Do NOT notify customers of missed and timed-out check-ins during an ingestion incident at Sentry. This is THE most important thing: we do not want to wake up customers at 1am when their job ran fine but we had an ingestion incident.
  2. Correctly communicate in our check-in history UI that missed or timed-out check-ins were created during an incident and likely do not reflect what their monitor was actually reporting.
  3. Be very accurate about when an incident started and when it resolved itself, by using volume history.

Notifications are the component that needs to be delayed, specifically notifications for synthetic check-ins (missed, time-out). Once we are sure we are in an incident, we can mark misses and time-outs as unknown. The question then is: how can we delay issue occurrences until we've ticked the clock forward enough to have enough data to determine whether we're in an incident and should not produce these notifications?

  1. mark_failed is responsible for calling create_issue_platform_occurance. What we can do, instead of immediately dispatching the issue occurrence, is take the arguments and serialize them into an object that we can put into redis. This object will be keyed as monitor-failure-occurrence:{clock_tick_ts}:{monitor_env_id}.

  2. We will put that key into a redis set keyed simply as monitor-failure-occurrences. This set is responsible for tracking all issue occurrences that are pending being sent. We will have already created the incidents for these occurrences, but because the occurrences will not yet have been sent, no notifications will go out.

  3. Finally, we will need to actually dispatch these occurrences. This is when we will determine if we may be in an incident and should delay dispatching the occurrences so we can look at future check-in volume. The actual dispatching will again be driven by the clock tick. Since the clock tick is the source of truth for each tick's volume, in order to know we have collected enough volume we need to know where the clock is. As each tick runs, here's what it will do (a rough sketch follows this list):

    1. Get the monitor-failure-occurrences set of monitor-failure-occurrence:{clock_tick_ts} keys. The clock_tick_ts in the key will always be earlier than the tick that just did this lookup. At this point, we can look at the volume history for the current tick and decide if it has met a very low threshold that we would consider an "incident".
    2. If we have not met the incident threshold, we will simply dispatch all the occurrences to send notifications to customers.
    3. If we have met the threshold, we will introduce a delay before this set of failure occurrences is processed. This delay allows us to look at future clock tick volume and determine whether we are entering an incident. Once we've entered an incident, we can likely store this in redis and use it when processing future occurrences to NOT dispatch them.
    4. At this point we can also likely dispatch some type of task to mark misses and time-outs as unknown during the incident.
    5. Once the volume recovers back to a normal amount after some time, we can reset the delay and return to actually dispatching occurrences.
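
Here is a rough sketch of that flow, assuming a plain redis-py client and the key names from above; the function names and the notification dispatch are made up for illustration and are not the actual implementation:

```python
import json

import redis

r = redis.Redis()

PENDING_SET = "monitor-failure-occurrences"


def defer_occurrence(clock_tick_ts: int, monitor_env_id: int, occurrence_kwargs: dict) -> None:
    """Instead of dispatching the issue occurrence immediately, park it in redis."""
    key = f"monitor-failure-occurrence:{clock_tick_ts}:{monitor_env_id}"
    r.set(key, json.dumps(occurrence_kwargs))
    r.sadd(PENDING_SET, key)


def process_pending_on_tick(tick_ts: int, tick_volume_is_anomalous: bool) -> None:
    """Driven by each clock tick: dispatch pending occurrences or keep delaying them."""
    for key in r.smembers(PENDING_SET):
        if tick_volume_is_anomalous:
            # Possible incident: leave the occurrence parked so future ticks can
            # decide, and (elsewhere) mark the related misses/time-outs as unknown.
            continue
        payload = json.loads(r.get(key))
        dispatch_issue_occurrence(**payload)  # stand-in for the real issue platform dispatch
        r.srem(PENDING_SET, key)
        r.delete(key)


def dispatch_issue_occurrence(**kwargs) -> None:
    """Placeholder standing in for creating the issue occurrence / notification."""
    print("dispatching occurrence", kwargs)
```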

Notes

  • Misses are going to show as missed in the UI until they are marked as unknown.
  • Timeouts get marked during an incident; how are we able to go back and mark those time-outs as unknown?
    • We can look at the timeout_at and see if it fell within the window for the check-in.

It would be nice if we could NOT show the misses and timeouts in the UI until we've dispatched a notification, maybe?
