fix(queue): Reset ackAttempts on successful SQL ack #4479

Open
wants to merge 3 commits into base: master

Conversation

@nicolasff (Contributor)

Reset ackAttempts in the message metadata when a message is ack'd.
This prevents the counter from growing over time for very long-lived messages that keep getting re-enqueued, where an occasional failure leaves a message unacknowledged until it is eventually dropped once 5 acks have been missed in total.


More detailed explanation

A long-running Orca Task in a Stage will cause the same RunTask message to be de-queued and re-enqueued over and over with all its attributes, as it returns RUNNING to indicate it has not yet finished executing. If any part of the queue management code fails because Orca is too busy, or Orca crashes right after de-queuing a message, the message will have its “un-acknowledged” counter incremented by one (ackAttempts++). For long-running stages, it is possible for a task to eventually reach ackAttempts = 5 (Queue.maxRetries), which will cause SqlQueue in Orca to abandon the message and effectively stop processing its branch of the pipeline. When a long-running pipeline execution is still marked as RUNNING but no longer has any messages on the queue, it becomes a “zombie” that can’t be canceled by regular users.

The proposed fix is to reset ackAttempts to zero when a message is processed successfully, as would happen repeatedly with a long-lived stage. Instead of dropping messages when they reach 5 missed acknowledgments in total, we’ll now drop them only if they miss 5 in a row – which gives us a clear indication that the message just cannot be processed at this time.

Consider the analogy of a ping sequence used to monitor a host's uptime: if we leave ping running for 2 weeks monitoring a remote host, do we mark the host as down once it has missed 5 pings in total over these 2 weeks, or when it has missed 5 in a row?
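The intended counter semantics can be sketched as follows (a conceptual illustration only, not the actual SqlQueue code; QueueMessage, onAckMissed and onAcked are made-up names, and MAX_RETRIES stands in for Queue.maxRetries):

// Conceptual sketch: with the fix, only consecutive missed acks count toward
// the drop threshold, because a successful ack clears the counter.
data class QueueMessage(val payload: String, var ackAttempts: Int = 0)

const val MAX_RETRIES = 5 // stands in for Queue.maxRetries

// Returns true when the message would be dropped (moved to the DLQ).
fun onAckMissed(msg: QueueMessage): Boolean {
  msg.ackAttempts++
  return msg.ackAttempts >= MAX_RETRIES
}

fun onAcked(msg: QueueMessage, resetAttemptsOnAck: Boolean) {
  if (resetAttemptsOnAck) {
    msg.ackAttempts = 0 // earlier, sporadic misses no longer accumulate
  }
}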

@xibz (Contributor) commented Jun 20, 2023

Related to, and the cause of, spinnaker/spinnaker#6597.

@dbyron-sf (Contributor)

This looks like a sql-specific change. Mind updating the commit message to indicate that? Or perhaps implement it for redis as well :)

@nicolasff (Contributor, Author)

I've updated the commit message to mention that this only affects the SQL implementation.

I'm not really sure what can be done with Redis; the docs are right to say that significant data loss occurs when the Redis backend goes away. If it's even fixable, it would be a much more complex change than what is being proposed here.

@dbyron-sf (Contributor)

Thanks @nicolasff!

@dbyron-sf (Contributor)

What's involved in writing an automated test to exercise this functionality?

@nicolasff (Contributor, Author)

I've been wondering how to write a test for this. The way I went about validating the change involved manual fault injection within Orca in specific places, followed by breakpoints in other places to observe the ackAttempts counter being reset, all while keeping tabs on the SQL data in a terminal. Definitely not a repeatable process.

To trigger this scenario and cover the if (resetAttemptsOnAck) branch, we could use the following (a rough sketch follows the list):

  1. A SqlQueue instance
  2. A QueueProcessor using a thread pool containing a single thread
  3. An Orca "task" class (com.netflix.spinnaker.orca.api.pipeline.Task), maybe one that calls Thread.sleep() to block that single thread
  4. A thread producing messages and enqueueing them, only for the queue processor to fail to submit them to its thread pool, accumulating missed acks.
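As a rough sketch of step 4 (hypothetical code using plain JDK executors rather than the actual keiko/Orca classes): a single-thread pool with no work queue rejects submissions while the blocking "task" occupies its only thread, which is the situation that leaves messages unacknowledged.

import java.util.concurrent.RejectedExecutionException
import java.util.concurrent.SynchronousQueue
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

fun main() {
  // One thread and no work queue: anything submitted while the thread is busy is rejected.
  val pool = ThreadPoolExecutor(1, 1, 0L, TimeUnit.SECONDS, SynchronousQueue<Runnable>())
  pool.execute { Thread.sleep(60_000) } // stand-in for the long-running Task blocking the only thread

  var missedAcks = 0
  repeat(5) {
    try {
      pool.execute { /* would handle the message and ack it */ }
    } catch (e: RejectedExecutionException) {
      missedAcks++ // submission failed, so the message is never ack'd on this cycle
    }
  }
  println("missed acks: $missedAcks") // reaches 5, the drop threshold, without any successful ack
  pool.shutdownNow()
}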

I'm going to look into implementing this today.

In the SQL queue implementation, reset ackAttempts from the
message metadata when a message is ack'd. This avoids having this
number grow over time for very long-lived messages that keep getting
re-enqueued, with the occasional failure causing a message to not be
acknowledged and eventually get dropped once 5 acks have been missed
in total.
@nicolasff (Contributor, Author) commented Jul 7, 2023

Latest push: added a Spek test for both values of resetAttemptsOnAck, showing that the ackAttempts attribute is reset on ack when this option is enabled. Also rebased on master (currently b23c78b).

edit: test run:
[screenshot: test run output]

@@ -842,25 +844,62 @@ class SqlQueue(
     }
   }

-  private fun ackMessage(fingerprint: String) {
+  private fun ackMessage(fingerprint: String, message: Message) {
     if (log.isDebugEnabled) {
Member:

Pardon my ignorance, is this required? If debug logging isn't enabled, log.debug won't do anything anyway, right?

Contributor:

If fingerprint is big, we'll still spend non-trivial compute resources building the string to log even though log.debug doesn't actually log it.
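For illustration, the difference looks roughly like this (a sketch, not code from this PR):

import org.slf4j.LoggerFactory

private val log = LoggerFactory.getLogger("AckLoggingExample")

fun logAck(fingerprint: String) {
  // Without a guard, the message string is built eagerly even when DEBUG is off:
  log.debug("Acking message with fingerprint $fingerprint")

  // Guarding with isDebugEnabled skips the string construction entirely:
  if (log.isDebugEnabled) {
    log.debug("Acking message with fingerprint $fingerprint")
  }

  // SLF4J's parameterized form also defers the formatting until it's actually needed:
  log.debug("Acking message with fingerprint {}", fingerprint)
}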

import java.util.*


@RunWith(JUnitPlatform::class)
Contributor:

With PRs like #4484, we're trying to modernize JUnit things. I tried on my local machine and the tests still execute without this line. Here are the diffs I used:

diff --git a/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt b/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
index e7bb78f82..0ff10042a 100644
--- a/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
+++ b/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
@@ -29,12 +29,8 @@ import org.jetbrains.spek.api.dsl.describe
 import org.jetbrains.spek.api.dsl.given
 import org.jetbrains.spek.api.dsl.it
 import org.jetbrains.spek.api.dsl.on
-import org.junit.platform.runner.JUnitPlatform
-import org.junit.runner.RunWith
 import java.util.*
 
-
-@RunWith(JUnitPlatform::class)
 class SqlAckQueueTest : Spek({
   describe("both values of resetAttemptsOnAck") {
     // map of resetAttemptsOnAck to expected number of ackAttempts still on the message after ack

import org.jetbrains.spek.api.dsl.on
import org.junit.platform.runner.JUnitPlatform
import org.junit.runner.RunWith
import java.util.*
Contributor:

please avoid star imports

@dbyron-sf changed the title from "fix(queue): Reset ackAttempts on successful ack" to "fix(queue): Reset ackAttempts on successful SQL ack" on Jul 12, 2023

// check both values of resetAttemptsOnAck
flagToAckAttempts.forEach { resetFlag, expectedAckAttempts ->
val testDescription = "SqlQueue with resetAttemptsOnAck = $resetFlag"
Contributor:

I don't see this in the test output (with no local changes). Here's what I see:
[screenshot: test output]

@nicolasff (Contributor, Author)

@dbyron-sf thanks for all the feedback, I'll make the changes you suggested and look into this discrepancy with the HTML test report.
I was also thinking of adding some extra steps to the test, validating whether or not the message makes it to the DLQ depending on the flag after, say, 4 processing failures + 1 success + 1 re-enqueue + 1 more failure.
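For illustration, that sequence would play out like this against the drop threshold of 5 (a hypothetical walk-through, not the actual test code):

fun main() {
  val threshold = 5 // Queue.maxRetries
  for (resetOnAck in listOf(true, false)) {
    var ackAttempts = 0
    repeat(4) { ackAttempts++ }     // 4 processing failures
    if (resetOnAck) ackAttempts = 0 // 1 success: the counter is reset only with the fix enabled
    // 1 re-enqueue: ackAttempts is carried over unchanged
    ackAttempts++                   // 1 more failure
    println("resetAttemptsOnAck=$resetOnAck -> ackAttempts=$ackAttempts, dropped to DLQ: ${ackAttempts >= threshold}")
  }
}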

@dbyron-sf (Contributor)

Sounds great. The more coverage the better :)

@mattgogerly (Member)

@dbyron-sf Any thoughts about this? Looks like the requested changes didn't happen but it's potentially a useful fix?
