fix(queue): Reset ackAttempts on successful SQL ack #4479

Open
wants to merge 3 commits into base: master

Conversation

@nicolasff (Contributor)

Reset ackAttempts in the message metadata when a message is ack'd.
This prevents the counter from growing over time for very long-lived messages that keep getting re-enqueued, where an occasional failure leaves a message unacknowledged until it is eventually dropped once 5 acks have been missed in total.


More detailed explanation

A long-running Orca Task in a Stage will cause the same RunTask message to be de-queued and re-enqueued over and over with all its attributes, as it returns RUNNING to indicate it has not yet finished executing. If any part of the queue management code fails because Orca is too busy, or Orca crashes right after de-queuing a message, the message will have its “un-acknowledged” counter incremented by one (ackAttempts++). For long-running stages, it is possible for a task to eventually reach ackAttempts = 5 (Queue.maxRetries), which will cause SqlQueue in Orca to abandon the message and effectively stop processing its branch of the pipeline. When a long-running pipeline execution is still marked as RUNNING but no longer has any messages on the queue, it becomes a “zombie” that can’t be canceled by regular users.

The proposed fix is to reset ackAttempts to zero when a message is processed successfully, as would happen repeatedly with a long-lived stage. Instead of dropping messages when they reach 5 missed acknowledgments in total, we’ll now drop them only if they miss 5 in a row – which gives us a clear indication that the message just cannot be processed at this time.

Consider the analogy of a ping sequence used to monitor a host's uptime: if we leave ping running for 2 weeks monitoring a remote host, do we mark the host as down once it has missed 5 pings in total over these 2 weeks, or when it has missed 5 in a row?
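The intended counter semantics can be sketched as follows (a conceptual illustration only, not the actual SqlQueue code; QueueMessage, onAckMissed and onAcked are made-up names, and MAX_RETRIES stands in for Queue.maxRetries):

// Conceptual sketch: with the fix, only consecutive missed acks count toward
// the drop threshold, because a successful ack clears the counter.
data class QueueMessage(val payload: String, var ackAttempts: Int = 0)

const val MAX_RETRIES = 5 // stands in for Queue.maxRetries

// Returns true when the message would be dropped (moved to the DLQ).
fun onAckMissed(msg: QueueMessage): Boolean {
  msg.ackAttempts++
  return msg.ackAttempts >= MAX_RETRIES
}

fun onAcked(msg: QueueMessage, resetAttemptsOnAck: Boolean) {
  if (resetAttemptsOnAck) {
    msg.ackAttempts = 0 // earlier, sporadic misses no longer accumulate
  }
}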

@xibz (Contributor) commented Jun 20, 2023

Related to, and the cause of, spinnaker/spinnaker#6597.

@dbyron-sf (Contributor)

This looks like a sql-specific change. Mind updating the commit message to indicate that? Or perhaps implement it for redis as well :)

@nicolasff (Contributor, Author)

I've updated the commit message to mention that this only affects the SQL implementation.

I'm not really sure what can be done with Redis; the docs are right to say that significant data loss occurs when the Redis backend goes away. If it's even fixable, it would be a much more complex change than what is being proposed here.

@dbyron-sf (Contributor)

Thanks @nicolasff!

@dbyron-sf (Contributor)

What's involved in writing an automated test to exercise this functionality?

@nicolasff (Contributor, Author)

I've been wondering how to write a test for this. The way I went about validating the change involved manual fault injection within Orca in specific places, followed by breakpoints in other places to observe the ackAttempts counter being reset, all while keeping tabs on the SQL data in a terminal. Definitely not a repeatable process.

To trigger this scenario and cover the if (resetAttemptsOnAck) branch, we could use the following (a rough sketch follows the list):

  1. A SqlQueue instance
  2. A QueueProcessor using a thread pool containing a single thread
  3. An Orca "task" class (com.netflix.spinnaker.orca.api.pipeline.Task), maybe one that calls Thread.sleep() to block that single thread
  4. A thread producing messages and enqueueing them, only for the queue processor to fail to submit them to its thread pool, accumulating missed acks.
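As a rough sketch of step 4 (hypothetical code using plain JDK executors rather than the actual keiko/Orca classes): a single-thread pool with no work queue rejects submissions while the blocking "task" occupies its only thread, which is the situation that leaves messages unacknowledged.

import java.util.concurrent.RejectedExecutionException
import java.util.concurrent.SynchronousQueue
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

fun main() {
  // One thread and no work queue: anything submitted while the thread is busy is rejected.
  val pool = ThreadPoolExecutor(1, 1, 0L, TimeUnit.SECONDS, SynchronousQueue<Runnable>())
  pool.execute { Thread.sleep(60_000) } // stand-in for the long-running Task blocking the only thread

  var missedAcks = 0
  repeat(5) {
    try {
      pool.execute { /* would handle the message and ack it */ }
    } catch (e: RejectedExecutionException) {
      missedAcks++ // submission failed, so the message is never ack'd on this cycle
    }
  }
  println("missed acks: $missedAcks") // reaches 5, the drop threshold, without any successful ack
  pool.shutdownNow()
}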

I'm going to look into implementing this today.

In the SQL queue implementation, reset ackAttempts from the
message metadata when a message is ack'd. This avoids having this
number grow over time for very long-lived messages that keep getting
re-enqueued, with the occasional failure causing a message to not be
acknowledged and eventually get dropped once 5 acks have been missed
in total.
@nicolasff (Contributor, Author) commented Jul 7, 2023

Latest push: added a Spek test for both values of resetAttemptsOnAck, showing that the ackAttempts attribute is reset on ack when this option is enabled. Also rebased on master (currently b23c78b).

edit: test run:
[screenshot: test run output]

@@ -842,25 +844,62 @@ class SqlQueue(
     }
   }

-  private fun ackMessage(fingerprint: String) {
+  private fun ackMessage(fingerprint: String, message: Message) {
     if (log.isDebugEnabled) {
Member:

Pardon my ignorance, is this required? If debug logging isn't enabled, log.debug won't do anything anyway, right?

Contributor:

If fingerprint is big, we'll still spend non-trivial compute resources building the string to log even though log.debug doesn't actually log it.
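For illustration, the difference looks roughly like this (a sketch, not code from this PR):

import org.slf4j.LoggerFactory

private val log = LoggerFactory.getLogger("AckLoggingExample")

fun logAck(fingerprint: String) {
  // Without a guard, the message string is built eagerly even when DEBUG is off:
  log.debug("Acking message with fingerprint $fingerprint")

  // Guarding with isDebugEnabled skips the string construction entirely:
  if (log.isDebugEnabled) {
    log.debug("Acking message with fingerprint $fingerprint")
  }

  // SLF4J's parameterized form also defers the formatting until it's actually needed:
  log.debug("Acking message with fingerprint {}", fingerprint)
}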

import java.util.*


@RunWith(JUnitPlatform::class)
Contributor:

With PRs like #4484, we're trying to modernize JUnit things. I tried on my local machine and the tests still execute without this line. Here are the diffs I used:

diff --git a/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt b/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
index e7bb78f82..0ff10042a 100644
--- a/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
+++ b/keiko-sql/src/test/kotlin/com/netflix/spinnaker/q/sql/SqlAckQueueTest.kt
@@ -29,12 +29,8 @@ import org.jetbrains.spek.api.dsl.describe
 import org.jetbrains.spek.api.dsl.given
 import org.jetbrains.spek.api.dsl.it
 import org.jetbrains.spek.api.dsl.on
-import org.junit.platform.runner.JUnitPlatform
-import org.junit.runner.RunWith
 import java.util.*
 
-
-@RunWith(JUnitPlatform::class)
 class SqlAckQueueTest : Spek({
   describe("both values of resetAttemptsOnAck") {
     // map of resetAttemptsOnAck to expected number of ackAttempts still on the message after ack

import org.jetbrains.spek.api.dsl.on
import org.junit.platform.runner.JUnitPlatform
import org.junit.runner.RunWith
import java.util.*
Contributor:

please avoid star imports

@dbyron-sf changed the title from "fix(queue): Reset ackAttempts on successful ack" to "fix(queue): Reset ackAttempts on successful SQL ack" on Jul 12, 2023

// check both values of resetAttemptsOnAck
flagToAckAttempts.forEach { resetFlag, expectedAckAttempts ->
val testDescription = "SqlQueue with resetAttemptsOnAck = $resetFlag"
Contributor:

I don't see this in the test output (with no local changes). Here's what I see:
[screenshot: test output]

@nicolasff (Contributor, Author)

@dbyron-sf thanks for all the feedback, I'll make the changes you suggested and look into this discrepancy with the HTML test report.
I was also thinking of adding some extra steps to the test, validating whether or not the message makes it to the DLQ depending on the flag after, say, 4 processing failures + 1 success + 1 re-enqueue + 1 more failure.
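For illustration, that sequence would play out like this against the drop threshold of 5 (a hypothetical walk-through, not the actual test code):

fun main() {
  val threshold = 5 // Queue.maxRetries
  for (resetOnAck in listOf(true, false)) {
    var ackAttempts = 0
    repeat(4) { ackAttempts++ }     // 4 processing failures
    if (resetOnAck) ackAttempts = 0 // 1 success: the counter is reset only with the fix enabled
    // 1 re-enqueue: ackAttempts is carried over unchanged
    ackAttempts++                   // 1 more failure
    println("resetAttemptsOnAck=$resetOnAck -> ackAttempts=$ackAttempts, dropped to DLQ: ${ackAttempts >= threshold}")
  }
}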

@dbyron-sf (Contributor)

Sounds great. The more coverage the better :)

@mattgogerly (Member)

@dbyron-sf Any thoughts about this? Looks like the requested changes didn't happen but it's potentially a useful fix?
