-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve throughput with optional immediate rescan #122
base: master
Are you sure you want to change the base?
Conversation
A variation on this idea I'd like to consider is to also use I'll experiment with this idea. |
Love this! can we split this out into it's own PR and get this merged first. then we can work on the interval delay |
consumer.go
Outdated
@@ -213,6 +209,18 @@ func (c *Consumer) ScanShard(ctx context.Context, shardID string, fn ScanFunc) e | |||
} | |||
|
|||
shardIterator = resp.NextShardIterator | |||
if c.immediateRescan && len(resp.Records) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the c.immediateRescan = true
we'll skip the <-ctx.Done()
below. wanna make sure the program will gracefully shutdown when told to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, wouldn't that get picked up by the <-ctx.Done()
in the inner loop above? Ok, there's a narrow window between the point where we finish iterating over all records (where <-ctx.Done()
would apply) and looping back to c.client.GetRecords()
but I see this as being very similar to the case of being asked to shutdown while having called out to c.client.GetRecords()
: in either case, the Kinesis API call will complete quickly and we will fall through to the inner loop where <-ctx.Done()
there would return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh yep, good point. didn't see the additional check in the loop above (collapsed in PR preview)
Yep no problem. |
#123 is now created just with this logic. I also removed the |
Ran this overnight and I'm happy with the result. I've included that logic in this PR. |
When immediate rescan is enabled, a poll from Kinesis that returns records and also indicates we are still behind the latest record will skip waiting for the next scan interval and rescan immediately. Related, the initial scan for a shard is performed immediately. The scan ticker only applies to subsequent scans. These changes make higher scan intervals feasible, allowing for less chatty clients while also improving overall effective throughput.
Rebased and squashed. |
Thinking on this a little more, I wonder if for certain types of streams this could actually reduce effective thoughput. For low volume streams, this should be innocuous. When there's activity on the stream, we'll quickly catch up and then simply wait for the next scan tick as with the existing behavior. For high volume streams, each call to Kinesis will give us plenty of work to do, so it makes sense to read that data as quickly as possible. However, there is a 5 TPS rate limit per shard for GetRecords, which we should try to respect lest we get throttled. With the immediate rescan logic, the probability of getting throttled is high for a high volume stream. So this is a gap in this PR. A more interesting unintended consequence exists for middle-of-the-road data rates. In this scenario, we are more likely to have a higher number of GetRecords calls with each returning a fewer number of records than if gave a bit of time to queue up a hunk of records to pull at once. Based on that, I'm thinking it makes sense to
Thoughts? |
@jtackaberry I'd say let's keep it simple for now. If the current PR is working well for you then I'm cool with merging in it's current form. One thing I guess could be interesting in this middle case; is whether |
That sounds like a sensible tweak. So you're not overly concerned about hitting the API frequently enough to trigger throttling? I think we may want another ticker that fires according to the AWS-specified rate limit, which we fall back to in "immediate" mode. There's another use case driving this, which I'm going to open an issue about so we can talk about it further. |
This PR adds an option
WithImmediateRescan()
which, when enabled (it's disabled by default), causes the scan loop to skip the scan tick and immediately go back to Kinesis for another batch.Another change is that the first scan is always performed immediately, without waiting for the next scan tick. This new behavior exists regardless of whether immediate rescanning is enabled.
These two control flow changes combined allow for higher scan intervals, which results in less chatty clients while also preserving throughput. In fact, this slightly increases effective throughput per shard even with the default 250ms scan interval by eliminating any residual delay between batches, which improves the bandwidth-delay product.
A second commit in this PR exposes
MillisBehindLatest
in theRecord
struct passed toScanFunc()
. My use case is to allow applications to provide metrics for per-shard lag times, but there could be other uses as well.This second change is unrelated, but since it's in the same area of code as the immediate rescan option they became co-dependent. If this is a problem, I can remove it for now.