Retry on Datadog API errors #138

Merged: 6 commits from 108-try-try-again into main on Dec 5, 2024

Conversation

@Mr0grog (Collaborator) commented on Dec 5, 2024

This adds retry support for submitting metrics. It’s rare, but at scale most users have probably encountered some failures to send metrics (although they might not have noticed it in their logs). Fixes #108. This is the last major item for v0.13.0!

There are a bunch of ways to approach retries, but they fall into two main categories:

  1. Retry API requests N times if they fail.
  2. Keep a queue of metrics to send and add failed metrics back onto that queue, so they automatically get retried on the next flush.

I took approach 1 here, since it was much simpler to implement in the current architecture. For long-running services (e.g. a web server), approach 2 is probably slightly better, but after some brief experimentation, it felt like implementing it required a lot of coordination between the reporter and logger about which failures are retryable, plus new tooling for measuring the size of a metric object in order to manage the queue size. We might still want to do it in the future, but I wanted to keep things a little simpler for now.
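
For context, here's a purely hypothetical sketch of what approach 2 might look like (none of these names exist in this library; it just illustrates the queue-and-re-enqueue idea and where the size-measurement and retryability questions come in):

// Hypothetical sketch of approach 2: failed metrics go back on the queue
// and get retried on the next flush.
class MetricQueue {
  constructor(maxSize = 10000) {
    this.items = [];
    this.maxSize = maxSize;
  }

  enqueue(metrics) {
    this.items.push(...metrics);
    // Keeping the queue bounded requires some way to measure how "big" a
    // metric is; here we just count entries, which is the easy part.
    if (this.items.length > this.maxSize) {
      this.items.splice(0, this.items.length - this.maxSize);
    }
  }

  // `send` is whatever actually submits a batch to the API. Deciding whether
  // a given failure is safe to retry is the coordination problem mentioned above.
  async flush(send) {
    const batch = this.items.splice(0);
    try {
      await send(batch);
    } catch (error) {
      this.enqueue(batch); // failed metrics go back on the queue
    }
  }
}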

The underlying @datadog/datadog-api-client library added built-in retry support in v1.17.0, but it turns out to only retry on HTTP errors (when the server responds with a >= 400 status code) and not on network errors (e.g. broken connections). In my experience, network errors make up a large share of the failures you actually see, so those are important to cover. So this implementation includes its own complete (but simple) retry and backoff algorithm instead of relying on the built-in one. 🤷
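
To illustrate, here's a rough sketch of the kind of retry-and-backoff loop this describes (not the actual code from this PR; sendOnce, sleep, and the retryable check are stand-ins, and the real policy for which responses count as retryable may differ):

// Sketch of retry with exponential backoff. `sendOnce` is any async function
// that resolves on success, rejects with a network error (no `status`), or
// rejects with an HTTP error carrying a `status` property.
const sleep = (seconds) =>
  new Promise((resolve) => setTimeout(resolve, seconds * 1000));

async function sendWithRetries(sendOnce, { retries = 2, retryBackoff = 1 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await sendOnce();
    } catch (error) {
      // Treat network errors (no status) and 429/5xx responses as retryable
      // here; the exact policy is a detail of the real implementation.
      const status = error.status;
      const retryable = status == null || status === 429 || status >= 500;
      if (!retryable || attempt >= retries) throw error;
      await sleep(retryBackoff * 2 ** attempt); // 1s, 2s, 4s, ...
    }
  }
}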

You can configure retries through the `retries` (how many) and `retryBackoff` (how long to delay) options:

metrics.init({
  // The default is 2.
  retries: 3,

  // The default is 1.
  // Subsequent retries multiply this by powers of two, so this produces retries after:
  //   1 second, 2 seconds, 4 seconds
  retryBackoff: 1
});
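
In other words, with the schedule in the comments above, the delay before retry attempt n (0-based) works out to roughly retryBackoff * 2^n seconds. A trivial illustration (not code from this PR):

// Delay in seconds before retry attempt `n`, per the schedule described above.
const delayFor = (n, retryBackoff = 1) => retryBackoff * 2 ** n;
// delayFor(0) === 1, delayFor(1) === 2, delayFor(2) === 4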

⚠️ Still not sure about: This changes the constructor for DatadogReporter to use an options object instead of positional arguments, since it has so many options now. I’ve deprecated the old positional-arguments signature, but I doubt direct usage of that class is common, so it might be fine to just make a breaking change (the only example I could find in real code on GitHub only uses it to support the DD_API_KEY environment variable, which I made work automatically in #135). Will sleep on it before merging. (Update: made this a breaking change.)

This relies on the built-in retry functionality added to `@datadog/datadog-api-client` to retry API requests on HTTP errors. It keeps things simple, but the main drawback is that it does not retry on *network* errors (when the Datadog server on the other end is unable to receive or reply to the request in the first place). There's an open issue for that on the Datadog client repo, but no replies.

Fixes #108.
A lot of intermittent API problems are networking issues that prevent HTTP requests from completing at all. The official Datadog client's retry does not handle any of these, so this adds a totally custom retry implementation in order to do so.
@Mr0grog (Collaborator, Author) commented on Dec 5, 2024

I’ve deprecated the old positional arguments signature [to DatadogReporter], but I doubt direct usage of that class is common, so it might be fine to just make a breaking change

After thinking on this, I decided to make this a breaking change. The class is so rarely used directly by public consumers that I don’t think maintaining support for both signatures made sense.

Mr0grog merged commit 64a6249 into main on Dec 5, 2024
10 checks passed
Mr0grog deleted the 108-try-try-again branch on December 5, 2024 at 22:10