
Add function to expose /proc/softnet_stat data #291

Open
wants to merge 1 commit into master

Conversation

@ChimeraCoder (Contributor) commented Oct 17, 2017

Summary

On Linux systems, we have access to network performance data at /proc/net/softnet_stat. This lets us monitor, among other things, the number of packets we're dropping.

This data is almost wholly undocumented, except in the source code. The packagecloud blog has a great post that goes into more detail.

/proc/net/softnet_stat tells us when the kernel is dropping incoming packets because the NIC's input backlog is out of room. This contrasts with /proc/net/snmp, whose InErrors counter tells us when a socket's receive buffer is full.

If needed, the latter can be tuned with sysctl (net.core.rmem_{max,default}). This is what we're doing when we call unix.SetsockoptInt from within Veneur.
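As a hedged illustration of that receive-buffer tuning (not Veneur's actual code), here is a minimal sketch using golang.org/x/sys/unix's SetsockoptInt; the port number and buffer size are made-up example values.

```go
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	// Listen on an arbitrary example port (8126 is just an illustration).
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 8126})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	raw, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}

	// Ask the kernel for a 4 MiB receive buffer. The effective value is
	// capped by net.core.rmem_max, which is why the sysctl above matters.
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, 4*1024*1024)
	}); err != nil {
		log.Fatal(err)
	}
	if sockErr != nil {
		log.Fatal(sockErr)
	}
}
```

The standard library's (*net.UDPConn).SetReadBuffer does the same thing under the hood; the raw setsockopt form is shown only to mirror the unix.SetsockoptInt call mentioned above.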

Motivation

Verify that UDP network traffic is not a bottleneck

Test plan

Rollout/monitoring/revert plan

r? @cory-stripe
cc @stripe/observability

@stripe-ci

Gerald Rule: Copy Observability on Veneur and Unilog pull requests

cc @stripe/observability
cc @stripe/observability-stripe

@cory-stripe (Contributor)

Neat! Buuuut, this doesn't emit any metrics? (yet?)

@ChimeraCoder (Contributor, Author)

Ah yeah, I should have tagged this as [wip]. This doesn't actually expose the metrics yet; it just adds the function that does the hard part.

@cory-stripe (Contributor)

Wouldn't this be better as an agent check rather than adding it in Veneur?

@ChimeraCoder (Contributor, Author)

I don't think so. This is important enough to Veneur itself that we should provide this functionality ourselves, whether or not Veneur is running alongside the Datadog agent.

@cory-stripe (Contributor)

I'm unsure why this data merits being here when other data doesn't. This feels like a very steep, slippery slope into becoming a system agent…

@ChimeraCoder (Contributor, Author) commented Dec 19, 2017

Here's the way I look at the dividing line: Veneur should be able to diagnose (but not fix) its own health without external involvement. In other words, we should be able to have nothing more than a process supervisor running Veneur (which detects process crashes), and Veneur itself is responsible for coming up with the correct answers to questions like "am I unhealthy, overloaded, or otherwise not behaving properly?" If the answer to any of those is "yes", we can then use external systems to mitigate automatically.

The easiest way we do this today is to crash the process (and have daemontools/systemd pick up the rest), but obviously there are more advanced options as well. (Veneur is not responsible for its own controllability - it's only responsible for coming up with the answers to those questions.)

This is more or less the philosophy of the Operator principle in Kubernetes - while we aren't moving (everything) to Kubernetes (yet), we are adopting the Operator principle more or less across our infrastructure even where we're not running Kubernetes.

So, the reason we want /proc/net/softnet_stat is that it tells us whether Veneur itself is actually working properly. In an ideal world, this would allow us to have a more sophisticated response to /healthcheck - if we're dropping a lot of packets, Veneur could start failing its own healthcheck, causing us to replace it. I'd rather not stick that in the Datadog agent, because

(a) we don't want to tie Veneur needlessly to the Datadog agent; and

(b) that prevents us from being able to leverage this data in Veneur's /healthcheck - Kubernetes gives us a lot of stuff "for free" if we're able to access the data there.
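A rough, hypothetical sketch of the /healthcheck idea described above: fail the check when softnet drops grow between probes. This is not the PR's implementation; the softnetDrops helper, the port, and the policy of treating any increase as unhealthy are invented for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// softnetDrops is a hypothetical stand-in for the PR's SoftnetStat(); it
// would sum the per-CPU "dropped" column from /proc/net/softnet_stat.
func softnetDrops() (uint64, error) {
	return 0, nil // stubbed out for the sketch
}

var lastDrops uint64

func healthcheck(w http.ResponseWriter, r *http.Request) {
	drops, err := softnetDrops()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	prev := atomic.SwapUint64(&lastDrops, drops)
	if drops > prev {
		// Packets were dropped since the last probe: report unhealthy so a
		// supervisor (or Kubernetes) can replace this instance.
		http.Error(w, fmt.Sprintf("softnet drops increased: %d -> %d", prev, drops),
			http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthcheck", healthcheck)
	log.Fatal(http.ListenAndServe("127.0.0.1:8127", nil))
}
```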

@cory-stripe (Contributor) commented Dec 19, 2017

Ok, this is definitely convincing. In theory this provides us with a semblance of utilization… but UDP is trending away from being our primary source of load in favor of domain sockets and streams.

What data would we peek at to get this for TCP? Are there other metrics we consider influential to health in this regard?

I'm being pedantic here because this takes us into new territory and opens the door for other things. Should Veneur monitor disk fullness or other types of things that emit to dmesg? This is really interesting to me, but it seems ripe for a lot of code duplication and half-finished implementations if repeated across various services.

@ChimeraCoder (Contributor, Author) commented Dec 19, 2017

In theory this provides us with a semblance of utilization… but UDP is trending away from being our primary source of load in favor of domain sockets and streams.

Sure - if there's an equivalent for domain sockets (buffer usage?), we should add that too!

What data would we peek at to get this for TCP?

Offhand, I'm not really sure what the equivalent measure is for TCP.

Should Veneur monitor disk fullness or other types of things that emit to dmesg?

No - Veneur doesn't (directly) write to disk. Even if it did, we wouldn't want to track that independently of the errors it would presumably report when attempting to write.

These packet drops matter because Veneur is a bit of a special case - it's consuming the observability primitives themselves, so it needs to be able to tell when it's systematically missing some.

In general, I don't think services should be reading /proc/net/softnet_stat, but if Veneur doesn't, there's no way for us to know whether the data it's reporting from other services is silently incomplete.

In a pure-Kubernetes world, you're right that we would also probably want a DaemonSet that read this data on every node, to aid in scheduling pods on nodes (as opposed to scaling application pods). But we're not in a pure-Kubernetes world, and also that's still at most two places (so, some duplication, but not a ton). And I think that DaemonSet would be something like veneur-prometheus, so it could share the code anyway.

@cory-stripe (Contributor)

Has anyone reported this as a problem though? We don't actively monitor this, but we do have some charts for it. I don't recall it ever being a thing in Veneur.

I agree with your reasons for this, but it feels weird to have one of our ingress mechanisms measured this way and none of the others. For example, we know of active TCP use in the wild, and our passive monitoring doesn't show this as a problem in our infrastructure. It seems that network stack health is a fundamental health question for any network-based service, and that one would be better served using the system agent's mechanisms to monitor and emit this.

Separately, @asf-stripe and I have been discussing high-frequency, side-band evaluation of arbitrary metrics from a Veneur instance for health. For example, if a system agent emitted this data into a local Veneur, and there were a mechanism in Veneur to "peek" into that data locally (think of Prometheus' local endpoint for scraping), then a separate process on the host could react to failed SLOs for network-stack signals and mark health.

In other words, I guess I still feel like this is a very system agenty thing. :/

@ChimeraCoder (Contributor, Author)

(continuing from Slack)

Has anyone reported this as a problem though? We don't actively monitor this, but we do have some charts for it. I don't recall it ever being a thing in Veneur.

Yup - I originally wrote this up because of an issue we were investigating, involving hypothesized overloaded instances.

Separately, @asf-stripe and I have been discussing high-frequency, side-band evaluation of arbitrary metrics from a Veneur instance for health.

We could do that, but it's orthogonal to the real problem at hand: how does an actual instance of Veneur know that it's overloaded? It can't rely on another service to feed it those metrics, because those metrics would be dropped if it's overloaded.

It seems that network stack health is a fundamental health question for any network-based service, and that one would be better served using the system agent's mechanisms to monitor and emit this.

Veneur is in a special place because it's intended to run on every pod, and if it gets overloaded, we lose all visibility into that pod.

This may make more sense once I get around to writing up the "Veneur + Kubernetes" post, but in short: Veneur needs to be able to evaluate the health of itself within a pod, without talking to any external services. We could (optionally) have system agents checking this information as well, but we can't rely on them for determining if Veneur itself is overloaded or not.

@cory-stripe (Contributor)

Ok. This still makes my SRE brain itch, but I'm in FUDland now and that's not helpful.

@cory-stripe (Contributor) left a review comment

Just some nits, you can ignore them if you want.

Thanks for the discussion!

@@ -191,3 +195,81 @@ func startSSFUnix(s *Server, addr *net.UnixAddr) <-chan struct{} {
}()
return done
}

type SoftnetData struct {

This makes logical sense here, but perhaps it's better placed in a more OS-metric-centric package?

// SoftnetStat provides information about UDP traffic, as provided
// on Linux machines by /proc/net/softnet_stat. Its behavior on other
// operating systems is undefined.
func SoftnetStat() (SoftnetData, error) {

This likely should check runtime.GOOS for Linux and squawk? (Although the caller could do that too.)
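A hedged sketch of what the combination might look like: the runtime.GOOS guard suggested above plus parsing of /proc/net/softnet_stat. The package, struct, and field names below are illustrative, not the PR's actual SoftnetData, and they assume the first three hex columns per CPU line are packets processed, packets dropped, and time_squeeze.

```go
package softnet

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// CPUStat is an illustrative per-CPU record; the PR's SoftnetData may differ.
type CPUStat struct {
	Processed   uint64 // column 0: packets taken off the per-CPU backlog
	Dropped     uint64 // column 1: packets dropped because the backlog was full
	TimeSqueeze uint64 // column 2: times net_rx_action ran out of budget or time
}

// ReadSoftnetStat parses /proc/net/softnet_stat, one line per CPU.
func ReadSoftnetStat() ([]CPUStat, error) {
	if runtime.GOOS != "linux" {
		return nil, fmt.Errorf("softnet_stat is only available on linux, not %s", runtime.GOOS)
	}
	f, err := os.Open("/proc/net/softnet_stat")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var stats []CPUStat
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 3 {
			continue
		}
		var vals [3]uint64
		for i := range vals {
			v, err := strconv.ParseUint(fields[i], 16, 64)
			if err != nil {
				return nil, err
			}
			vals[i] = v
		}
		stats = append(stats, CPUStat{Processed: vals[0], Dropped: vals[1], TimeSqueeze: vals[2]})
	}
	return stats, sc.Err()
}
```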
