
Add function to expose /proc/softnet_stat data #291

Open
wants to merge 1 commit into master

Conversation

@ChimeraCoder (Contributor) commented Oct 17, 2017

Summary

On Linux systems, we have access to network performance data at /proc/net/softnet_stat. This lets us monitor, among other things, the number of packets we're dropping.

This data is almost wholly undocumented, except in the source code. The packagecloud blog has a great post that goes into more detail.

/proc/net/softnet_stat tells us when the kernel is dropping incoming packets because the NIC's input backlog is out of room. This contrasts with /proc/net/snmp, whose InErrors counter tells us when a socket's receive buffer is full.

If needed, the latter can be tuned with sysctl (net.core.rmem_{max,default}). This is what we're doing when we call unix.SetsockoptInt from within Veneur.
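As a hedged illustration of that receive-buffer tuning (not Veneur's actual code), here is a minimal sketch using golang.org/x/sys/unix's SetsockoptInt; the port number and buffer size are made-up example values.

```go
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	// Listen on an arbitrary example port (8126 is just an illustration).
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 8126})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	raw, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}

	// Ask the kernel for a 4 MiB receive buffer. The effective value is
	// capped by net.core.rmem_max, which is why the sysctl above matters.
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, 4*1024*1024)
	}); err != nil {
		log.Fatal(err)
	}
	if sockErr != nil {
		log.Fatal(sockErr)
	}
}
```

The standard library's (*net.UDPConn).SetReadBuffer does the same thing under the hood; the raw setsockopt form is shown only to mirror the unix.SetsockoptInt call mentioned above.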

Motivation

Verify that UDP network traffic is not a bottleneck

Test plan

Rollout/monitoring/revert plan

r? @cory-stripe
cc @stripe/observability

@stripe-ci

Gerald Rule: Copy Observability on Veneur and Unilog pull requests

cc @stripe/observability
cc @stripe/observability-stripe

@cory-stripe (Contributor)

Neat! Buuuut, this doesn't emit any metrics? (yet?)

@ChimeraCoder (Contributor, Author)

Ah yeah, I should have tagged this as [wip]. This doesn't actually expose the metrics yet; it just adds the function that does the hard part.

@cory-stripe (Contributor)

Wouldn't this be better as an agent check rather than adding it in Veneur?

@ChimeraCoder (Contributor, Author)

I don't think so. This is important enough to Veneur itself that we should provide this functionality ourselves, whether or not Veneur is running alongside the Datadog agent.

@cory-stripe (Contributor)

I'm unsure why this data merits being here when other data doesn't. This feels like a very steep, slippery slope into becoming a system agent…

@ChimeraCoder (Contributor, Author) commented Dec 19, 2017

Here's the way I look at the dividing line: Veneur should be able to diagnose (but not fix) its own health without external involvement. In other words, we should be able to have nothing more than a process supervisor running Veneur (which detects process crashes), and Veneur itself is responsible for coming up with the correct answers to questions like "am I unhealthy, overloaded, or otherwise not behaving properly?" If the answer to any of those is "yes", we can then use external systems to mitigate automatically.

The easiest way we do this today is to crash the process (and have daemontools/systemd pick up the rest), but obviously there are more advanced options as well. (Veneur is not responsible for its own controllability - it's only responsible for coming up with the answers to those questions.)

This is more or less the philosophy of the Operator principle in Kubernetes - while we aren't moving (everything) to Kubernetes (yet), we are adopting the Operator principle more or less across our infrastructure even where we're not running Kubernetes.

So, the reason we want /proc/net/softnet_stat is that it tells us whether Veneur itself is actually working properly. In an ideal world, this would allow us to have a more sophisticated response to /healthcheck - if we're dropping a lot of packets, Veneur could start failing its own healthcheck, causing us to replace it. I'd rather not stick that in the Datadog agent, because

(a) we don't want to tie Veneur needlessly to the Datadog agent; and

(b) that prevents us from being able to leverage this data in Veneur's /healthcheck - Kubernetes gives us a lot of stuff "for free" if we're able to access the data there.
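A rough, hypothetical sketch of the /healthcheck idea described above: fail the check when softnet drops grow between probes. This is not the PR's implementation; the softnetDrops helper, the port, and the policy of treating any increase as unhealthy are invented for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// softnetDrops is a hypothetical stand-in for the PR's SoftnetStat(); it
// would sum the per-CPU "dropped" column from /proc/net/softnet_stat.
func softnetDrops() (uint64, error) {
	return 0, nil // stubbed out for the sketch
}

var lastDrops uint64

func healthcheck(w http.ResponseWriter, r *http.Request) {
	drops, err := softnetDrops()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	prev := atomic.SwapUint64(&lastDrops, drops)
	if drops > prev {
		// Packets were dropped since the last probe: report unhealthy so a
		// supervisor (or Kubernetes) can replace this instance.
		http.Error(w, fmt.Sprintf("softnet drops increased: %d -> %d", prev, drops),
			http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthcheck", healthcheck)
	log.Fatal(http.ListenAndServe("127.0.0.1:8127", nil))
}
```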

@cory-stripe (Contributor) commented Dec 19, 2017

Ok, this is definitely convincing. In theory this provides us with a semblance of utilization… but UDP is trending away from being our primary source of load in favor of domain sockets and streams.

What data would we peek at to get this for TCP? Are there other metrics we consider influential to health in this regard?

I'm being pedantic here because this takes us into new territory and opens the door for other things. Should Veneur monitor disk fullness or other types of things that emit to dmesg? This is really interesting to me, but it seems ripe for a lot of code duplication and half-finished implementations if repeated across various services.

@ChimeraCoder (Contributor, Author) commented Dec 19, 2017

In theory this provides us with a semblance of utilization… but UDP is trending away from being our primary source of load in favor of domain sockets and streams.

Sure - if there's an equivalent for domain sockets (buffer usage?), we should add that too!

What data would we peek at to get this for TCP?

Offhand, I'm not really sure what the equivalent measure is for TCP.

Should Veneur monitor disk fullness or other types of things that emit to dmesg?

No - Veneur doesn't (directly) write to disk. Even if it did, we wouldn't want to track that independently of the errors it would presumably report when attempting to write.

These packet drops matter because Veneur is a bit of a special case - it's consuming the observability primitives themselves, so it needs to be able to tell when it's systematically missing some.

In general, I don't think services should be reading /proc/net/softnet_stat, but if Veneur doesn't, there's no way for us to know whether the data it's reporting from other services is silently incomplete.

In a pure-Kubernetes world, you're right that we would also probably want a DaemonSet that read this data on every node, to aid in scheduling pods on nodes (as opposed to scaling application pods). But we're not in a pure-Kubernetes world, and also that's still at most two places (so, some duplication, but not a ton). And I think that DaemonSet would be something like veneur-prometheus, so it could share the code anyway.

@cory-stripe (Contributor)

Has anyone reported this as a problem though? We don't actively monitor this, but we do have some charts for it. I don't recall it ever being a thing in Veneur.

I agree with your reasons for this, but it feels weird to have one of our ingress mechanisms measured this way and none of the others. For example, we know of active TCP use in the wild, and our passive monitoring doesn't show this as a problem in our infrastructure. It seems that network stack health is a fundamental health question for any network-based service, and that one would be better served using the system agent's mechanisms to monitor and emit this.

Separately, @asf-stripe and I have been discussing high-frequency, side-band evaluation of arbitrary metrics from a Veneur instance for health. For example, if a system agent emitted this data into a local Veneur, and there were a mechanism in Veneur to "peek" into that data locally (think of Prometheus' local endpoint for scraping), then a separate process on the host could react to failed SLOs for network-stack signals and mark health.

In other words, I guess I still feel like this is a very system agenty thing. :/

@ChimeraCoder (Contributor, Author)

(continuing from Slack)

Has anyone reported this as a problem though? We don't actively monitor this, but we do have some charts for it. I don't recall it ever being a thing in Veneur.

Yup - I originally wrote this up because of an issue we were investigating, involving hypothesized overloaded instances.

Separately, @asf-stripe and I have been discussing high-frequency, side-band evaluation of arbitrary metrics from a Veneur instance for health.

We could do that, but it's orthogonal to the real problem at hand: how does an actual instance of Veneur know that it's overloaded? It can't rely on another service to feed it those metrics, because those metrics would be dropped if it's overloaded.

It seems that network stack health is a fundamental health question for any network-based service, and that one would be better served using the system agent's mechanisms to monitor and emit this.

Veneur is in a special place because it's intended to run on every pod, and if it gets overloaded, we lose all visibility into that pod.

This may make more sense once I get around to writing up the "Veneur + Kubernetes" post, but in short: Veneur needs to be able to evaluate the health of itself within a pod, without talking to any external services. We could (optionally) have system agents checking this information as well, but we can't rely on them for determining if Veneur itself is overloaded or not.

@cory-stripe (Contributor)

Ok. This still makes my SRE brain itch, but I'm in FUDland now and that's not helpful.

@cory-stripe (Contributor) left a review comment

Just some nits, you can ignore them if you want.

Thanks for the discussion!

@@ -191,3 +195,81 @@ func startSSFUnix(s *Server, addr *net.UnixAddr) <-chan struct{} {
}()
return done
}

type SoftnetData struct {

This makes logical sense here, but perhaps it's better placed in a more OS-metric-centric package?

// SoftnetStat provides information about UDP traffic, as provided
// on Linux machines by /proc/net/softnet_stat. Its behavior on other
// operating systems is undefined.
func SoftnetStat() (SoftnetData, error) {

This likely should check runtime.GOOS for Linux and squawk? (Although the caller could do that too.)
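A hedged sketch of what the combination might look like: the runtime.GOOS guard suggested above plus parsing of /proc/net/softnet_stat. The package, struct, and field names below are illustrative, not the PR's actual SoftnetData, and they assume the first three hex columns per CPU line are packets processed, packets dropped, and time_squeeze.

```go
package softnet

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// CPUStat is an illustrative per-CPU record; the PR's SoftnetData may differ.
type CPUStat struct {
	Processed   uint64 // column 0: packets taken off the per-CPU backlog
	Dropped     uint64 // column 1: packets dropped because the backlog was full
	TimeSqueeze uint64 // column 2: times net_rx_action ran out of budget or time
}

// ReadSoftnetStat parses /proc/net/softnet_stat, one line per CPU.
func ReadSoftnetStat() ([]CPUStat, error) {
	if runtime.GOOS != "linux" {
		return nil, fmt.Errorf("softnet_stat is only available on linux, not %s", runtime.GOOS)
	}
	f, err := os.Open("/proc/net/softnet_stat")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var stats []CPUStat
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 3 {
			continue
		}
		var vals [3]uint64
		for i := range vals {
			v, err := strconv.ParseUint(fields[i], 16, 64)
			if err != nil {
				return nil, err
			}
			vals[i] = v
		}
		stats = append(stats, CPUStat{Processed: vals[0], Dropped: vals[1], TimeSqueeze: vals[2]})
	}
	return stats, sc.Err()
}
```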
