Add function to expose /proc/softnet_stat data #291
base: master
Conversation
Gerald Rule: Copy Observability on Veneur and Unilog pull requests (cc @stripe/observability)
Neat! Buuuut, this doesn't emit any metrics? (yet?)
Ah yeah, I should have tagged this as [wip]. This doesn't actually expose the metrics yet; it just adds the function that does the hard part.
Wouldn't this be better as an agent check rather than adding it in Veneur?
I don't think so. This is important enough to Veneur itself that we should provide the means for this functionality whether or not Veneur is running alongside the Datadog agent.
I'm unsure why this data merits being here but others don't. This feels like a very steep, slippery slope into being a system agent…
Here's the way I look at the dividing line: Veneur should be able to diagnose (but not fix) its own health without external involvement. In other words, we should be able to run Veneur with nothing more than a process supervisor (which detects process crashes), and Veneur itself is responsible for coming up with the correct answers to questions like "am I unhealthy, overloaded, or otherwise not behaving properly?". If the answer to any of those is "yes", we can then use external systems to mitigate automatically. The easiest way we do this today is to crash the process (and have daemontools/systemd pick up the rest), but obviously there are more advanced options as well. (Veneur is not responsible for its own controllability - it's only responsible for coming up with the answers to those questions.)

This is more or less the philosophy of the Operator principle in Kubernetes. While we aren't moving (everything) to Kubernetes (yet), we are adopting the Operator principle more or less across our infrastructure even where we're not running Kubernetes.

So, the reasons are: (a) we don't want to tie Veneur needlessly to the Datadog agent; and (b) doing so would prevent us from being able to leverage this data in Veneur itself.
Ok, this is definitely convincing. In theory this provides us a semblance of utilization… but UDP is trending away from being our primary source of load, in favor of domain sockets and streams. What data would we peek at to get this for TCP? Are there other metrics we consider influential to health in this regard? I'm being pedantic here because this takes us into new territory and opens the door for other things. Should veneur monitor disk fullness or other types of things that emit to…?
Sure - if there's an equivalent for domain sockets (buffer usage?), we should add that too!
Offhand, I'm not really sure what the equivalent measure is for TCP.
No - veneur doesn't (directly) write to disk. Even if it did, we wouldn't want to track that independently of the errors it would presumably report when attempting to write. The reason these packet drops are important is that Veneur is a bit of a special case - it's consuming the observability primitives themselves, so it needs to be able to know if it's systematically missing any. In general, I don't think services should be reading /proc.

In a pure-Kubernetes world, you're right that we would also probably want a DaemonSet that read this data on every node, to aid in scheduling pods on nodes (as opposed to scaling application pods). But we're not in a pure-Kubernetes world, and also that's still at most two places (so, some duplication, but not a ton). And I think that DaemonSet would be something like…
Has anyone reported this as a problem, though? We don't actively monitor this, but we do have some charts for it. I don't recall it ever being a thing in Veneur. I agree with your reasons for this, but it feels weird to have one of our ingress mechanisms measured in this way but none of the others. For example, we know of active TCP use in the wild. We don't have this problem in our infrastructure as seen by our passive monitoring. It seems that network stack health is a fundamental health question for any network-based service, and that one would be better served using the system agent's mechanisms to monitor and emit this.

Separately, @asf-stripe and I have been discussing high-frequency side-band evaluation of arbitrary metrics from a Veneur instance for health. For example, if a system agent emitted this data into a local Veneur and there were a mechanism in Veneur to "peek" into that data locally (think of Prometheus' local endpoint for scraping), then a separate process on the host could react to failed SLOs for network stack signals and mark health.

In other words, I guess I still feel like this is a very system agenty thing. :/
(continuing from Slack)
Yup - I originally wrote this up because of an issue we were investigating, involving hypothesized overloaded instances.
We could do that, but it's orthogonal to the real problem at hand: how does an actual instance of Veneur know that it's overloaded? It can't rely on another service to feed it those metrics, because those metrics would be dropped if it's overloaded.
Veneur's in a special place because it's intended to run on every pod, and, in addition, if it gets overloaded, then we lose all visibility into the pod. This may make more sense once I get around to writing up the "Veneur + Kubernetes" post, but in short: Veneur needs to be able to evaluate the health of itself within a pod, without talking to any external services. We could (optionally) have system agents checking this information as well, but we can't rely on them for determining if Veneur itself is overloaded or not.
Ok. This still makes my SRE brain itch, but I'm in FUDland now and that's not helpful.
Just some nits, you can ignore them if you want.
Thanks for the discussion!
@@ -191,3 +195,81 @@ func startSSFUnix(s *Server, addr *net.UnixAddr) <-chan struct{} {
	}()
	return done
}

type SoftnetData struct {
This makes logical sense here, but perhaps it would be better placed in a more OS-metric-centric package?
// SoftnetStat provides information about UDP traffic, as provided
// on Linux machines by /proc/net/softnet_stat. Its behavior on other
// operating systems is undefined.
func SoftnetStat() (SoftnetData, error) { |
This likely should check runtime for Linux and squawk? (Although the caller could do that too.)
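For illustration, here is a rough sketch of what a GOOS guard plus a softnet_stat parser could look like. The struct layout, field names, and error handling below are assumptions for this sketch, not the PR's actual implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// SoftnetDataLine holds per-CPU counters from one row of
// /proc/net/softnet_stat. Field names are illustrative only.
type SoftnetDataLine struct {
	Processed   uint64 // packets processed by this CPU
	Dropped     uint64 // packets dropped because the backlog queue was full
	TimeSqueeze uint64 // times net_rx_action ran out of budget or time
}

// readSoftnetStat parses /proc/net/softnet_stat, one line per CPU.
// Values in the file are whitespace-separated hexadecimal counters.
func readSoftnetStat() ([]SoftnetDataLine, error) {
	if runtime.GOOS != "linux" {
		return nil, fmt.Errorf("softnet_stat is only available on Linux, not %s", runtime.GOOS)
	}
	f, err := os.Open("/proc/net/softnet_stat")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var lines []SoftnetDataLine
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 3 {
			continue
		}
		var vals [3]uint64
		for i := 0; i < 3; i++ {
			v, err := strconv.ParseUint(fields[i], 16, 64)
			if err != nil {
				return nil, err
			}
			vals[i] = v
		}
		lines = append(lines, SoftnetDataLine{
			Processed:   vals[0],
			Dropped:     vals[1],
			TimeSqueeze: vals[2],
		})
	}
	return lines, scanner.Err()
}

func main() {
	stats, err := readSoftnetStat()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for cpu, s := range stats {
		fmt.Printf("cpu%d processed=%d dropped=%d time_squeeze=%d\n", cpu, s.Processed, s.Dropped, s.TimeSqueeze)
	}
}
```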
Summary
On Linux systems, we have access to network performance data at /proc/net/softnet_stat. This allows us to do things like monitor the number of packets we're dropping. This data is almost wholly undocumented, except in the source code; the packagecloud blog has a great post that goes into more detail.

/proc/net/softnet_stat tells us whether the NIC is out of memory. This contrasts with /proc/net/snmp, which gives us InErrors, which tells us when the socket's receive buffer is full. If needed, the latter can be configured with sysctl (net.core.rmem_{max,default}). This is what we're doing when we set unix.SetsockoptInt from within Veneur.
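For context, a minimal sketch of raising the receive buffer on a UDP listener. This is illustrative, not Veneur's actual listener code; the address and buffer size are made up, and the value is still capped by net.core.rmem_max:

```go
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

// listenUDPWithBuffer opens a UDP socket and raises SO_RCVBUF so bursts of
// datagrams are less likely to be dropped before user space reads them.
func listenUDPWithBuffer(addr string, rcvbuf int) (*net.UDPConn, error) {
	udpAddr, err := net.ResolveUDPAddr("udp", addr)
	if err != nil {
		return nil, err
	}
	conn, err := net.ListenUDP("udp", udpAddr)
	if err != nil {
		return nil, err
	}
	raw, err := conn.SyscallConn()
	if err != nil {
		conn.Close()
		return nil, err
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		// Request a larger kernel receive buffer; the kernel clamps this
		// to net.core.rmem_max.
		sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, rcvbuf)
	}); err != nil {
		conn.Close()
		return nil, err
	}
	if sockErr != nil {
		conn.Close()
		return nil, sockErr
	}
	return conn, nil
}

func main() {
	// Hypothetical address and 4 MiB buffer, for illustration only.
	conn, err := listenUDPWithBuffer("127.0.0.1:8126", 4*1024*1024)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	log.Printf("listening on %s", conn.LocalAddr())
}
```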
Motivation
Verify that UDP network traffic is not a bottleneck
Test plan
Rollout/monitoring/revert plan
r? @cory-stripe
cc @stripe/observability