DNSTAP socket shows errors after operating for some time #20744
Comments
Hey John, what's the source?
The source is dnsdist/dnstap (latest release). It may be that dnsdist/dnstap sends an error of some sort, but that should account for only one event. Vector also goes many days without falling into this error condition, under the same load the whole time. It would be highly unexpected for dnsdist/dnstap to suddenly shift into an "all-errors" modality, since no data flows backwards from Vector to dnsdist - it should be just "ack" packets. Also, the fact that the stats don't seem to change while only the error counters rise is unusual and points away from dnstap being the source of the issue. Lastly, my node graphs show no increase in packet or byte count between the DNSTAP origin and this Vector host, so I don't think this is a dnsdist-side issue - or at least not the cause of the ongoing large jump in error counters.
Update: In my last event instance today, "systemctl restart vector" did clear the error condition.
Looking at the code, I think @james-stevens is right - it comes from the frame reader. The odd thing to me here is that there is no effect other than the higher error count, since this should pretty much stop any events from flowing - error frames are just skipped.
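To illustrate the "skipped" behaviour being described - a minimal sketch of a length-prefixed frame-reading loop, not Vector's actual dnstap source code; `read_frame`, `decode_dnstap`, and `process` are illustrative names:

```rust
use std::io::{self, Read};

// Stand-in for a decoded dnstap payload.
struct DnstapEvent;

// Hypothetical decoder: fails on frames it cannot parse.
fn decode_dnstap(frame: &[u8]) -> Result<DnstapEvent, &'static str> {
    if frame.is_empty() { Err("empty frame") } else { Ok(DnstapEvent) }
}

fn process(_event: DnstapEvent) {
    // hand the event to the rest of the pipeline
}

// Frame Streams data frames: a 32-bit big-endian length prefix followed
// by that many payload bytes (control frames omitted for brevity).
fn read_frame(stream: &mut impl Read) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    stream.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut frame = vec![0u8; len];
    stream.read_exact(&mut frame)?;
    Ok(frame)
}

// Returns the number of frames that failed to decode once the peer closes.
fn run(stream: &mut impl Read) -> io::Result<u64> {
    let mut error_count = 0u64;
    loop {
        let frame = match read_frame(stream) {
            Ok(f) => f,
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(error_count),
            Err(e) => return Err(e),
        };
        match decode_dnstap(&frame) {
            Ok(event) => process(event),
            // A bad frame only bumps the counter and is then dropped, so a
            // connection stuck in this state raises the error metric
            // without emitting any events downstream.
            Err(_) => error_count += 1,
        }
    }
}
```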
We've been running high-volume load tests. So what we get is back-pressure at the sink, which causes dropped frames at the dnsdist end.
Our servers have plenty of spare capacity and none of the Vector threads are over 20% CPU, so it's really not clear where the bottleneck is - but that's a different problem!
Thanks @james-stevens. Do you have a way of testing |
Another point I haven't mentioned: once the errors start, their rate is remarkably "flat", even though the number of events being sent to the DNSTAP source fluctuates quite a bit, at least according to what is recorded by the Prometheus metrics in Vector. Traffic from dnsdist to the DNSTAP consumer shifts between 16k and 26k events per second, yet the error rate shown in the graphs above stays almost exactly at 18.2k errors per second - the exact values matter less than the fact that the error rate shows no variation despite the variation in the dnstap message rate. Just as weird, the events-per-second rate downstream of the dnstap ingestion remap fluctuates normally with volume, so the rest of the system does not "see" the errors that show up in the graph of vector_component_errors_total.
What metric(s) are you looking at in that graph? We're not getting the metric vector_component_errors_total at all. What I'd check is whether your expire_metrics_secs setting is expiring it.
Yes, the metric is "vector_component_errors_total". Our expire_metrics_secs is set to 22265 (yes, very long, because we have metric aggregations that run for more than 6 hours in some cases).
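For reference, expire_metrics_secs is a global option at the top level of the Vector config; a minimal sketch of the setting under discussion (the TOML form and comments are assumed - the reporter's actual config isn't shown):

```toml
# vector.toml (fragment, assumed layout)
# Keep internal metrics alive long enough to outlast long-running
# aggregations; 22265 s is roughly 6.2 hours.
expire_metrics_secs = 22265
```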
I can only see errors related to scraping metrics when under load - we use Vector to aggregate all the metrics on the server so Prometheus doesn't need to be reconfigured every time we change things about. We also had a few "disk full" incidents when running 300Kq/s tests for five minutes; 350MB/s of disk writes adds up fast.

The way we look for dropped frames is to get dnsdist to log its dnstap frame counts. We then wrote some Lua to parse the log line and convert it into three new dnsdist metrics: dnstap frames sent, dropped, and failed. No idea what the difference is between those last two, but we don't see any failed, just dropped, when the load gets too high.
A note for the community
Problem
After a period of time, a heavily loaded DNSTAP ingestion system shows socket errors on the connected sockets that are transmitting DNSTAP data to it. It is uncertain whether this is a reporting error or a real error.
Configuration
Version
vector 0.39.0 (x86_64-unknown-linux-gnu)
Debug Output
Example Data
No response
Additional Context
This particular system I am testing on has two high-volume ingestion streams (>14kqps each) and two low-volume streams (~30qps each), connected to two different dnsdist instances and feeding two different contexts. After some period of time, errors on the high-volume context importing from the dnstap source will jump from zero to around 180,000 per second - which doesn't make sense: how can there be more errors than ingested elements? (I'm graphing with "irate(vector_component_errors_total[5m])" in Prometheus/Grafana.) I suspect this is one of the two server sockets showing signs of the problem. Then, after a random number of hours (often measured in days), the number of errors will jump to around double that amount (see graph, which shows one stream being "bad" for several days, then a spike when the other large stream starts showing errors).
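For what it's worth, breaking the same query out per component can show which socket/source the errors are attributed to; a sketch, assuming Vector's usual component_id and error_type labels on this metric:

```promql
# Query used in the report:
irate(vector_component_errors_total[5m])

# Broken down by component and error type (label names assumed):
sum by (component_id, error_type) (
  irate(vector_component_errors_total[5m])
)
```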
Strangely, I see no evidence of this increased error rate in any of the downstream components, either in their graph data (I graph pretty much everything coming out of Vector) or in the actual output generated at the end of the pipelines. Are these errors real? The debug messages certainly seem to indicate that there is a problem.
Other items of note: reloading Vector does not cure the problem. Even more unusually, "systemctl restart vector" also does not cure it. Only "systemctl stop vector; systemctl start vector" causes the error graph to drop to zero and the error messages to stop being generated.
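Summarising the above as commands (the behaviour noted in this report; an earlier comment records one instance where a plain restart did clear it):

```sh
systemctl reload vector     # errors persist
systemctl restart vector    # errors still persist
systemctl stop vector
systemctl start vector      # error graph drops to zero, messages stop
```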
This machine is fairly heavily loaded and performs only Vector tasks (>85% utilization in htop across all cores at some parts of the day).
There are other DNSTAP parsing errors seen sporadically, but they seem to be related to malformed DNS packets or to elements of the DNSTAP message that are not yet fully parsed. I did not include those log lines.
I have other protobuf sockets operating on this system that are fairly busy (four at ~3kqps each) but which are not DNSTAP. I also have many v6 kafka streams as sinks.
References
No response