DNSTAP socket shows errors after operating for some time #20744

Open
johnhtodd opened this issue Jun 26, 2024 · 10 comments
Labels
source: dnstap (anything `dnstap` source related) · type: bug (a code related bug)

Comments

@johnhtodd

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

After a period of time, a heavily loaded DNSTAP ingestion system shows socket errors on the connected sockets that are transmitting DNSTAP data to it. It is unclear whether this is a reporting error or a real error.

Configuration

In this config, I use environment variables for configuration:
VECTORIPV4=10.10.3.100
VECTORDNSTAPPORT=59001


  main_dnstap_1:
    type: "dnstap"
    mode: tcp
    address: "${VECTORIPV4:?err}:${VECTORDNSTAPPORT:?err}"
    permit_origin: ["${VECTORIPV4NETWORK:?err}", "127.0.0.1/32"]
    lowercase_hostnames: true

Version

vector 0.39.0 (x86_64-unknown-linux-gnu)

Debug Output

This is not a crash, so no backtrace is included, though I can attempt one if desired.

2024-06-24T15:18:42.687252Z ERROR source{component_kind="source" component_id=main_dnstap_1 component_type=dnstap}:connection{peer_addr=10.10.3.232:35402}:connection: vector::internal_events::tcp: TCP socket error. error=bytes remaining on stream peer_addr=10.10.3.232:35402 error_type="connection_failed" stage="processing" internal_log_rate_limit=true
2024-06-24T15:18:42.687482Z ERROR source{component_kind="source" component_id=main_dnstap_1 component_type=dnstap}:connection{peer_addr=10.10.3.232:35402}:connection: vector::internal_events::tcp: Internal log [TCP socket error.] is being suppressed to avoid flooding.

Example Data

No response

Additional Context

This particular system I am testing on has two high-volume (>14 kqps each) ingestion streams and two low-volume streams (~30 qps each), connected to two different dnsdist instances and feeding two different contexts. After some period of time, errors jump from zero to around 180,000 per second on the high-volume context that is importing from the dnstap source (which doesn't make sense: how can there be more errors than ingestion elements? I'm graphing with "irate(vector_component_errors_total[5m])" in Prometheus/Grafana). I suspect this is one of the two server sockets showing signs of the problem. Then, after a random number of hours (often measured in days), the number of errors jumps to around double that amount. (See the graph below, which shows one stream as "bad" for several days, then a spike when the other large stream starts showing errors.)
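(For context on where that metric comes from - a minimal sketch, assuming the usual internal_metrics source feeding a prometheus_exporter sink; the component names and listen address below are illustrative, not taken from this setup:)

sources:
  vector_metrics:
    type: "internal_metrics"

sinks:
  prometheus_out:
    type: "prometheus_exporter"
    inputs: ["vector_metrics"]
    address: "0.0.0.0:9598"

With something like this in place, vector_component_errors_total carries a component_id label, so the rate can also be narrowed to a single source, e.g. irate(vector_component_errors_total{component_id="main_dnstap_1"}[5m]).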

Strangely, I see no evidence of this increased error rate on any of the downstream components, neither in their graph data (I graph pretty much everything coming out of Vector) nor in the actual output generated at the end of the pipelines. Are these errors real? The debug messages certainly seem to indicate that there is a problem.

Other items of note: Reloading vector does not cure the problem. Even more unusually, "systemctl restart vector" also does not cure the problem. Only "systemctl stop vector;systemctl start vector" causes the error graph to drop to zero and the error messages to stop being generated.

This machine is fairly heavily loaded, performing only Vector tasks (>85% utilization on htop across all cores at some parts of the day.)

There are other DNSTAP parsing errors seen sporadically, but they seem to be related to malformed DNS packets or elements in the DNSTAP message that are not yet parsed fully. I did not include those log lines.

I have other protobuf sockets operating on this system that are fairly busy (four at ~3kqps each) but which are not DNSTAP. I also have many v6 kafka streams as sinks.

[Screenshot: error-rate graph - Screen Shot 2024-06-24 at 8:38:08 AM]

References

No response

@johnhtodd added the type: bug label on Jun 26, 2024
@james-stevens

Hey John, what's the dnstap source?

error=bytes remaining on stream sounds to me like vector thought there were errors in the frame format

@johnhtodd
Author

johnhtodd commented Jun 27, 2024

Hey John, what's the dnstap source?

error=bytes remaining on stream sounds to me like vector thought there were errors in the frame format

The source is dnsdist/dnstap (latest release). It may be the case that dnsdist/dnstap sends an error of some sort, but that should only be for one event. And vector goes many days without falling into this error condition, with the same load all the time. It would be highly unexpected for dnsdist/dnstap to suddenly shift to an "all-errors" modality, since there is no backwards data sent from vector to dnsdist - it should be just "ack" packets. Also, the fact that the stats don't seem to change while just the error counters rise seems unusual and points away from dnstap being the source of the issue. Lastly, there is no increase in packet or byte count between the DNSTAP origin and this vector host as indicated by my node graphs, so I don't think this is a dnsdist-side issue, or at least not the cause of the ongoing large jump in error counters.

@johnhtodd
Author

Update: In my last event instance today, "systemctl restart vector" did clear the error condition.

@esensar
Contributor

esensar commented Jul 2, 2024

Looking at the code, I think @james-stevens is right - it comes from the frame reader. The odd thing to me here is that there is no effect other than a higher error count, since this should pretty much stop any events from flowing - error frames are just skipped.

@jszwedko added the source: dnstap label on Jul 2, 2024
@james-stevens

We've been running dnsdist->vector using the dnstap source and have had no framing issues up to 200K events/s - the problem we have is that we've not been able to find any vector sink that can sink the data at anywhere near that rate, except blackhole! #20739

So what we get is back-pressure at the sink, which causes dropped frames at the dnsdist end of the dnstap socket. dnsdist will deliberately drop frames instead of blocking, as blocking would disrupt the DNS service. But it doesn't cause framing issues.

dnsdist only makes a single dnstap connection to vector for all its threads - whereas PowerDNS Recursor makes one dnstap connection per thread - obviously the latter makes scaling much easier!

Our servers have plenty of spare capacity and none of the vector threads are over 20% CPU, so it's really not clear where the bottleneck is - but that's a different problem!
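(Side note, not something stated in this thread: whether that back-pressure reaches the dnstap socket or events get shed inside Vector depends on the sink's buffer settings. A hedged sketch against a kafka sink - the component names, servers, topic and sizes are illustrative:)

sinks:
  kafka_out:
    type: "kafka"
    inputs: ["main_dnstap_1"]
    bootstrap_servers: "kafka1:9092"
    topic: "dnstap-events"
    encoding:
      codec: "json"
    buffer:
      type: "memory"
      max_events: 100000
      # "block" (the default) pushes back on the upstream source socket when the buffer fills;
      # "drop_newest" sheds events inside Vector instead.
      when_full: "block"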

@esensar
Contributor

esensar commented Jul 3, 2024

Thanks @james-stevens. Do you have a way of testing dnstap in unix mode? Just to help pinpoint the issue. If it only occurs in tcp mode then this and #20739 might be related.
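(For reference, a unix-mode variant of the source config from the issue body would look roughly like this, going by the dnstap source options as I understand them; the component name and socket path are illustrative:)

  main_dnstap_unix:
    type: "dnstap"
    mode: unix
    socket_path: "/run/vector/dnstap.sock"
    lowercase_hostnames: true

dnsdist would then point its dnstap logger at that socket path rather than at the TCP address.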

@johnhtodd
Author

Another point I haven't mentioned: the number of errors, once started, seems remarkably "flat" despite the number of events being sent to the DNSTAP source fluctuating quite a bit, at least according to what is recorded by the Prometheus metrics in Vector. We shift from 16k to 26k events per second being sent by dnsdist to the DNSTAP consumer, but the error rate shown in the graphs above stays almost exactly at 18.2k errors per second - the values are less important than the fact that there is no variation in the error rate despite variation in the dnstap message transmission rate. Just as weird, the rate of events per second downstream from the dnstap ingestion remap fluctuates normally with volume, so the rest of the system does not "see" these errors that are showing up in the graph of vector_component_errors_total.

@james-stevens

What metric(s) are you looking at in that graph?

We're not getting the metric vector_component_errors_total, which I suspect is because we have expire_metrics_secs set to five minutes - so if we're seeing no errors that metric could just get removed.

What I check is that vector_buffer_sent_events_total matches what I expect for the appropriate sink.

@johnhtodd
Author

Yes, the metric is "vector_component_errors_total". Our expire_metrics_secs is set to 22265 (yes, very long, because we have metric aggregations that run for >6 hours in some cases).
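(For anyone comparing the two setups: expire_metrics_secs is a global, top-level option, so the values mentioned in this thread would be set roughly like this - a sketch, not copied from either config:)

# global scope, not per-component
expire_metrics_secs: 300      # ~five minutes, as in james-stevens' setup
# expire_metrics_secs: 22265  # the long expiry used here, to keep long-running aggregations alive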

@james-stevens

I can only see errors related to scraping metrics when under load - we use vector to aggregate all the metrics on the server so Prometheus doesn't need to be reconfigured every time we change things around.

We also had a few "disk full" incidents when running 300Kq/s tests for five minutes. 350MB/s of disk writes adds up fast.

The way we look for dropped frames is to get dnsdist to log them - it's one of the items it logs.

We then wrote some Lua to parse the log line and convert it into three new dnsdist metrics - dnstap frames sent, dropped and failed. No idea what the difference is between those last two, but we don't see any failed, just dropped when the load gets too high.
