
Question: statsrelay dropping packets? #25

Open
cjagus opened this issue Jan 24, 2020 · 7 comments

Comments


cjagus commented Jan 24, 2020

I have started testing statsrelay in our environment by using the statsd repeater. One thing I noticed is a difference in the metrics received in Graphite vs the statsd proxy.

[image: graph of metrics received]
statsd repeater -> statsrelay -> statsd -> graphite [currently using one statsrelay and one statsd]

And the difference is huge when we add more statsd backends.
The same graphs work fine when I replace statsrelay with statsd [statsd repeater -> statsd -> graphite].
Any thoughts on this, @jjneely @szibis?

jjneely (Owner) commented Jan 24, 2020

What exactly are you measuring here? Make sure you are counting received metrics and not received packets. Some implementations don't make that distinction, and statsrelay tries to pack UDP packets as much as it can.
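
To make the distinction concrete: a single relayed UDP datagram can carry several newline-separated statsd lines, so one received packet may account for many received metrics. A hypothetical payload (metric names made up for illustration):

app.webapp.i-0abc.apigateway.2xx:1|c
app.webapp.i-0abc.apigateway.2xx:1|c
app.webapp.i-0abc.apigateway.latency:42|ms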

There are, frankly, a lot of ways we could be leaking UDP packets. Remember, UDP doesn't guarantee delivery, and the StatsD design aims to collect a statistically significant sample of the data points rather than accounting for each and every metric end to end.

One of the reasons I wrote this was because the Node implementation of Etsy's StatsD is really quite prone to dropping packets. You might want to look at running an implementation that's, uhh, more robust, like Statsite.

https://github.com/statsite/statsite

OK, let's figure out where you are dropping packets. Look at /proc/net/udp or /proc/net/udp6 on each of the machines in your setup. You'll see a row for each open UDP port the kernel has set up and is listening on. One column is drops, which counts the number of packets the kernel ring buffer has dropped because the application (statsd/statsrelay/etc.) wasn't able to read off the ring buffer fast enough to keep up with incoming traffic. That will most likely identify where the leaking is in your stack. Fixing it is then a matter of tuning wherever the leak is.
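
If you'd rather not eyeball the raw tables, here's a minimal sketch (assuming the standard /proc/net/udp layout, where drops is the last column) that prints any listening UDP socket with a non-zero drop count:

# udp_drops.py - list local UDP ports with non-zero kernel drop counts
for path in ("/proc/net/udp", "/proc/net/udp6"):
    with open(path) as f:
        next(f)  # skip the header row
        for line in f:
            fields = line.split()
            # fields[1] is the local address as hex "IP:PORT"
            port = int(fields[1].split(":")[1], 16)
            drops = int(fields[-1])  # "drops" is the last column
            if drops:
                print("%s port %d: %d drops" % (path, port, drops))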

cjagus (Author) commented Jan 25, 2020

Thanks for your response. In our env, we have statsd installed on all machines; they aggregate locally and send to the Graphite cluster. Most of the applications are autoscaling, so we don't need per-instance metrics. Currently I'm forwarding a single application's metrics [the application consists of 10-20 EC2 machines] using the statsd repeater, so the throughput is not that high [10k-30k per min].

So if you check this graph for API 2xx [application metrics], there is a difference in the metrics received via statsrelay versus forwarding directly to Graphite.

[image: graph of API 2xx counts]
statsd repeater -> statsrelay -> statsd -> graphite [currently using one statsrelay and one statsd]

And if I stop statsrelay and forward directly [statsd repeater -> statsd -> graphite], I don't see this drift in the graph. I also don't see any drops in /proc/net/udp or /proc/net/udp6 [had increased the sysctl settings before]. @jjneely

jjneely (Owner) commented Jan 25, 2020

I'd agree that the traffic you have here should be low enough to work even in un-tuned environments.

What expressions are you graphing in the Grafana graphs?

How are you running Statsrelay? What's the script, arguments, options, etc that you are giving Statsrelay?

cjagus (Author) commented Jan 27, 2020

Graphite expressions are pretty basic

E.g.: alias(sumSeries(app.webapp.*.timers.apigateway.people__client.store.count), 'graphite')

statsrelay startup script [/etc/init/statsrelay.conf]:

description "Statsrelay"
start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!12345]

limit nofile 1048576 1048576
oom score -1
respawn
respawn limit 10 5

exec /opt/statsd/packages/statsrelay --port 7125 --bind 10.1.10.92 --prefix statsrelay-proxy --sendproto="TCP" \
     127.0.0.1:9125

[tried with the default UDP sendproto as well]

statsd config:

{
  address: "0.0.0.0",
  mgmt_address: "0.0.0.0",
  mgmt_port: "9126",
  dumpMessages: false,
  flushInterval: 60000,
  graphitePort: 2003,
  graphiteHost: "graphite",
  port: "9125",
  server: './servers/tcp',
  backends: [ "./backends/graphite" ],
  prefixStats: "statsd_0",
  deleteCounters: true,
  deleteGauges: true,
  deleteIdleStats: true,
  percentThreshold: [90, 99],
  graphite: {
    legacyNamespace: false,
    globalPrefix: "app.statsd.statsd-1"
  }
}

@jjneely

jjneely (Owner) commented Jan 27, 2020

exec /opt/statsd/packages/statsrelay --port 7125 --bind 10.1.10.92 --prefix statsrelay-proxy --sendproto="TCP" \
     127.0.0.1:9125

StatsD binds to 0.0.0.0, but you are binding statsrelay to a specific IP address. I'm wondering if you are perhaps missing packets from a local version of the application here?
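
If that's what's happening, a quick test (a sketch of the same exec line as above, with only the --bind value changed) would be to have statsrelay listen on all interfaces like StatsD does:

exec /opt/statsd/packages/statsrelay --port 7125 --bind 0.0.0.0 --prefix statsrelay-proxy --sendproto="TCP" \
     127.0.0.1:9125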

alias(sumSeries(app.webapp.*.timers.apigateway.people__client.store.count), 'graphite')

What would be helpful is to look at the metrics reported by StatsD and Statsrelay themselves and see if the daemons are seeing the same number of metrics. That will give us a better idea about where the leaking is happening. There should be a statsrelay.statsProcessed counter that StatsRelay emits, reporting how many statsd metrics/samples it is receiving.

Likewise, StatsD has a similar counter that it generates internally and emits, counting the number of metrics it has seen. (And I forget what the metric name is, it's been so long since I've used Etsy's StatsD.)

These counters over time would be what I would compare to fully understand where the leak is.
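
For example, expressions along these lines would put the two counters on one graph (the paths are illustrative placeholders; the real ones depend on the prefixes you've configured):

alias(sumSeries(<statsrelay prefix>.statsProcessed.count), 'statsrelay')
alias(sumSeries(<statsd prefix>.<metrics-received counter>.count), 'statsd')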

cjagus (Author) commented Jan 29, 2020

Attaching a comparison of statsProcessed.count vs statsd.metrics_received.count:

[image: graph of statsProcessed.count vs statsd.metrics_received.count]

I also changed statsrelay to bind on 0.0.0.0.

jjneely (Owner) commented Jan 29, 2020

CJ,

Those numbers suggest that you are dropping about 0.4% of packets, which is a LOT better than the previous numbers, which suggested around a 10% drop. My usual goal in a very high-throughput StatsD setup was to keep UDP and metric drops below 1%.
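
For reference, that percentage is just the relative difference between the two counters over the same window (assuming statsProcessed counts what the relay received and metrics_received counts what StatsD received):

drop % = (statsProcessed.count - metrics_received.count) / statsProcessed.count * 100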

Have you tried running StatsRelay in verbose mode to see if it is dropping statsd metrics that do not parse correctly?
