
"impossible" results with graylog-plugin-metrics/graphite #7

Open
rbroemeling opened this issue Mar 17, 2016 · 12 comments

@rbroemeling

We have just started using graylog-plugin-metrics to relay metrics to our graphite instance, and we're seeing some extremely curious results, like this (times on the graph are GMT-0600, so 10:00 on the graph is 16:00 GMT):

[graph: request_time vs. upstream_response_time]

The graph is built from graylog's absorption of nginx access.log entries, which have request_time and upstream_response_time fields. Since the upstream response time must be completely contained within the request time, request_time should always be greater than or equal to upstream_response_time (it is not possible for a request to have an upstream_response_time greater than its request_time).

Note that a request can legitimately lack an upstream_response_time (for example, when nginx serves a static file directly), but we've adjusted for that -- the graylog stream we are dealing with restricts events to only those that include an upstream_response_time.

I've validated that within graylog itself the invariant holds and upstream_response_time is always less than request_time:

From 16:07 to 16:08, mean request_time is 193.43 and mean upstream_response_time is 175.78.
From 16:08 to 16:09, mean request_time is 190.33 and mean upstream_response_time is 163.96.
From 16:09 to 16:10, mean request_time is 219.43 and mean upstream_response_time is 183.53.
From 16:10 to 16:11, mean request_time is 215.08 and mean upstream_response_time is 186.41.

... thus, the data in graylog looks good. The request_time and upstream_response_time fields are sent (as histograms, via graylog-plugin-metrics) to graphite, and result in the following graphite data, however:

$ whisper-fetch.py --from=1458230820 --until=1458231060 request_time/mean.wsp 
1458230880  192.900000
1458230940  152.620000
1458231000  221.990000
1458231060  206.160000

$ whisper-fetch.py --from=1458230820 --until=1458231060 upstream_response_time/mean.wsp 
1458230880  171.720000
1458230940  223.010000
1458231000  202.560000
1458231060  191.670000

As you can see, the data for 1458230940 (16:09 GMT) shows a significantly higher mean upstream_response_time than request_time. This should not be possible.
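
For longer ranges, the same check can be scripted. Here is a minimal sketch (it assumes the whisper Python module shipped with graphite is importable, and that the .wsp paths are the same as above):

#!/usr/bin/env python
# Minimal sketch: flag intervals where the upstream_response_time mean
# exceeds the request_time mean. Assumes the whisper module (shipped with
# graphite) is importable and that the .wsp paths below are correct.
import whisper

FROM, UNTIL = 1458230820, 1458231060

def fetch(path):
    (start, end, step), values = whisper.fetch(path, FROM, UNTIL)
    return dict(zip(range(start, end, step), values))

request = fetch('request_time/mean.wsp')
upstream = fetch('upstream_response_time/mean.wsp')

for ts in sorted(set(request) & set(upstream)):
    r, u = request[ts], upstream[ts]
    if r is not None and u is not None and u > r:
        print('%d: upstream_response_time mean %.2f > request_time mean %.2f' % (ts, u, r))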

As the data in graylog appears correct, I can only assume that the metrics plugin is applying a transform to the data as it builds the necessary histogram, and that this is leading to incorrect output being sent to graphite.

Please let me know if there is further debugging that I can do on this front to help aid in diagnosing this issue.

@mariussturm

What is your run_rate set to and what happens when you decrease it?

@rbroemeling

run_rate is currently set to 60 seconds, which aligns with our graphite configuration (one datapoint per minute).

I'll try decreasing it to 30 seconds and report back with the result.

@rbroemeling

After making that change (to a run_rate of 30), I can still see areas where our upstream_response_time is noticeably higher than our request_time. For example:

[graph: run_rate = 30]

I switched to a run_rate of 30 at approximately 12:52 (GMT-06:00) on the graph, and there is a noticeable drift at 13:31-13:33. Would you like me to extract specific details/numbers for the drift that we are seeing with a different run_rate?

@mariussturm

I think this is a sampling problem: the metrics plugin collects events until the run_rate interval is over and then sends the accumulated result to Graphite. So with unlucky timing of events -- say a high value arrives at the very end or beginning of a run_rate interval -- it can skew that interval. A lower run_rate reduces these error situations but never fully eliminates them. You can run more tests with low run_rates like 10s or 5s and see if that helps. But I guess there is no easy fix that fully resolves this issue.
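
A toy illustration of what I mean (this is not the plugin's actual code, just the general idea of bucketing values by arrival time):

# Toy illustration, not the plugin's actual code: values are bucketed into
# run_rate intervals by *arrival time*, so a slow request that lands just
# before or just after a flush skews one interval or the other.
RUN_RATE = 60

def interval_means(events):
    """events: list of (arrival_timestamp, value) pairs."""
    buckets = {}
    for ts, value in events:
        buckets.setdefault(ts - ts % RUN_RATE, []).append(value)
    return {start: sum(vals) / float(len(vals)) for start, vals in buckets.items()}

steady = [(t, 100.0) for t in range(0, 120, 10)]      # steady 100ms requests
print(interval_means(steady + [(59, 2000.0)]))        # slow request counted in interval 0
print(interval_means(steady + [(61, 2000.0)]))        # ... or in interval 60 instead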

@rbroemeling

@mariussturm I thought about that, but I don't see how that could be the problem if we assume that a record is never "half-accounted" before the metrics are sent.

By this I mean: each individual request is in a single GELF event. Assuming that the GELF event is taken as a whole, both the upstream_response_time and request_time metrics will be updated at the same time -- and since for every individual GELF event the foundational rule that "upstream_response_time is always less than or equal to request_time" holds, there shouldn't be any way that an unlucky timing of events can result in statistics that are invalid in this way.

Now, if my underlying assumption is incorrect and the metrics are updated asynchronously, even for a single GELF event, then that would open up a race condition that could allow that problem to occur -- but as long as the metrics are updated at the same time/within the same interval for each individual GELF event, then there shouldn't be any way for event (or reporting) timing to cause this problem.

So the question is whether the metric updates for a single GELF event can span reporting intervals (i.e., for a given GELF event, can upstream_response_time be updated but request_time not yet updated when a report is issued to graphite)? Is that the case?
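
To make the question concrete, here is a toy model of the race I have in mind (purely hypothetical -- I have no idea whether the plugin actually behaves like this):

# Hypothetical model of the suspected race, not the plugin's actual code:
# the two histograms are updated one after the other for each GELF event,
# while a reporter snapshots and resets them independently. If a snapshot
# lands between the two updates for one event, the reported means can invert.
request_time = []              # values accumulated in the current interval
upstream_response_time = []

def report():
    mean = lambda xs: sum(xs) / float(len(xs)) if xs else 0.0
    print('request_time mean=%.1f  upstream_response_time mean=%.1f'
          % (mean(request_time), mean(upstream_response_time)))
    del request_time[:], upstream_response_time[:]

def handle_event(req, upstream, snapshot_between=False):
    upstream_response_time.append(upstream)
    if snapshot_between:
        report()               # the reporter fires between the two updates
    request_time.append(req)

handle_event(200.0, 180.0)
handle_event(2000.0, 1900.0, snapshot_between=True)  # slow request straddles the report
report()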

@mariussturm

This could happen; the plugin runs in every output process, so a timing mixup is possible. You could try setting outputbuffer_processors to 1 in your server config and check whether that makes a difference. (This slows down the server and is not a permanent solution!)

@rbroemeling

Hmm. Ok, thanks -- this is rather disquieting, honestly. I think what we'll do is remove this plugin from use and switch to running elasticsearch queries ("after the fact") and feeding the results directly into Graphite. So, for example, a cron job that runs once per minute, pulls the last minute's data directly out of elasticsearch, and forwards it to graphite. That way we can expect a little more accuracy in the statistics gathered.
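
Roughly along these lines (a rough, untested sketch -- the index pattern, hostnames, metric prefix and field names are placeholders for our environment, and it assumes the requests library is available):

#!/usr/bin/env python
# Rough, untested sketch of the cron-job idea. The index pattern, hostnames
# and metric prefix are placeholders; it assumes the requests library and
# graphite's plaintext protocol on port 2003.
import socket
import time

import requests

ES_URL = 'http://elasticsearch.example.com:9200/graylog_*/_search'
GRAPHITE = ('graphite.example.com', 2003)
FIELDS = ('request_time', 'upstream_response_time')

query = {
    'size': 0,
    'query': {'range': {'timestamp': {'gte': 'now-1m/m', 'lt': 'now/m'}}},
    'aggs': {field: {'avg': {'field': field}} for field in FIELDS},
}
aggs = requests.post(ES_URL, json=query).json()['aggregations']

minute_start = int(time.time()) // 60 * 60 - 60   # start of the previous minute
sock = socket.create_connection(GRAPHITE)
for field in FIELDS:
    value = aggs[field]['value']
    if value is not None:                         # skip minutes with no matching events
        sock.sendall(('nginx.%s.mean %f %d\n' % (field, value, minute_start)).encode())
sock.close()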

I'll close this issue off, as there doesn't seem to be a good working solution. As an aside, though, do you know if any such script already exists (for querying graylog metrics directly out of elasticsearch and forwarding them to graphite)?

@mariussturm

Did you get the same results with a single output process? We can fix this only when we have an understanding of the exact problem.

@rbroemeling

@mariussturm I never tried setting outputbuffer_processors to 1 to see if it would fix the issue; I was under the impression that you didn't recommend it as a "real" fix. I'll try it as a debugging step, in any case, and we'll see what happens.

What about graylog, specifically, is slowed down by outputbuffer_processors = 1?

rbroemeling reopened this Mar 18, 2016
@rbroemeling

@mariussturm I set outputbuffer_processors = 1 (down from the default of 3) at 12:13 mountain time, and from 12:38 through 13:36 mountain time I have 5 instances where upstream_response_time is greater than request_time -- so the problem hasn't been solved by outputbuffer_processors = 1. Is there anything else that you would like me to try?

Update: I am reasonably certain that setting outputbuffer_processors = 1 hasn't helped at all -- from 13:13 through 14:12 we have ~13 instances of upstream_response_time being higher than request_time, sometimes very significantly:

[graph: outputbuffer_processors = 1]

Should I revert the outputbuffer_processors change for performance reasons?

@mariussturm

Ok, thanks for the additional information; yes, you can revert the change. I'll have to try to reproduce the results here to dig deeper into the problem.

@mchtech

mchtech commented Apr 11, 2019

I encountered the same problem on my nginx servers: for about 15% of requests, $upstream_response_time is greater than $request_time by about 1ms-2ms.

https://forum.nginx.org/read.php?2,283679
