
exometer_proc timeout #99
Open
iguberman opened this issue Oct 20, 2017 · 11 comments

Comments

@iguberman

This started happening when we began testing the system under high load; we had never seen the problem before that, and it shows up a few minutes into our high-load test. The problem is that exometer never recovers afterwards and we don't get any stats anymore. I tried significantly increasing this timeout in our fork of exometer_core, but that didn't help.

2017-10-20 17:34:35.715 [error] <0.702.0> CRASH REPORT Process exometer_report with 0 neighbours crashed with reason: timeout in exometer_proc:call/2 line 131
2017-10-20 17:34:35.780 [error] <0.699.0> Supervisor exometer_core_sup had child exometer_report started with exometer_report:start_link() at <0.702.0> exit with reason timeout in exometer_proc:call/2 line 131 in context child_terminated
@uwiger
Member

uwiger commented Oct 21, 2017

Hi Irina!

Odd that it's the exometer_report process that crashes in an exometer_proc:call/2 call – it should never make such a call in the first place. The second problem is of course that it doesn't restore operation after restarting.

Can you reproduce the problem?

@uwiger
Member

uwiger commented Oct 21, 2017

Hmm, I take that back (well, not the "should never" ...). In do_report/2 [handle_info(report_batch, ...)], which is called from the handle_info() clauses for report and report_batch, exometer_report calls exometer:get_value/2, which may end up making a call to a probe process, and that call goes through exometer_proc:call().

Chances are, then, that one of your probes becomes unresponsive. Of course, the exometer_report process should not crash because of that – it should not even be blocked. I'll have to look into that.
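
If it helps narrow things down, a rough diagnostic along these lines could show whether a probe process is the one hanging. This is only a sketch typed from memory: it assumes the probe pid is stored in the metric's ref info item and that exometer:find_entries([]) lists all entries, which is how I recall the API working.

%% Rough diagnostic sketch (not part of exometer_core): for each metric,
%% look up the entry's ref (the probe pid for probe-based metrics, if I
%% remember the internals right) and check whether its mailbox is backing
%% up or it is stuck in some function.
check_probes() ->
    [begin
         Ref = exometer:info(Name, ref),
         case is_pid(Ref) of
             true  -> {Name, erlang:process_info(Ref, [message_queue_len,
                                                       current_function])};
             false -> {Name, no_probe}
         end
     end || {Name, _Type, _Status} <- exometer:find_entries([])].

A probe with a steadily growing message_queue_len around the time the reporter crashes would be a strong hint.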

@iguberman
Author

iguberman commented Oct 21, 2017

Thanks so much Ulf!

That gives me something to look into on my end as well. Unfortunately, I can reproduce it every time I run the high-load test on our server, and pretty soon, right around the time I need to start examining the stats, exometer_proc crashes and the stats disappear.

What I did, however, as a temporary workaround was remove the timeout from exometer_proc completely, so it receives without any timeout (see below), and now it doesn't crash at all (possibly I'm thus sweeping the real cause of the original issue under the rug).

call(Pid, Req) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {exometer_proc, {self(), MRef}, Req},
    receive
        {MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, _, _, Reason} ->
            error(Reason)
%%    after 5000 ->
%%            error(timeout)
    end.
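
For what it's worth, a less drastic variant of the same workaround would be to keep the after clause but make the timeout configurable, defaulting to infinity. This is only a sketch; the proc_call_timeout application env key is made up here for illustration and is not an existing exometer_core setting.

call(Pid, Req) ->
    %% Sketch: read the timeout from app config instead of hard-coding 5000.
    %% 'infinity' reproduces the "no timeout" workaround above.
    %% NOTE: proc_call_timeout is a hypothetical key, not a real setting.
    Timeout = application:get_env(exometer_core, proc_call_timeout, infinity),
    MRef = erlang:monitor(process, Pid),
    Pid ! {exometer_proc, {self(), MRef}, Req},
    receive
        {MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, _, _, Reason} ->
            error(Reason)
    after Timeout ->
            error(timeout)
    end.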

@iguberman
Author

iguberman commented Oct 30, 2017 via email

@ruttenberg

I have observed this problem intermittently. This is an example:

2018 Feb 27 18:02:08 52.23.248.173 erlang: error | gen_server exometer_report terminated with reason: timeout in exometer_proc:call/2 line 131
2018 Feb 27 18:02:13 52.23.248.173 erlang: error | CRASH REPORT Process exometer_report with 0 neighbours exited with reason: timeout in exometer_proc:call/2 line 131 in gen_server:terminate/7 line 812
2018 Feb 27 18:02:14 52.23.248.173 erlang: error | Supervisor exometer_core_sup had child exometer_report started with exometer_report:start_link() at <0.1178.23> exit with reason timeout in exometer_proc:call/2 line 131 in context child_terminated

exometer_report is restarted, and the cycle continues.
This repeats for a while until the VM crashes.

At the point this begins, I notice the following in general VM statistics:
a dramatic drop in GC words reclaimed, IO, and reductions;
a dramatic increase in scheduler utilization and message queue count.

Please let me know what other information might be helpful in solving this problem.

@uwiger
Member

uwiger commented Feb 28, 2018

@ruttenberg, that drop you observe, is it from stats reported by exometer or the actual numbers presented by the VM?

@ruttenberg

That is from stats reported by exometer.

@uwiger
Member

uwiger commented Feb 28, 2018

Ok. I have the flu right now, so I don't think I'll be able to do much about this today, but I will gladly accept a PR that at least catches errors from the calls to exometer:get_value/2. One could also argue that get_value() shouldn't crash in the first place; that would also be an acceptable solution.

As to the problem of the reporting drifting if a particular metric keeps timing out, that is perhaps best handled by noticing it in the analytics backend and simply fixing the probe that keeps hanging.

@ruttenberg

@uwiger
I hope you feel better soon.
I think you overestimate my understanding of the exometer implementation. I don't really understand what you are suggesting for the PR.

Thanks.

--Jon Ruttenberg

@uwiger
Member

uwiger commented Feb 28, 2018

@ruttenberg, for example:

https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1438

The exometer_report process expects exometer:get_value/2 not to fail. Reading the docs, this seems like an appropriate assumption, but exometer_report could still wrap these in a try ... catch here.
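
Something along these lines is what I have in mind; only a sketch, and the function name and return shapes are approximations rather than the actual do_report/2 code:

%% Sketch: shield the reporter from a probe that hangs or crashes while a
%% value is being collected. Names here are illustrative, not the real ones.
safe_get_value(Metric, DataPoints) ->
    try exometer:get_value(Metric, DataPoints) of
        {ok, Values}       -> {ok, Values};
        {error, _} = Error -> Error
    catch
        error:timeout -> {error, timeout};
        Class:Reason  -> {error, {Class, Reason}}
    end.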

More appropriate places to put try ... catch might be (fixing the get_value/2 function):
https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer.erl#L362

and https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_probe.erl#L592 (as well as #L595)
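
Pushed down to that level, the change would have roughly this shape (again only a sketch; the helper name and the unresponsive error value are made up, not what the module actually returns):

%% Sketch: catch the exometer_proc:call/2 timeout where the probe is
%% actually called, so get_value/2 can return an error tuple instead of
%% crashing its caller. The error value here is illustrative only.
try_probe_call(Pid, Req) ->
    try exometer_proc:call(Pid, Req)
    catch
        error:timeout -> {error, unresponsive}
    end.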

@ruttenberg

@uwiger
Thanks. I will take a look at those.

Some additional information:
I have not implemented any probes. The only probe is the built-in histogram.
