Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exometer_report_statsd becomes a CPU and memory hog when system time jumps forward #35

Open
alco opened this issue Jun 29, 2015 · 7 comments
Labels

Comments

@alco
Copy link
Contributor

alco commented Jun 29, 2015

We have a few reporters configured with the interval of 1000 ms using exometer_reporter_statsd. Whenever the system time jumps forward, the exometer_report_statsd can't keep up with the amount of incoming {exometer_report, ...} messages and, given enough metrics, the memory used by the VM will keep growing indefinitely.

The issue originates from this line where the second argument becomes negative after a forward time jump. The underlying issue is that the internal timestamp exometer_report is keeping can take a long time to catch up with the system's time because it is only increasing in multiples of the specified report interval here.

So if we have the report interval of 1000 ms and the time jumps forward by one hour, it will generate at least 3600 reports with interval 0 until it catches up. In a real app where we have quite a few metrics and we haven't seen it catch up at all when using exometer_report_statsd: the message queue would grow continuously meanwhile the Erlang process would be eating 200% CPU. When I log the value of Time - tdiff(T1, T0) to the console, it goes down a bit and then jumps back up, so it's kind of stuck in a loop.

Using exometer_report_tty instead does not result in that degenerate behaviour and it would catch up rather quickly back to the current time.

We would expect the exometer_report_statsd reporter to only send a single report in the case of the time jump because there is no useful info that can be collected between two reports which have interval 0 between them.

@uwiger
Copy link
Member

uwiger commented Jun 29, 2015

I will take a look. The purpose of the adjust_interval/2 function obviously isn't to reduce the interval to zero, but to compensate for drift due to scheduling delays etc. Adapting to clock changes probably requires also adjusting T0, which represents the calculated time when the report was supposed to be issued.

@alco
Copy link
Contributor Author

alco commented Sep 25, 2015

Hi. Any updates on this front?

@uwiger
Copy link
Member

uwiger commented Sep 25, 2015

No, apologies.

I noticed that I had an un-pushed branch where I had made some simple corrections to clock changes. I just pushed and created a PR (#44), but will say honestly that I haven't tested except verify that it doesn't break the (unmodified) test suite.

@alco
Copy link
Contributor Author

alco commented Oct 5, 2015

Hi @uwiger. I have tried your branch and it seems to crash the reporter soon after application start.

I've added a few printf inside adjust_interval() like this:

adjust_interval(Time, T0) ->
    io:format("adjust_interval(~p, ~p)~n", [Time, T0]),
    T1 = os:timestamp(),
    case tdiff(T1, T0) of
        D when D > Time ->
            io:format("D = ~p, T1 = ~p~n", [D, T1]),
            %% Most likely due to clock adjustment
            {Time, T1};
        D ->
            io:format("D = ~p, T0 = ~p~n", [D, T0]),
            {D, T0}
    end.

and here's the output I see when starting my app (the error report is using Elixir syntax):

adjust_interval(1000, {1444,47160,1116602})
D = 1, T0 = {1444,47160,1116602}
adjust_interval(1000, {1444,47160,2116602})
D = -996, T0 = {1444,47160,2116602}

15:12:41.136 [error] GenServer :exometer_report terminating
** (ArgumentError) argument error
    :erlang.send_after(-996, #PID<0.961.0>, {:report_batch, :exometer_report_statsd, :process_report_interval, 1000, {1444, 47160, 2116602}})
    (exometer_core) src/exometer_report.erl:1218: :exometer_report.restart_batch_timer/3
    (exometer_core) src/exometer_report.erl:1060: :exometer_report.handle_info/2
    (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:681: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: {:report_batch, :exometer_report_statsd, :process_report_interval, 1000, {1444, 47160, 1116602}}
State: {:st, [], []}
pid=<0.961.0>

@tolbrino
Copy link
Contributor

tolbrino commented Dec 4, 2015

@alco Could you create a test case which reproduces that error?

@tolbrino tolbrino added the bug label Dec 29, 2015
@chemeris
Copy link

Any updates on this?
I see the same issue with https://github.com/fairwaves/exometer_influxdb
If I send my laptop to sleep while an app is running, Exometer starts generating tons of reports to catch up when I wake up my laptop.

@chemeris
Copy link

chemeris commented Aug 21, 2016

And I can also confirm, that the change from @uwiger crashes with this error right after start before any metrics are reported (I've merged the change onto the latest master).

23:09:37.348 [error] GenServer :exometer_report terminating
** (ArgumentError) argument error
    :erlang.send_after(-998, #PID<0.484.0>, {:report_batch, :exometer_report_lager, :global_report_interval, 1000, {1471, 810176, 2342341}})
    (exometer_core) src/exometer_report.erl:1244: :exometer_report.restart_batch_timer/3
    (exometer_core) src/exometer_report.erl:1086: :exometer_report.handle_info/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:667: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:report_batch, :exometer_report_lager, :global_report_interval, 1000, {1471, 810176, 1342341}}
State: {:st, [], []}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants