exometer_report_statsd becomes a CPU and memory hog when system time jumps forward #35

alco · 2015-06-29T10:28:26Z

We have a few reporters configured with the interval of 1000 ms using exometer_reporter_statsd. Whenever the system time jumps forward, the exometer_report_statsd can't keep up with the amount of incoming {exometer_report, ...} messages and, given enough metrics, the memory used by the VM will keep growing indefinitely.

The issue originates from this line where the second argument becomes negative after a forward time jump. The underlying issue is that the internal timestamp exometer_report is keeping can take a long time to catch up with the system's time because it is only increasing in multiples of the specified report interval here.

So if we have the report interval of 1000 ms and the time jumps forward by one hour, it will generate at least 3600 reports with interval 0 until it catches up. In a real app where we have quite a few metrics and we haven't seen it catch up at all when using exometer_report_statsd: the message queue would grow continuously meanwhile the Erlang process would be eating 200% CPU. When I log the value of Time - tdiff(T1, T0) to the console, it goes down a bit and then jumps back up, so it's kind of stuck in a loop.

Using exometer_report_tty instead does not result in that degenerate behaviour and it would catch up rather quickly back to the current time.

We would expect the exometer_report_statsd reporter to only send a single report in the case of the time jump because there is no useful info that can be collected between two reports which have interval 0 between them.

The text was updated successfully, but these errors were encountered:

uwiger · 2015-06-29T13:39:02Z

I will take a look. The purpose of the adjust_interval/2 function obviously isn't to reduce the interval to zero, but to compensate for drift due to scheduling delays etc. Adapting to clock changes probably requires also adjusting T0, which represents the calculated time when the report was supposed to be issued.

alco · 2015-09-25T13:08:46Z

Hi. Any updates on this front?

uwiger · 2015-09-25T15:30:50Z

No, apologies.

I noticed that I had an un-pushed branch where I had made some simple corrections to clock changes. I just pushed and created a PR (#44), but will say honestly that I haven't tested except verify that it doesn't break the (unmodified) test suite.

alco · 2015-10-05T12:18:48Z

Hi @uwiger. I have tried your branch and it seems to crash the reporter soon after application start.

I've added a few printf inside adjust_interval() like this:

adjust_interval(Time, T0) ->
    io:format("adjust_interval(~p, ~p)~n", [Time, T0]),
    T1 = os:timestamp(),
    case tdiff(T1, T0) of
        D when D > Time ->
            io:format("D = ~p, T1 = ~p~n", [D, T1]),
            %% Most likely due to clock adjustment
            {Time, T1};
        D ->
            io:format("D = ~p, T0 = ~p~n", [D, T0]),
            {D, T0}
    end.

and here's the output I see when starting my app (the error report is using Elixir syntax):

adjust_interval(1000, {1444,47160,1116602})
D = 1, T0 = {1444,47160,1116602}
adjust_interval(1000, {1444,47160,2116602})
D = -996, T0 = {1444,47160,2116602}

15:12:41.136 [error] GenServer :exometer_report terminating
** (ArgumentError) argument error
    :erlang.send_after(-996, #PID<0.961.0>, {:report_batch, :exometer_report_statsd, :process_report_interval, 1000, {1444, 47160, 2116602}})
    (exometer_core) src/exometer_report.erl:1218: :exometer_report.restart_batch_timer/3
    (exometer_core) src/exometer_report.erl:1060: :exometer_report.handle_info/2
    (stdlib) gen_server.erl:615: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:681: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: {:report_batch, :exometer_report_statsd, :process_report_interval, 1000, {1444, 47160, 1116602}}
State: {:st, [], []}
pid=<0.961.0>

tolbrino · 2015-12-04T10:46:48Z

@alco Could you create a test case which reproduces that error?

chemeris · 2016-08-21T19:57:49Z

Any updates on this?
I see the same issue with https://github.com/fairwaves/exometer_influxdb
If I send my laptop to sleep while an app is running, Exometer starts generating tons of reports to catch up when I wake up my laptop.

chemeris · 2016-08-21T20:12:38Z

And I can also confirm, that the change from @uwiger crashes with this error right after start before any metrics are reported (I've merged the change onto the latest master).

23:09:37.348 [error] GenServer :exometer_report terminating
** (ArgumentError) argument error
    :erlang.send_after(-998, #PID<0.484.0>, {:report_batch, :exometer_report_lager, :global_report_interval, 1000, {1471, 810176, 2342341}})
    (exometer_core) src/exometer_report.erl:1244: :exometer_report.restart_batch_timer/3
    (exometer_core) src/exometer_report.erl:1086: :exometer_report.handle_info/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:667: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:report_batch, :exometer_report_lager, :global_report_interval, 1000, {1471, 810176, 1342341}}
State: {:st, [], []}

uwiger mentioned this issue Sep 25, 2015

adjust_interval detects clock changes #44

Merged

tolbrino added the bug label Dec 29, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exometer_report_statsd becomes a CPU and memory hog when system time jumps forward #35

exometer_report_statsd becomes a CPU and memory hog when system time jumps forward #35

alco commented Jun 29, 2015

uwiger commented Jun 29, 2015

alco commented Sep 25, 2015

uwiger commented Sep 25, 2015

alco commented Oct 5, 2015

tolbrino commented Dec 4, 2015

chemeris commented Aug 21, 2016

chemeris commented Aug 21, 2016 •

edited

Loading

exometer_report_statsd becomes a CPU and memory hog when system time jumps forward #35

exometer_report_statsd becomes a CPU and memory hog when system time jumps forward #35

Comments

alco commented Jun 29, 2015

uwiger commented Jun 29, 2015

alco commented Sep 25, 2015

uwiger commented Sep 25, 2015

alco commented Oct 5, 2015

tolbrino commented Dec 4, 2015

chemeris commented Aug 21, 2016

chemeris commented Aug 21, 2016 • edited Loading

chemeris commented Aug 21, 2016 •

edited

Loading