
Vector panics when Loki timestamp nanoseconds are unparseable #20763

Closed
jandersen-plaid opened this issue Jul 1, 2024 · 2 comments · Fixed by #20780
Labels: sink: loki, type: bug

Comments

@jandersen-plaid

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We currently run Vector in the recommended configuration, with the daemon sending logs to the aggregator, which then forwards them on to Loki.

Recently, we experienced an issue where a log was malformed in a way that made the timestamp unparseable by chrono. This log was stored in the buffer for sending to Loki, so on every attempt to send data to Loki, Vector would run into the corrupted log and panic. Specifically, this happens here:

let timestamp = match event.as_log().get_timestamp() {
    Some(Value::Timestamp(ts)) => ts.timestamp_nanos_opt().expect("Timestamp out of range"),
    _ => chrono::Utc::now()
        .timestamp_nanos_opt()
        .expect("Timestamp out of range"),
};
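For context, chrono's timestamp_nanos_opt() returns None whenever the instant cannot be represented as an i64 count of nanoseconds since the Unix epoch (roughly the years 1677 through 2262), and the .expect() above turns that None into a panic. A minimal, self-contained sketch of that behavior (assuming chrono 0.4.31 or later, where timestamp_nanos_opt is available):

use chrono::{TimeZone, Utc};

fn main() {
    // In range: the instant fits in an i64 nanosecond count.
    let ok = Utc.with_ymd_and_hms(2024, 7, 1, 13, 23, 41).unwrap();
    assert!(ok.timestamp_nanos_opt().is_some());

    // Out of range: i64 nanoseconds overflow past 2262-04-11, so this
    // yields None -- exactly the case the .expect() panics on.
    let too_far = Utc.with_ymd_and_hms(2300, 1, 1, 0, 0, 0).unwrap();
    assert!(too_far.timestamp_nanos_opt().is_none());
}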

This location was derived from the following logs:

2024-07-01T13:23:41.529426Z  INFO vector::app: Log level is enabled. level="info"
2024-07-01T13:23:41.530214Z  INFO vector::app: Loading configs. paths=["/etc/vector"]
2024-07-01T13:23:41.532893Z  WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="vector_internal_logs" sink="loki"
2024-07-01T13:23:41.957917Z  INFO vector::topology::running: Running healthchecks.
2024-07-01T13:23:41.958031Z  INFO vector: Vector has started. debug="false" version="0.38.0" arch="x86_64" revision="ea0ec6f 2024-05-07 14:34:39.794027186"
2024-07-01T13:23:41.958146Z  INFO source{component_kind="source" component_id=vector_agents component_type=vector}: vector::sources::util::grpc: Building gRPC server. address=0.0.0.0:6000
2024-07-01T13:23:41.958214Z  INFO vector::topology::builder: Healthcheck passed.
2024-07-01T13:23:41.958337Z  INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
thread 'vector-worker' panicked at src/sinks/loki/sink.rs:256:68:
Timestamp out of range
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-07-01T13:23:41.959052Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off graphql=http://127.0.0.1:8686/graphql
2024-07-01T13:23:41.959100Z ERROR sink{component_kind="sink" component_id=loki component_type=loki}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2024-07-01T13:23:41.959141Z  INFO vector: Vector has stopped.

The expected behavior here is to log that the timestamp is unparseable rather than panic, and then return None so that the event is filtered out by the surrounding filter_map and never sent on to Loki.
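As a sketch only (the surrounding filter_map closure and the warn! call are assumptions about the sink's structure, not the actual Vector code), the non-panicking variant could look like:

// Inside the closure passed to filter_map; returning None drops the event.
let timestamp = match event.as_log().get_timestamp() {
    Some(Value::Timestamp(ts)) => match ts.timestamp_nanos_opt() {
        Some(ns) => ns,
        None => {
            warn!(message = "Event timestamp out of range; dropping event.");
            return None;
        }
    },
    // The current wall-clock time is always representable, so expect() is
    // safe on this fallback branch.
    _ => chrono::Utc::now()
        .timestamp_nanos_opt()
        .expect("current time is in range"),
};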

As an aside, I couldn't find any information about manually inspecting the buffer database. I am curious which log was generated that is actually unparseable -- it would be good to fix that within our own systems so that Vector doesn't have to panic or log a warning/error.

Mistakenly posted this before on the wrong user :|

Configuration

I can provide the config if needed, but I don't think it is relevant here.

Version

vector 0.38.0 (x86_64-unknown-linux-gnu ea0ec6f 2024-05-07 14:34:39.794027186)

Debug Output

No response

Example Data

No response

Additional Context

This is Vector running in Kubernetes -- happy to grab additional info or debug logs; it will just take me a little bit.

I am also happy to put up a PR, as the change should be relatively straightforward: return None (after logging) from the function instead of panicking.

References

No response

@jandersen-plaid added the type: bug label on Jul 1, 2024
@suikammd (Contributor) commented on Jul 2, 2024

I hit the same error; maybe we should skip this event?
suikammd@ab38a76

@jszwedko added the sink: loki label on Jul 2, 2024
@jszwedko (Member) commented on Jul 2, 2024

Thanks for opening this, @jandersen-plaid!

Your patch looks roughly reasonable to me, @suikammd. Would you want to open it as a PR?
