
Vector panics when Loki timestamp nanoseconds are unparseable #20763

Closed
jandersen-plaid opened this issue Jul 1, 2024 · 2 comments · Fixed by #20780
Labels: sink: loki, type: bug

Comments

@jandersen-plaid

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We currently run Vector in the recommended configuration, with the daemon sending logs to the aggregator, which then forwards them on to Loki.

Recently, we experienced an issue where a log was malformed in a way that made the timestamp unparseable by chrono. This log was stored in the buffer for sending to Loki, so on every attempt to send data to Loki, Vector would run into the corrupted log and panic. Specifically, this happens here:

let timestamp = match event.as_log().get_timestamp() {
    Some(Value::Timestamp(ts)) => ts.timestamp_nanos_opt().expect("Timestamp out of range"),
    _ => chrono::Utc::now()
        .timestamp_nanos_opt()
        .expect("Timestamp out of range"),
};
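For context, chrono's timestamp_nanos_opt() returns None whenever the instant cannot be represented as an i64 count of nanoseconds since the Unix epoch (roughly the years 1677 through 2262), and the .expect() above turns that None into a panic. A minimal, self-contained sketch of that behavior (assuming chrono 0.4.31 or later, where timestamp_nanos_opt is available):

use chrono::{TimeZone, Utc};

fn main() {
    // In range: the instant fits in an i64 nanosecond count.
    let ok = Utc.with_ymd_and_hms(2024, 7, 1, 13, 23, 41).unwrap();
    assert!(ok.timestamp_nanos_opt().is_some());

    // Out of range: i64 nanoseconds overflow past 2262-04-11, so this
    // yields None -- exactly the case the .expect() panics on.
    let too_far = Utc.with_ymd_and_hms(2300, 1, 1, 0, 0, 0).unwrap();
    assert!(too_far.timestamp_nanos_opt().is_none());
}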

This location was derived from the following logs:

2024-07-01T13:23:41.529426Z  INFO vector::app: Log level is enabled. level="info"
2024-07-01T13:23:41.530214Z  INFO vector::app: Loading configs. paths=["/etc/vector"]
2024-07-01T13:23:41.532893Z  WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="vector_internal_logs" sink="loki"
2024-07-01T13:23:41.957917Z  INFO vector::topology::running: Running healthchecks.
2024-07-01T13:23:41.958031Z  INFO vector: Vector has started. debug="false" version="0.38.0" arch="x86_64" revision="ea0ec6f 2024-05-07 14:34:39.794027186"
2024-07-01T13:23:41.958146Z  INFO source{component_kind="source" component_id=vector_agents component_type=vector}: vector::sources::util::grpc: Building gRPC server. address=0.0.0.0:6000
2024-07-01T13:23:41.958214Z  INFO vector::topology::builder: Healthcheck passed.
2024-07-01T13:23:41.958337Z  INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
thread 'vector-worker' panicked at src/sinks/loki/sink.rs:256:68:
Timestamp out of range
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-07-01T13:23:41.959052Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off graphql=http://127.0.0.1:8686/graphql
2024-07-01T13:23:41.959100Z ERROR sink{component_kind="sink" component_id=loki component_type=loki}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2024-07-01T13:23:41.959141Z  INFO vector: Vector has stopped.

The expected behavior here is to log that the timestamp is unparseable rather than panic, and then return None so that the event is filtered out by the surrounding filter_map and never sent on to Loki.
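As a sketch only (the surrounding filter_map closure and the warn! call are assumptions about the sink's structure, not the actual Vector code), the non-panicking variant could look like:

// Inside the closure passed to filter_map; returning None drops the event.
let timestamp = match event.as_log().get_timestamp() {
    Some(Value::Timestamp(ts)) => match ts.timestamp_nanos_opt() {
        Some(ns) => ns,
        None => {
            warn!(message = "Event timestamp out of range; dropping event.");
            return None;
        }
    },
    // The current wall-clock time is always representable, so expect() is
    // safe on this fallback branch.
    _ => chrono::Utc::now()
        .timestamp_nanos_opt()
        .expect("current time is in range"),
};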

As an aside, I couldn't find any information about manually inspecting the buffer database. I am curious which log was generated that is actually unparseable -- it would be good to fix that within our own systems so that Vector doesn't have to panic or log a warning/error.

Mistakenly posted this before on the wrong user :|

Configuration

I can provide the config if needed, but I don't think it is relevant here.

Version

vector 0.38.0 (x86_64-unknown-linux-gnu ea0ec6f 2024-05-07 14:34:39.794027186)

Debug Output

No response

Example Data

No response

Additional Context

This is Vector running in Kubernetes -- happy to grab additional info or debug logs; it will just take me a little bit.

I am also happy to put up a PR, as the change should be relatively straightforward: return None (after logging) from the function instead of panicking.

References

No response

@jandersen-plaid added the type: bug label on Jul 1, 2024
@suikammd (Contributor) commented on Jul 2, 2024

I hit the same error; maybe we should skip this event?
suikammd@ab38a76

@jszwedko added the sink: loki label on Jul 2, 2024
@jszwedko (Member) commented on Jul 2, 2024

Thanks for opening this, @jandersen-plaid!

Your patch looks roughly reasonable to me, @suikammd. Would you want to open it as a PR?
