
[out_splunk] SIGSEGV with specific chunks #8993

Open

kiyutink opened this issue Jun 21, 2024 · 2 comments

kiyutink commented Jun 21, 2024

Bug Report

Describe the bug

We're running fluent-bit with a forward input and filesystem buffering. Periodically a chunk lands on the filesystem that crashes fluent-bit when it is read, so every restart crashes again as fluent-bit tries to replay that chunk from the backlog. We haven't been able to reproduce the issue reliably (other than by loading the faulty chunk, which crashes every time). Unfortunately we can't share the chunk because it contains customer data.

Here's the stack trace we're seeing upon the crash:

[2024/06/21 10:16:25] [engine] caught signal (SIGSEGV)
#0  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#1  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#2  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#3  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#4  0x55ae116e533a      in  flb_msgpack_to_json() at src/flb_pack.c:768
#5  0x55ae116e5457      in  flb_msgpack_raw_to_json_sds() at src/flb_pack.c:808
#6  0x55ae117e5bc3      in  splunk_format() at plugins/out_splunk/splunk.c:500
#7  0x55ae117e6424      in  cb_splunk_flush() at plugins/out_splunk/splunk.c:658
#8  0x55ae11c03ae6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0xffffffffffffffff  in  ???() at ???:0
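
The repeated msgpack2json() frames look like the recursive descent a msgpack-to-JSON converter does over nested maps and arrays, so a corrupted record (bad sizes, bad pointers, or unexpected nesting) can fault anywhere in that walk. For illustration only, here is roughly what that recursion looks like with the plain msgpack-c API; this is a sketch, not fluent-bit's actual code:

#include <stdio.h>
#include <msgpack.h>

/* Sketch only: a recursive walk over a msgpack object, the same shape of
 * traversal a msgpack-to-JSON converter performs. Every nested map or array
 * adds a stack frame and every element is dereferenced, so corrupt sizes,
 * corrupt pointers, or very deep nesting can crash anywhere in the descent. */
static void walk(const msgpack_object *o, int depth)
{
    uint32_t i;

    switch (o->type) {
    case MSGPACK_OBJECT_MAP:
        for (i = 0; i < o->via.map.size; i++) {
            walk(&o->via.map.ptr[i].key, depth + 1);
            walk(&o->via.map.ptr[i].val, depth + 1);
        }
        break;
    case MSGPACK_OBJECT_ARRAY:
        for (i = 0; i < o->via.array.size; i++) {
            walk(&o->via.array.ptr[i], depth + 1);
        }
        break;
    default:
        /* scalars end the recursion; depth equals the nesting level */
        printf("scalar at depth %d\n", depth);
        break;
    }
}

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pk;
    msgpack_unpacked result;
    size_t off = 0;
    int i;

    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);

    /* pack a nested array [[[ ... [42] ... ]]] to exercise the walk */
    for (i = 0; i < 20; i++) {
        msgpack_pack_array(&pk, 1);
    }
    msgpack_pack_int(&pk, 42);

    msgpack_unpacked_init(&result);
    if (msgpack_unpack_next(&result, sbuf.data, sbuf.size, &off) ==
        MSGPACK_UNPACK_SUCCESS) {
        walk(&result.data, 0);
    }

    msgpack_unpacked_destroy(&result);
    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}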

Your Environment

  • Version used: 3.0.6
  • Configuration:
    [SERVICE]
        HTTP_Server On
        Health_Check On

        Storage.max_chunks_up 512
        Storage.backlog.mem_limit 100M

        Storage.path /var/log/flb-storage/

        Storage.sync normal
        Storage.metrics On

    [INPUT]
        Name Forward
        Storage.type filesystem

    [OUTPUT]
        Name Splunk
        Match *
        Host <our host>
        Port 443
        Splunk_Token ${SPLUNK_TOKEN}
        TLS On
        TLS.Verify On
        Event_index <index>
        Event_sourcetype fluentd

        Retry_Limit False
        Storage.total_limit_size 10GB

  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes, EKS v1.28 (but we can also reproduce locally if we offload the same chunk)
  • Server type and version:
  • Operating System and version:
  • Filters and plugins:

Additional context

We tried adjusting the chunk to pin down which records in it trigger the crash, but every manipulation we attempted left the chunk unreadable.
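
One thing we may try next is dumping the chunk's msgpack payload record by record to find the offset where decoding goes wrong. The sketch below uses plain msgpack-c and is untested; the payload offset is passed on the command line because we haven't mapped out the exact chunkio header layout that precedes the msgpack data:

#include <stdio.h>
#include <stdlib.h>
#include <msgpack.h>

/* Rough, untested sketch: skip <payload-offset> bytes of chunk header, then
 * unpack msgpack records one by one and report how far decoding gets. */
int main(int argc, char **argv)
{
    FILE *fp;
    long file_size;
    size_t skip, len, off = 0, last_ok = 0;
    char *buf;
    int records = 0;
    msgpack_unpacked result;
    msgpack_unpack_return ret;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <chunk-file> <payload-offset>\n", argv[0]);
        return 1;
    }
    skip = (size_t) strtoul(argv[2], NULL, 10);

    fp = fopen(argv[1], "rb");
    if (!fp) {
        perror("fopen");
        return 1;
    }
    fseek(fp, 0, SEEK_END);
    file_size = ftell(fp);
    if (skip >= (size_t) file_size) {
        fprintf(stderr, "offset past end of file\n");
        fclose(fp);
        return 1;
    }
    len = (size_t) file_size - skip;
    fseek(fp, (long) skip, SEEK_SET);

    buf = malloc(len);
    if (fread(buf, 1, len, fp) != len) {
        perror("fread");
        fclose(fp);
        free(buf);
        return 1;
    }
    fclose(fp);

    msgpack_unpacked_init(&result);
    while ((ret = msgpack_unpack_next(&result, buf, len, &off)) ==
           MSGPACK_UNPACK_SUCCESS) {
        records++;
        last_ok = off;
    }
    printf("decoded %d records, stopped at payload offset %zu (ret=%d)\n",
           records, last_ok, (int) ret);

    msgpack_unpacked_destroy(&result);
    free(buf);
    return 0;
}

If that ends with ret=0 (MSGPACK_UNPACK_CONTINUE) at the end of the buffer, the msgpack framing itself is intact and the problem is in a record's contents; ret=-1 (MSGPACK_UNPACK_PARSE_ERROR) would point at corruption near the reported offset.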

Please advise what else we could try in order to help get this solved.

We have a strong suspicion that this appeared after we upgraded to v3. The only somewhat relevant code change we found is here: https://github.com/fluent/fluent-bit/pull/8589/files

@patrick-stephens (Contributor)

Can you try the latest 3.0.7?

@kiyutink (Author)

> Can you try the latest 3.0.7?

  • Same behavior in 3.0.7
  • Same behavior in 3.0.0
  • Doesn't crash in 2.2.3
