
[out_splunk] SIGSEGV with specific chunks #8993

Open

kiyutink opened this issue Jun 21, 2024 · 2 comments

kiyutink commented Jun 21, 2024

Bug Report

Describe the bug

We're running fluent-bit with a forward input and filesystem buffering. Periodically a chunk lands on the filesystem that crashes fluent-bit when it is read, so every restart crashes again as fluent-bit tries to replay that chunk from the backlog. We haven't been able to reproduce the issue reliably (other than by loading the faulty chunk, which crashes every time). Unfortunately we can't share the chunk because it contains customer data.

Here's the stack trace we're seeing upon the crash:

[2024/06/21 10:16:25] [engine] caught signal (SIGSEGV)
#0  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#1  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#2  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#3  0x55ae116e4ad3      in  msgpack2json() at src/flb_pack.c:731
#4  0x55ae116e533a      in  flb_msgpack_to_json() at src/flb_pack.c:768
#5  0x55ae116e5457      in  flb_msgpack_raw_to_json_sds() at src/flb_pack.c:808
#6  0x55ae117e5bc3      in  splunk_format() at plugins/out_splunk/splunk.c:500
#7  0x55ae117e6424      in  cb_splunk_flush() at plugins/out_splunk/splunk.c:658
#8  0x55ae11c03ae6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0xffffffffffffffff  in  ???() at ???:0
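
The repeated msgpack2json() frames look like the recursive descent a msgpack-to-JSON converter does over nested maps and arrays, so a corrupted record (bad sizes, bad pointers, or unexpected nesting) can fault anywhere in that walk. For illustration only, here is roughly what that recursion looks like with the plain msgpack-c API; this is a sketch, not fluent-bit's actual code:

#include <stdio.h>
#include <msgpack.h>

/* Sketch only: a recursive walk over a msgpack object, the same shape of
 * traversal a msgpack-to-JSON converter performs. Every nested map or array
 * adds a stack frame and every element is dereferenced, so corrupt sizes,
 * corrupt pointers, or very deep nesting can crash anywhere in the descent. */
static void walk(const msgpack_object *o, int depth)
{
    uint32_t i;

    switch (o->type) {
    case MSGPACK_OBJECT_MAP:
        for (i = 0; i < o->via.map.size; i++) {
            walk(&o->via.map.ptr[i].key, depth + 1);
            walk(&o->via.map.ptr[i].val, depth + 1);
        }
        break;
    case MSGPACK_OBJECT_ARRAY:
        for (i = 0; i < o->via.array.size; i++) {
            walk(&o->via.array.ptr[i], depth + 1);
        }
        break;
    default:
        /* scalars end the recursion; depth equals the nesting level */
        printf("scalar at depth %d\n", depth);
        break;
    }
}

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pk;
    msgpack_unpacked result;
    size_t off = 0;
    int i;

    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);

    /* pack a nested array [[[ ... [42] ... ]]] to exercise the walk */
    for (i = 0; i < 20; i++) {
        msgpack_pack_array(&pk, 1);
    }
    msgpack_pack_int(&pk, 42);

    msgpack_unpacked_init(&result);
    if (msgpack_unpack_next(&result, sbuf.data, sbuf.size, &off) ==
        MSGPACK_UNPACK_SUCCESS) {
        walk(&result.data, 0);
    }

    msgpack_unpacked_destroy(&result);
    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}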

Your Environment

  • Version used: 3.0.6
  • Configuration:
    [SERVICE]
        HTTP_Server On
        Health_Check On

        Storage.max_chunks_up 512
        Storage.backlog.mem_limit 100M

        Storage.path /var/log/flb-storage/

        Storage.sync normal
        Storage.metrics On

    [INPUT]
        Name Forward
        Storage.type filesystem

    [OUTPUT]
        Name Splunk
        Match *
        Host <our host>
        Port 443
        Splunk_Token ${SPLUNK_TOKEN}
        TLS On
        TLS.Verify On
        Event_index <index>
        Event_sourcetype fluentd

        Retry_Limit False
        Storage.total_limit_size 10GB

  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes, EKS v1.28 (but we can also reproduce locally if we offload the same chunk)
  • Server type and version:
  • Operating System and version:
  • Filters and plugins:

Additional context

We tried adjusting the chunk to pin down which records in it trigger the crash, but every manipulation we attempted left the chunk unreadable.
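
One thing we may try next is dumping the chunk's msgpack payload record by record to find the offset where decoding goes wrong. The sketch below uses plain msgpack-c and is untested; the payload offset is passed on the command line because we haven't mapped out the exact chunkio header layout that precedes the msgpack data:

#include <stdio.h>
#include <stdlib.h>
#include <msgpack.h>

/* Rough, untested sketch: skip <payload-offset> bytes of chunk header, then
 * unpack msgpack records one by one and report how far decoding gets. */
int main(int argc, char **argv)
{
    FILE *fp;
    long file_size;
    size_t skip, len, off = 0, last_ok = 0;
    char *buf;
    int records = 0;
    msgpack_unpacked result;
    msgpack_unpack_return ret;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <chunk-file> <payload-offset>\n", argv[0]);
        return 1;
    }
    skip = (size_t) strtoul(argv[2], NULL, 10);

    fp = fopen(argv[1], "rb");
    if (!fp) {
        perror("fopen");
        return 1;
    }
    fseek(fp, 0, SEEK_END);
    file_size = ftell(fp);
    if (skip >= (size_t) file_size) {
        fprintf(stderr, "offset past end of file\n");
        fclose(fp);
        return 1;
    }
    len = (size_t) file_size - skip;
    fseek(fp, (long) skip, SEEK_SET);

    buf = malloc(len);
    if (fread(buf, 1, len, fp) != len) {
        perror("fread");
        fclose(fp);
        free(buf);
        return 1;
    }
    fclose(fp);

    msgpack_unpacked_init(&result);
    while ((ret = msgpack_unpack_next(&result, buf, len, &off)) ==
           MSGPACK_UNPACK_SUCCESS) {
        records++;
        last_ok = off;
    }
    printf("decoded %d records, stopped at payload offset %zu (ret=%d)\n",
           records, last_ok, (int) ret);

    msgpack_unpacked_destroy(&result);
    free(buf);
    return 0;
}

If that ends with ret=0 (MSGPACK_UNPACK_CONTINUE) at the end of the buffer, the msgpack framing itself is intact and the problem is in a record's contents; ret=-1 (MSGPACK_UNPACK_PARSE_ERROR) would point at corruption near the reported offset.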

Please advise what else we could try in order to help get this solved.

We have a strong suspicion that this appeared after we upgraded to v3. The only somewhat relevant code change we found is here: https://github.com/fluent/fluent-bit/pull/8589/files

@patrick-stephens (Contributor)

Can you try the latest 3.0.7?

@kiyutink (Author)

> Can you try the latest 3.0.7?

  • Same behavior in 3.0.7
  • Same behavior in 3.0.0
  • Doesn't crash in 2.2.3
