Fluent Bit: [storage layer] Fluent Bit crashes on high contention if corrupted chunks exist

Created on 13 Feb 2020 · 13 comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
It might be related to corrupted files in storage, or simply to a large backlog there. On start, Fluent Bit tries to load all the chunk files and crashes shortly afterwards with a segfault.

To Reproduce

[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581461952.525678621.flb
[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581467643.662417426.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549865.134681941.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549865.134681941.flb
[engine] caught signal (SIGSEGV)
#0  0x5636a8cf8f0e      in  cio_file_st_get_meta_len() at lib/chunkio/include/chunkio/cio_file_st.h:72
#1  0x5636a8cf8f42      in  cio_file_st_get_content() at lib/chunkio/include/chunkio/cio_file_st.h:93
#2  0x5636a8cf93f1      in  cio_chunk_get_content() at lib/chunkio/src/cio_chunk.c:193
#3  0x5636a8a964f5      in  flb_input_chunk_flush() at src/flb_input_chunk.c:550
#4  0x5636a8a7c823      in  flb_engine_dispatch() at src/flb_engine_dispatch.c:146
#5  0x5636a8a79b22      in  flb_engine_flush() at src/flb_engine.c:85
#6  0x5636a8a7b318      in  flb_engine_handle_event() at src/flb_engine.c:247
#7  0x5636a8a7b318      in  flb_engine_start() at src/flb_engine.c:489
#8  0x5636a89ec813      in  main() at src/fluent-bit.c:853
#9  0x7f75cf5ecb96      in  ???() at ???:0
#10 0x5636a89ea9d9      in  ???() at ???:0
#11 0xffffffffffffffff  in  ???() at ???:0
Aborted (core dumped)
  • Steps to reproduce the problem:
    In one example we had 1.2 GB of chunk files stuck in storage, and the crash reproduced consistently. After I moved the files out of the storage path and restarted, it didn't crash; when I moved them back, it didn't seem to crash either. It might be hard to reproduce on demand, but it happens regularly in our production environment.
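The workaround mentioned above (moving the chunk files out of the storage path so Fluent Bit can start, then inspecting or restoring them later) can be sketched as a small script. The `quarantine_chunks` helper and all paths here are illustrative assumptions, not part of Fluent Bit; in a real deployment the storage path would be the configured `storage.path` (the reporter used `/var/lib/fluentbit/storage/`).

```shell
#!/bin/sh
# Hypothetical workaround sketch (not part of Fluent Bit): move *.flb chunk
# files out of the storage path, preserving the per-input subdirectory
# layout (tail.0/, systemd.2/, ...) so they can be inspected or restored.
quarantine_chunks() {
    storage=$1 quarantine=$2
    find "$storage" -type f -name '*.flb' | while read -r chunk; do
        rel=${chunk#"$storage"/}
        mkdir -p "$quarantine/$(dirname "$rel")"
        mv "$chunk" "$quarantine/$rel"
    done
}

# Demo on a throwaway layout mimicking the chunk names from the report.
demo=/tmp/flb_quarantine_demo
rm -rf "$demo"
mkdir -p "$demo/storage/tail.0" "$demo/storage/systemd.2"
touch "$demo/storage/tail.0/163-1581549864.900113196.flb" \
      "$demo/storage/systemd.2/8-1581461952.525678621.flb"
quarantine_chunks "$demo/storage" "$demo/quarantine"
```

Keeping the subdirectory layout matters because each directory corresponds to an input instance, so files can be moved back one input at a time while narrowing down the offending chunk.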

Expected behavior
No crashes

Screenshots

Your Environment

  • Version used: 1.3.2
  • Configuration:
  td-agent-bit.conf: |
    [SERVICE]
        # Rely on the supervisor service (eg kubelet) to restart
        # the fluentbit daemon when the configuration changes.
        Config_Watch              on
        # Given we run in a container stay in the foreground.
        Daemon                    off
        Flush                     1
        HTTP_Server               on
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        Log_Level                 warning
        Parsers_File              parsers.conf
        storage.path              /var/lib/fluentbit/storage/
        storage.sync              full
        storage.checksum          on
        # from https://github.com/fluent/fluent-bit/issues/1362#issuecomment-500166931
        storage.backlog.mem_limit 100M

Problematic input (one of 3):

[INPUT]
        Name              tail
        Tag               kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
        Tag_Regex         (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/lib/fluentbit/input_tail_kube.db
        Docker_Mode       On
        Mem_Buf_Limit     50MB
        Buffer_Chunk_Size 1MB
        Buffer_Max_Size   1MB
        Skip_Long_Lines   On
        storage.type      filesystem
        Refresh_Interval  10
  • Environment name and version (e.g. Kubernetes? What version?):
    k8s

  • Server type and version:

  • Operating System and version:
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"

  • Filters and plugins:

Additional context
Ideally we would see no crashes. Seems after any crash, the storage files are corrupted and can't be loaded by subsequent runs.

Labels: bug, fixed, waiting-for-user


All 13 comments

@dgrala

Thanks for reporting this issue.

Are you able to reproduce the problem __only__ with chunk tail.0/163-1581549864.900113196.flb?

If so, can you provide me that chunk? (It can be shared privately by email, Slack, or another channel if required.)

Thanks @edsiper for looking into this. I don't have that pod or file anymore. I can verify when it pops up again.

One thing I forgot to mention: on the node that was crashing consistently on start after loading 1.2 GB of storage backlog, we increased storage.backlog.mem_limit from 100 MB to 2 GB and the crash no longer reproduced. So there is some relation between available memory and the files loaded from storage, which might make it hard to reproduce with a single file.
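For reference, the mitigation described above amounts to bumping the backlog limit in the [SERVICE] section of the configuration shown earlier. The value mirrors the 2 GB figure from the comment; this is operational tuning that masked the crash, not a fix for the underlying corruption:

```
[SERVICE]
    storage.path              /var/lib/fluentbit/storage/
    # Raised from 100M: with ~1.2 GB of backlog on disk, the 100M limit
    # coincided with the startup segfault described in this issue.
    storage.backlog.mem_limit 2G
```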

Good news: I was able to reproduce the issue

Troubleshooting now

I am having a similar problem after enabling filesystem storage in 1.3.7

[2020/02/14 15:07:30] [debug] [storage] [cio file] alloc_size from 36864 to 102400
[2020/02/14 15:07:30] [debug] [storage] [cio file] synced at: tail.4/1-1581692850.200312113.flb
[2020/02/14 15:07:30] [debug] [in_tail] file=/var/log/containers/purchase-event-manager-5-vmmb9_gvp_purchase-event-manager-02060e929d3e460da6ac351689cd2f419d6ba192cced026a458110b095d93812.log read=32689 lines=71
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14075faee0 (task_id=33)
[2020/02/14 15:07:30] [debug] [storage] tail.0:1-1581692755.636380503.flb mapped OK
[2020/02/14 15:07:30] [debug] [task] created task=0x7f14075faee0 id=33 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 143790 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 154 entries tag='kube.var.log.containers.message-manager-north-5-f4cvk_tid_message-manager-north-4f3d19b958f53e2dc6e59555ba887569d0153f63a399ef1ba8d80b61f873e316.log' tag_len=148
[2020/02/14 15:07:30] [debug] [task] created task=0x7f140743ef40 id=126 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 1239 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 1 entries tag='kube.var.log.containers.epg-agent-wait-2-46fc8_gvp_epg-agent-wait-e41122f15d432269e0156784949f781883d92669b0c5592300be5123a6060544.log' tag_len=134
[2020/02/14 15:07:30] [ warn] [task] retry for task 7 could not be re-scheduled
[2020/02/14 15:07:30] [debug] [retry] task retry=0x7f140740bae0, invalidated from the scheduler
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14074400c0 (task_id=7)
[engine] caught signal (SIGSEGV)
#0  0x5653af67bebb      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
#1  0x5653af67bef2      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2  0x5653af67c779      in  flb_sched_request_destroy() at src/flb_scheduler.c:314
#3  0x5653af67c8c7      in  flb_sched_event_handler() at src/flb_scheduler.c:375
#4  0x5653af67a17a      in  flb_engine_start() at src/flb_engine.c:548
#5  0x5653af5e8cc6      in  main() at src/fluent-bit.c:854
#6  0x7f14088b72e0      in  ???() at ???:0
#7  0x5653af5e71a9      in  ???() at ???:0
#8  0xffffffffffffffff  in  ???() at ???:0

@dgrala

I've added the proper fixes on 1.3 branch, would you please deploy and evaluate the following test image?

edsiper/fluent-bit-1.3.8-next:1

If you confirm it is OK, I will proceed with a formal release.

Note: that image is based on Ubuntu and is heavier than the normal one.

@rmacian your issue looks different from this one; I've opened #1956 for it. Would you please follow up there?

Thanks for a quick fix @edsiper!

Where can I get the custom release? Normally we use apt-get to install it from https://packages.fluentbit.io/ubuntu/bionic

Ah! I thought you were using containers. Which distro?


@dgrala

Please give the following TEST PACKAGE a try:

DISCLAIMER: THIS IS NOT AN OFFICIAL RELEASE!

@dgrala do you have any feedback on the above?

Thanks @edsiper! We do run containers, but we use apt-get to build them. Thanks for the test package. Anything in particular we should watch out for? We don't have any decent test environment right now, so it will take us some time to set things up.

A new issue we're seeing today is td-agent-bit hanging and making no progress: https://github.com/fluent/fluent-bit/issues/1958. Not sure if it's related; we're still on the old version.

Thanks, haven't seen crashes lately. We've updated to 1.3.8 today.

