Describe the bug
It might be related to corrupted files in storage, or just a large backlog there. On start, Fluent Bit tries to load all the files and crashes soon afterwards with a segfault.
To Reproduce
[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581461952.525678621.flb
[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581467643.662417426.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549865.134681941.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549865.134681941.flb
[engine] caught signal (SIGSEGV)
#0 0x5636a8cf8f0e in cio_file_st_get_meta_len() at lib/chunkio/include/chunkio/cio_file_st.h:72
#1 0x5636a8cf8f42 in cio_file_st_get_content() at lib/chunkio/include/chunkio/cio_file_st.h:93
#2 0x5636a8cf93f1 in cio_chunk_get_content() at lib/chunkio/src/cio_chunk.c:193
#3 0x5636a8a964f5 in flb_input_chunk_flush() at src/flb_input_chunk.c:550
#4 0x5636a8a7c823 in flb_engine_dispatch() at src/flb_engine_dispatch.c:146
#5 0x5636a8a79b22 in flb_engine_flush() at src/flb_engine.c:85
#6 0x5636a8a7b318 in flb_engine_handle_event() at src/flb_engine.c:247
#7 0x5636a8a7b318 in flb_engine_start() at src/flb_engine.c:489
#8 0x5636a89ec813 in main() at src/fluent-bit.c:853
#9 0x7f75cf5ecb96 in ???() at ???:0
#10 0x5636a89ea9d9 in ???() at ???:0
#11 0xffffffffffffffff in ???() at ???:0
Aborted (core dumped)
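Frame #0 points at the metadata-length read from a chunk's memory map. Below is a minimal C sketch of that failure mode and a defensive guard, assuming chunkio's documented on-disk layout (2-byte magic 0xC1 0x00, 4-byte CRC32, 16 reserved bytes, then a 2-byte big-endian metadata length at offset 22); the function bodies here are illustrative, not Fluent Bit's actual code:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed chunkio layout: metadata length is a 16-bit big-endian
 * value at byte offset 22 of the mapped file (after the 2-byte
 * magic, 4-byte CRC32 and 16 reserved bytes). */
#define CIO_META_LEN_OFFSET  22
#define CIO_HEADER_SIZE      24

/* Illustrative re-creation of the frame that crashed: it blindly
 * dereferences 'map', so a chunk whose mmap failed (map == NULL,
 * "cannot map chunk") or that is shorter than the header segfaults. */
static uint16_t get_meta_len_unsafe(const unsigned char *map)
{
    return (uint16_t) ((map[CIO_META_LEN_OFFSET] << 8) |
                        map[CIO_META_LEN_OFFSET + 1]);
}

/* Defensive variant: validate the mapping and its size first and
 * report failure instead of crashing. */
static int get_meta_len_safe(const unsigned char *map, size_t map_size,
                             uint16_t *meta_len)
{
    if (map == NULL || map_size < CIO_HEADER_SIZE) {
        return -1;  /* caller should drop or quarantine the chunk */
    }
    *meta_len = get_meta_len_unsafe(map);
    return 0;
}
```

The pairing of "format check failed" with "cannot map chunk" in the log above is consistent with a rejected chunk still reaching the content-retrieval path without a valid mapping.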
Expected behavior
No crashes
Your Environment
td-agent-bit.conf: |
[SERVICE]
# Rely on the supervisor service (e.g. kubelet) to restart
# the fluentbit daemon when the configuration changes.
Config_Watch on
# Given we run in a container stay in the foreground.
Daemon off
Flush 1
HTTP_Server on
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Log_Level warning
Parsers_File parsers.conf
storage.path /var/lib/fluentbit/storage/
storage.sync full
storage.checksum on
# from https://github.com/fluent/fluent-bit/issues/1362#issuecomment-500166931
storage.backlog.mem_limit 100M
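For context on storage.checksum on: chunkio stores a CRC32 for each chunk in the file header and verifies it when the chunk is loaded, which is the check behind the "format check failed" errors above. A rough sketch of that kind of verification, using zlib's crc32() as a stand-in for chunkio's own implementation; the header offsets and the exact byte range the CRC covers are assumptions, not verified against this version:

```c
#include <stdint.h>
#include <stddef.h>
#include <zlib.h>   /* crc32() as a stand-in */

#define CHUNK_MAGIC_0   0xC1
#define CHUNK_MAGIC_1   0x00
#define HEADER_SIZE     24   /* assumed: magic + crc32 + reserved + meta len */
#define CRC_OFFSET      2

/* Returns 0 if the mapped chunk passes the format check, -1 otherwise.
 * Illustrative only; offsets and CRC coverage are assumptions. */
static int chunk_format_check(const unsigned char *map, size_t size)
{
    uint32_t stored, computed;

    if (size < HEADER_SIZE) {
        return -1;                        /* truncated file */
    }
    if (map[0] != CHUNK_MAGIC_0 || map[1] != CHUNK_MAGIC_1) {
        return -1;                        /* bad magic bytes */
    }

    /* stored CRC32 assumed big-endian in bytes 2..5 */
    stored = ((uint32_t) map[CRC_OFFSET] << 24) |
             ((uint32_t) map[CRC_OFFSET + 1] << 16) |
             ((uint32_t) map[CRC_OFFSET + 2] << 8) |
              (uint32_t) map[CRC_OFFSET + 3];

    computed = crc32(0L, map + HEADER_SIZE, (uInt) (size - HEADER_SIZE));
    return (computed == stored) ? 0 : -1; /* "format check failed" */
}
```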
Problematic input (one of 3):
[INPUT]
Name tail
Tag kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
Path /var/log/containers/*.log
Parser docker
DB /var/lib/fluentbit/input_tail_kube.db
Docker_Mode On
Mem_Buf_Limit 50MB
Buffer_Chunk_Size 1MB
Buffer_Max_Size 1MB
Skip_Long_Lines On
storage.type filesystem
Refresh_Interval 10
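As a hypothetical worked example of that Tag_Regex (the names below are made up): a file /var/log/containers/nginx-6d4cf56db6-x2k9p_default_nginx-<64-hex-id>.log would capture pod_name=nginx-6d4cf56db6-x2k9p, namespace_name=default, container_name=nginx and docker_id=<64-hex-id>, producing the tag kube.default.nginx-6d4cf56db6-x2k9p.nginx.<64-hex-id>.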
Environment name and version (e.g. Kubernetes? What version?):
k8s
Operating System and version:
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
Additional context
Ideally we would see no crashes. It seems that after any crash the storage files are corrupted and can't be loaded by subsequent runs.
@dgrala
thanks for reporting this issue.
Are you able to reproduce the problem __only__ with chunk tail.0/163-1581549864.900113196.flb?
If so, can you provide me with that chunk? (It can be shared privately by email, Slack, or another channel if required.)
Thanks @edsiper for looking into this. I don't have that pod or file anymore. I can verify when it pops up again.
One thing I forgot to mention: on the node that was crashing consistently on start after loading a 1.2 GB storage backlog, we increased storage.backlog.mem_limit from 100 MB to 2 GB and the crash no longer reproduced. So there is some relation between available memory and the files loaded from storage, which might make it hard to reproduce with a single file.
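To make that interaction concrete: storage.backlog.mem_limit caps how much of the on-disk backlog is brought into memory at once, so raising it changes which chunks get mapped at startup. A simplified, hypothetical sketch of such a loader loop; the names and structure here are invented for illustration and are not Fluent Bit's actual internals:

```c
#include <stddef.h>

/* Hypothetical types/names for illustration only. */
struct chunk {
    size_t size;
    struct chunk *next;
};

/* Load backlog chunks from disk until the memory budget is spent.
 * Chunks beyond the limit remain on disk and are picked up later,
 * so raising the limit changes which (possibly corrupted) chunks
 * get mapped during startup. */
static size_t load_backlog(struct chunk *head, size_t mem_limit,
                           int (*map_chunk)(struct chunk *))
{
    size_t used = 0;
    struct chunk *c;

    for (c = head; c != NULL; c = c->next) {
        if (used + c->size > mem_limit) {
            break;              /* budget spent; defer the rest */
        }
        if (map_chunk(c) != 0) {
            continue;           /* skip chunks that fail validation */
        }
        used += c->size;
    }
    return used;
}
```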
Good news: I was able to reproduce the issue
Troubleshooting now
I am having a similar problem after enabling filesystem storage in 1.3.7
[2020/02/14 15:07:30] [debug] [storage] [cio file] alloc_size from 36864 to 102400
[2020/02/14 15:07:30] [debug] [storage] [cio file] synced at: tail.4/1-1581692850.200312113.flb
[2020/02/14 15:07:30] [debug] [in_tail] file=/var/log/containers/purchase-event-manager-5-vmmb9_gvp_purchase-event-manager-02060e929d3e460da6ac351689cd2f419d6ba192cced026a458110b095d93812.log read=32689 lines=71
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14075faee0 (task_id=33)
[2020/02/14 15:07:30] [debug] [storage] tail.0:1-1581692755.636380503.flb mapped OK
[2020/02/14 15:07:30] [debug] [task] created task=0x7f14075faee0 id=33 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 143790 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 154 entries tag='kube.var.log.containers.message-manager-north-5-f4cvk_tid_message-manager-north-4f3d19b958f53e2dc6e59555ba887569d0153f63a399ef1ba8d80b61f873e316.log' tag_len=148
[2020/02/14 15:07:30] [debug] [task] created task=0x7f140743ef40 id=126 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 1239 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 1 entries tag='kube.var.log.containers.epg-agent-wait-2-46fc8_gvp_epg-agent-wait-e41122f15d432269e0156784949f781883d92669b0c5592300be5123a6060544.log' tag_len=134
[2020/02/14 15:07:30] [ warn] [task] retry for task 7 could not be re-scheduled
[2020/02/14 15:07:30] [debug] [retry] task retry=0x7f140740bae0, invalidated from the scheduler
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14074400c0 (task_id=7)
[engine] caught signal (SIGSEGV)
#0 0x5653af67bebb in __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
#1 0x5653af67bef2 in mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2 0x5653af67c779 in flb_sched_request_destroy() at src/flb_scheduler.c:314
#3 0x5653af67c8c7 in flb_sched_event_handler() at src/flb_scheduler.c:375
#4 0x5653af67a17a in flb_engine_start() at src/flb_engine.c:548
#5 0x5653af5e8cc6 in main() at src/fluent-bit.c:854
#6 0x7f14088b72e0 in ???() at ???:0
#7 0x5653af5e71a9 in ???() at ???:0
#8 0xffffffffffffffff in ???() at ???:0
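Frame #0 in this second trace is the unlink step of Monkey's kernel-style intrusive list, which typically dies when a node is deleted twice (its prev/next pointers are NULL after the first unlink). A minimal sketch of the pattern and a generic guard; the list layout and mk_list_del's poisoning of the links match the real header, but the guard is a generic defensive pattern, not necessarily the fix that landed:

```c
#include <stddef.h>

/* Kernel-style intrusive list as used by Monkey (lib/monkey). */
struct mk_list {
    struct mk_list *prev;
    struct mk_list *next;
};

static void __mk_list_del(struct mk_list *prev, struct mk_list *next)
{
    prev->next = next;   /* crashes if prev is NULL (frame #0 above) */
    next->prev = prev;
}

static void mk_list_del(struct mk_list *entry)
{
    __mk_list_del(entry->prev, entry->next);
    entry->prev = NULL;  /* poison the node after unlinking */
    entry->next = NULL;
}

/* Generic guard: a node whose links were poisoned by a previous
 * mk_list_del() is detectable, so a double delete can be skipped
 * instead of dereferencing NULL. */
static void list_del_safe(struct mk_list *entry)
{
    if (entry->prev == NULL || entry->next == NULL) {
        return;          /* already unlinked; deleting again segfaults */
    }
    mk_list_del(entry);
}
```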
@dgrala
I've added the proper fixes on the 1.3 branch; would you please deploy and evaluate the following test image?
edsiper/fluent-bit-1.3.8-next:1
If you confirm it is OK, I will proceed with a formal release.
Note: that image is based on Ubuntu and is heavier than the normal one.
@rmacian your issue looks different from this one; I've opened #1956 for it. Would you please follow up there?
Thanks for a quick fix @edsiper!
Where can I get the custom release? Normally we use apt-get to install it from https://packages.fluentbit.io/ubuntu/bionic
Ah! I thought you were using containers. Which distro?
@dgrala
Please give the following TEST PACKAGE a try:
DISCLAIMER: THIS IS NOT AN OFFICIAL RELEASE!
@dgrala do you have any feedback on the above?
Thanks @edsiper! We do run containers, but we use apt-get to build them. Thanks for the test package. Anything in particular we should watch out for? We don't have any decent test environment right now, so it will take us some time to set things up.
A new issue we're seeing today is td-agent-bit hanging and making no progress (https://github.com/fluent/fluent-bit/issues/1958). Not sure if it's related; we're still on the old version.
Thanks, haven't seen crashes lately. We've updated to 1.3.8 today.