Fluent-bit: SIGSEGV while getting kubernetes metadata

Created on 27 Aug 2020 · 15 comments · Source: fluent/fluent-bit

Bug Report

Describe the bug

We're running fluent-bit as a DaemonSet on Kubernetes 1.18.6. With either fluent-bit 1.5.4 or 1.5.3, the pods on some of the nodes go into a crash loop. Interestingly, pods on other nodes function correctly.

Running 1.5.2 results in all pods running correctly without any restart.

While pods are restarting the following line appears in the api server logs:

http: TLS handshake error from <ip>:<port>: EOF

The fact that some of the pods work suggests the problem may not be in fluent-bit, but since 1.5.2 works perfectly, I thought it was worth asking here in case you have any insight.

Logs of one of the failing pods:

Fluent Bit v1.5.3
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/08/26 14:13:36] [ info] Configuration:
[2020/08/26 14:13:36] [ info]  flush time     | 1.000000 seconds
[2020/08/26 14:13:36] [ info]  grace          | 5 seconds
[2020/08/26 14:13:36] [ info]  daemon         | 0
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info]  inputs:
[2020/08/26 14:13:36] [ info]      tail
[2020/08/26 14:13:36] [ info]      systemd
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info]  filters:
[2020/08/26 14:13:36] [ info]      kubernetes.0
[2020/08/26 14:13:36] [ info]      parser.1
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info]  outputs:
[2020/08/26 14:13:36] [ info]      forward.0
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info]  collectors:
[2020/08/26 14:13:36] [ info] [engine] started (pid=1)
[2020/08/26 14:13:36] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/08/26 14:13:36] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/08/26 14:13:36] [debug] [storage] [cio stream] new stream registered: systemd.1
[2020/08/26 14:13:36] [ info] [storage] version=1.0.5, initializing...
[2020/08/26 14:13:36] [ info] [storage] in-memory
[2020/08/26 14:13:36] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inotify watch fd=20
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] scanning path /var/log/containers/*.log
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inode=516596 appended file following on offset=127922
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inode=516596 appended as /var/log/containers/xxxx.log
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/xxxx.log, inode 516596
.
.
.
[2020/08/26 14:13:37] [debug] [input:tail:tail.0] 1489 new files found on path '/var/log/containers/*.log'
[2020/08/26 14:13:37] [debug] [input:systemd:systemd.1] add filter: _SYSTEMD_UNIT=kubelet.service (or)
[2020/08/26 14:13:37] [ info] [input:systemd:systemd.1] seek_cursor=s=56e3151719034948bee60d26cd7540f6;i=1cd... OK
[2020/08/26 14:13:37] [debug] [input:systemd:systemd.1] sd_journal library may truncate values to sd_journal_get_data_threshold() bytes: 65536
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[engine] caught signal (SIGSEGV)
#0  0x55d5033ac151      in  flb_io_net_connect() at src/flb_io.c:361
#1  0x55d5033969a9      in  create_conn() at src/flb_upstream.c:256
#2  0x55d503396d6f      in  flb_upstream_conn_get() at src/flb_upstream.c:375
#3  0x55d5034238cd      in  get_api_server_info() at plugins/filter_kubernetes/kube_meta.c:213
#4  0x55d503425e84      in  flb_kube_meta_init() at plugins/filter_kubernetes/kube_meta.c:1000
#5  0x55d503421370      in  cb_kube_init() at plugins/filter_kubernetes/kubernetes.c:199
#6  0x55d5033853fc      in  flb_filter_init_all() at src/flb_filter.c:431
#7  0x55d50339098a      in  flb_engine_start() at src/flb_engine.c:457
#8  0x55d5033052f4      in  flb_main() at src/fluent-bit.c:1035
#9  0x55d503305342      in  main() at src/fluent-bit.c:1048
#10 0x7f628992209a      in  ???() at ???:0
#11 0x55d503302fd9      in  ???() at ???:0
#12 0xffffffffffffffff  in  ???() at ???:0

All 15 comments

Would you please check if you can reproduce it with v1.5.6?

https://fluentbit.io/announcements/v1.5.6/

@edsiper still failing with v1.5.6

[2020/09/13 06:46:59] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[engine] caught signal (SIGSEGV)
#0  0x557e17c77275      in  flb_io_net_connect() at src/flb_io.c:349

@edsiper Here is the stack trace when built with debug symbols and trace logging enabled.

I'm not an expert in C or Linux system calls, but looking at the code, the line `connection #1526 in process to kubernetes.default.svc:443` in the log suggests that a select() system call is later made with fd 1526, while select(2) only supports descriptors below FD_SETSIZE (1024). Could this be the issue?

[2020/09/14 16:11:05] [debug] [input:tail:tail.0] 1491 new files found on path '/var/log/containers/*.log'
[2020/09/14 16:11:05] [debug] [input:systemd:systemd.1] add filter: _SYSTEMD_UNIT=kubelet.service (or)
[2020/09/14 16:11:05] [ info] [input:systemd:systemd.1] seek_cursor=s=19e9e15e02524350a646c30ce43bc284;i=1c2... OK
[2020/09/14 16:11:05] [debug] [input:systemd:systemd.1] sd_journal library may truncate values to sd_journal_get_data_threshold() bytes: 65536
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2020/09/14 16:11:05] [trace] [upstream] get new connection for kubernetes.default.svc:443, net setup:
net.connect_timeout        = 10 seconds
net.source_address         = any
net.keepalive              = enabled
net.keepalive_idle_timeout = 30 seconds
[2020/09/14 16:11:05] [trace] [io] connection #1526 in process to kubernetes.default.svc:443
[engine] caught signal (SIGSEGV)
#0  0x55ad96a51a00      in  flb_io_net_connect() at src/flb_io.c:360
#1  0x55ad96a3bc5b      in  create_conn() at src/flb_upstream.c:266
#2  0x55ad96a3c130      in  flb_upstream_conn_get() at src/flb_upstream.c:390
#3  0x55ad96acd8e6      in  get_api_server_info() at plugins/filter_kubernetes/kube_meta.c:213
#4  0x55ad96acfe9d      in  flb_kube_meta_init() at plugins/filter_kubernetes/kube_meta.c:1000
#5  0x55ad96acb389      in  cb_kube_init() at plugins/filter_kubernetes/kubernetes.c:199
#6  0x55ad96a2a2cf      in  flb_filter_init_all() at src/flb_filter.c:431
#7  0x55ad96a35a8d      in  flb_engine_start() at src/flb_engine.c:458
#8  0x55ad969aa130      in  flb_main() at src/fluent-bit.c:1035
#9  0x55ad969aa17e      in  main() at src/fluent-bit.c:1048
#10 0x7faacd71e09a      in  ???() at ???:0
#11 0x55ad969a7e09      in  ???() at ???:0
#12 0xffffffffffffffff  in  ???() at ???:0

@osela would you please enable tls.debug in your Kubernetes filter? I want to know whether there is any exception happening before the crash:

[FILTER]
    name       kubernetes
    ...
    tls.debug  4

@edsiper the log is the same with tls.debug enabled

Are there no extra debug messages before the crash?

nope

In this stack trace, which specific Fluent Bit version did you run?

https://github.com/fluent/fluent-bit/issues/2508#issuecomment-691623682

1.5.6

@edsiper It seems like the kubernetes plugin crashes whenever there are more than 1024 containers (running or completed) on a node, i.e. more than 1024 files in /var/log/containers. Is that possible?

hmmm I will have to test that out.

@edsiper just confirmed that.
I had a node with 1024+ files in /var/log/containers where fluent-bit was crashing. I SSHed into the node and deleted all completed pod logs from /var/log/pods, which caused Kubernetes to delete the corresponding symlinks in /var/log/containers. With fewer than 1024 files in /var/log/containers, the fluent-bit pod started successfully.

@edsiper have you had a chance to look into it?

@edsiper We hit exactly the same issue with 1.5.6: if /var/log/containers has more than 1024 files, the pod crashes and keeps crashing until we remove some pods.

@edsiper Anything we should do to get this issue bumped?
