Describe the bug
We're running fluent-bit as a DaemonSet on Kubernetes 1.18.6. With either fluent-bit 1.5.4 or 1.5.3, the pods on some of the nodes go into a crash loop. Interestingly, pods on other nodes function correctly.
Running 1.5.2 results in all pods running correctly without any restarts.
While the pods are restarting, the following line appears in the API server logs:
http: TLS handshake error from <ip>:<port>: EOF
The fact that some of the pods work suggests the problem may not be in fluent-bit, but because 1.5.2 works perfectly, I thought it was worth asking here to see if you have any insight.
Logs of one of the failing pods:
Fluent Bit v1.5.3
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2020/08/26 14:13:36] [ info] Configuration:
[2020/08/26 14:13:36] [ info] flush time | 1.000000 seconds
[2020/08/26 14:13:36] [ info] grace | 5 seconds
[2020/08/26 14:13:36] [ info] daemon | 0
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info] inputs:
[2020/08/26 14:13:36] [ info] tail
[2020/08/26 14:13:36] [ info] systemd
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info] filters:
[2020/08/26 14:13:36] [ info] kubernetes.0
[2020/08/26 14:13:36] [ info] parser.1
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info] outputs:
[2020/08/26 14:13:36] [ info] forward.0
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info] collectors:
[2020/08/26 14:13:36] [ info] [engine] started (pid=1)
[2020/08/26 14:13:36] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/08/26 14:13:36] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/08/26 14:13:36] [debug] [storage] [cio stream] new stream registered: systemd.1
[2020/08/26 14:13:36] [ info] [storage] version=1.0.5, initializing...
[2020/08/26 14:13:36] [ info] [storage] in-memory
[2020/08/26 14:13:36] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inotify watch fd=20
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] scanning path /var/log/containers/*.log
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inode=516596 appended file following on offset=127922
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inode=516596 appended as /var/log/containers/xxxx.log
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/xxxx.log, inode 516596
.
.
.
[2020/08/26 14:13:37] [debug] [input:tail:tail.0] 1489 new files found on path '/var/log/containers/*.log'
[2020/08/26 14:13:37] [debug] [input:systemd:systemd.1] add filter: _SYSTEMD_UNIT=kubelet.service (or)
[2020/08/26 14:13:37] [ info] [input:systemd:systemd.1] seek_cursor=s=56e3151719034948bee60d26cd7540f6;i=1cd... OK
[2020/08/26 14:13:37] [debug] [input:systemd:systemd.1] sd_journal library may truncate values to sd_journal_get_data_threshold() bytes: 65536
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[engine] caught signal (SIGSEGV)
#0 0x55d5033ac151 in flb_io_net_connect() at src/flb_io.c:361
#1 0x55d5033969a9 in create_conn() at src/flb_upstream.c:256
#2 0x55d503396d6f in flb_upstream_conn_get() at src/flb_upstream.c:375
#3 0x55d5034238cd in get_api_server_info() at plugins/filter_kubernetes/kube_meta.c:213
#4 0x55d503425e84 in flb_kube_meta_init() at plugins/filter_kubernetes/kube_meta.c:1000
#5 0x55d503421370 in cb_kube_init() at plugins/filter_kubernetes/kubernetes.c:199
#6 0x55d5033853fc in flb_filter_init_all() at src/flb_filter.c:431
#7 0x55d50339098a in flb_engine_start() at src/flb_engine.c:457
#8 0x55d5033052f4 in flb_main() at src/fluent-bit.c:1035
#9 0x55d503305342 in main() at src/fluent-bit.c:1048
#10 0x7f628992209a in ???() at ???:0
#11 0x55d503302fd9 in ???() at ???:0
#12 0xffffffffffffffff in ???() at ???:0
Would you please check if you can reproduce it with v1.5.6?
@edsiper still failing with v1.5.6
[2020/09/13 06:46:59] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[engine] caught signal (SIGSEGV)
#0 0x557e17c77275 in flb_io_net_connect() at src/flb_io.c:349
@edsiper Here is the stacktrace when built with debug symbols and trace enabled.
I'm not an expert in C or Linux system calls, but looking at the code, the line "connection #1526 in process to kubernetes.default.svc:443" in the log below suggests that a select() system call is later made with fd 1526, while this page claims the maximum fd the call supports is 1024 (FD_SETSIZE). Could this be the issue? A small sketch of why that would crash follows the stack trace below.
[2020/09/14 16:11:05] [debug] [input:tail:tail.0] 1491 new files found on path '/var/log/containers/*.log'
[2020/09/14 16:11:05] [debug] [input:systemd:systemd.1] add filter: _SYSTEMD_UNIT=kubelet.service (or)
[2020/09/14 16:11:05] [ info] [input:systemd:systemd.1] seek_cursor=s=19e9e15e02524350a646c30ce43bc284;i=1c2... OK
[2020/09/14 16:11:05] [debug] [input:systemd:systemd.1] sd_journal library may truncate values to sd_journal_get_data_threshold() bytes: 65536
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2020/09/14 16:11:05] [trace] [upstream] get new connection for kubernetes.default.svc:443, net setup:
net.connect_timeout = 10 seconds
net.source_address = any
net.keepalive = enabled
net.keepalive_idle_timeout = 30 seconds
[2020/09/14 16:11:05] [trace] [io] connection #1526 in process to kubernetes.default.svc:443
[engine] caught signal (SIGSEGV)
#0 0x55ad96a51a00 in flb_io_net_connect() at src/flb_io.c:360
#1 0x55ad96a3bc5b in create_conn() at src/flb_upstream.c:266
#2 0x55ad96a3c130 in flb_upstream_conn_get() at src/flb_upstream.c:390
#3 0x55ad96acd8e6 in get_api_server_info() at plugins/filter_kubernetes/kube_meta.c:213
#4 0x55ad96acfe9d in flb_kube_meta_init() at plugins/filter_kubernetes/kube_meta.c:1000
#5 0x55ad96acb389 in cb_kube_init() at plugins/filter_kubernetes/kubernetes.c:199
#6 0x55ad96a2a2cf in flb_filter_init_all() at src/flb_filter.c:431
#7 0x55ad96a35a8d in flb_engine_start() at src/flb_engine.c:458
#8 0x55ad969aa130 in flb_main() at src/fluent-bit.c:1035
#9 0x55ad969aa17e in main() at src/fluent-bit.c:1048
#10 0x7faacd71e09a in ???() at ???:0
#11 0x55ad969a7e09 in ???() at ???:0
#12 0xffffffffffffffff in ???() at ???:0
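For what it's worth, here is a minimal standalone sketch of the failure mode I'm suspecting (illustrative C only, not fluent-bit's actual code): select() can only track descriptors below FD_SETSIZE (1024 with glibc), and calling FD_SET() with a larger descriptor writes outside the fd_set, which is undefined behavior and can corrupt memory much like the SIGSEGV above.

/* Illustrative sketch, not fluent-bit code: FD_SET() with a descriptor
 * numbered >= FD_SETSIZE writes past the end of the fd_set. */
#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    int fd = 1526;              /* e.g. "connection #1526" from the trace above */
    fd_set wfds;

    FD_ZERO(&wfds);

    if (fd >= FD_SETSIZE) {
        /* glibc's FD_SET() performs no bounds check; calling it here would
         * write outside 'wfds' (undefined behavior, often a crash). */
        fprintf(stderr, "fd %d >= FD_SETSIZE (%d): FD_SET() would overflow\n",
                fd, FD_SETSIZE);
        return 1;
    }

    FD_SET(fd, &wfds);          /* only safe while fd < FD_SETSIZE */
    return 0;
}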
@osela would you please enable tls.debug in your Kubernetes filter? I want to know if there is any exception happening before the crash:
[FILTER]
name kubernetes
...
tls.debug 4
@edsiper the log is the same with tls.debug enabled
There are no extra debug messages before the crash?
nope
In this stack trace, which specific Fluent Bit version did you run?
https://github.com/fluent/fluent-bit/issues/2508#issuecomment-691623682
1.5.6
@edsiper It seems like the kubernetes plugin crashes whenever there are more than 1024 containers (running or completed) on a node, so more than 1024 files in /var/log/containers. Is that possible?
hmmm I will have to test that out.
@edsiper just confirmed that.
Had a node with 1024+ files in /var/log/containers where fluent-bit was crashing. SSHed into the node and deleted all completed pod logs from /var/log/pods, which caused Kubernetes to delete the corresponding symlinks in /var/log/containers. With fewer than 1024 files in /var/log/containers, the fluent-bit pod started successfully.
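That lines up with the FD_SETSIZE theory: every tailed file keeps a descriptor open, so once more than roughly 1024 files are being tracked, the socket opened towards the API server gets an fd above 1024 and any select()-based wait on it overruns its fd_set. A rough standalone sketch of the trigger (illustrative only, not fluent-bit code; assumes the process fd limit has been raised, e.g. with ulimit -n 4096, and uses /dev/null to stand in for the log files):

/* Rough illustration of the trigger condition, not fluent-bit code. */
#include <stdio.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>

int main(void)
{
    /* Simulate ~1100 tailed log files keeping descriptors open. */
    for (int i = 0; i < 1100; i++) {
        if (open("/dev/null", O_RDONLY) < 0) {
            perror("open");
            return 1;
        }
    }

    /* The next descriptor handed out is now well above FD_SETSIZE. */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    printf("socket fd = %d, FD_SETSIZE = %d\n", sock, FD_SETSIZE);

    if (sock >= FD_SETSIZE) {
        printf("FD_SET()/select() on this socket would be undefined behavior\n");
    }
    return 0;
}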
@edsiper have you had a chance to look into it?
@edsiper We got exactly the same issue with 1.5.6: if /var/log/containers has more than 1024 files, the pod crashes and keeps crashing until we remove some pods.
@edsiper Anything we should do to get this issue bumped?