Describe the bug
We're running fluent-bit as a DaemonSet on Kubernetes 1.18.6. With either fluent-bit 1.5.4 or 1.5.3, the pods on some of the nodes go into a crash loop. Interestingly, pods on other nodes function correctly.
Running 1.5.2 results in all pods running correctly without any restarts.
While the pods are restarting, the following line appears in the API server logs:
http: TLS handshake error from <ip>:<port>: EOF
The fact that some of the pods work suggests the problem may not be in fluent-bit, but because 1.5.2 works perfectly, I thought it was worth asking here to see if you have any insight.
Logs of one of the failing pods:
Fluent Bit v1.5.3
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2020/08/26 14:13:36] [ info] Configuration:
[2020/08/26 14:13:36] [ info] flush time | 1.000000 seconds
[2020/08/26 14:13:36] [ info] grace | 5 seconds
[2020/08/26 14:13:36] [ info] daemon | 0
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info] inputs:
[2020/08/26 14:13:36] [ info] tail
[2020/08/26 14:13:36] [ info] systemd
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info] filters:
[2020/08/26 14:13:36] [ info] kubernetes.0
[2020/08/26 14:13:36] [ info] parser.1
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info] outputs:
[2020/08/26 14:13:36] [ info] forward.0
[2020/08/26 14:13:36] [ info] ___________
[2020/08/26 14:13:36] [ info] collectors:
[2020/08/26 14:13:36] [ info] [engine] started (pid=1)
[2020/08/26 14:13:36] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/08/26 14:13:36] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/08/26 14:13:36] [debug] [storage] [cio stream] new stream registered: systemd.1
[2020/08/26 14:13:36] [ info] [storage] version=1.0.5, initializing...
[2020/08/26 14:13:36] [ info] [storage] in-memory
[2020/08/26 14:13:36] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inotify watch fd=20
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] scanning path /var/log/containers/*.log
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inode=516596 appended file following on offset=127922
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] inode=516596 appended as /var/log/containers/xxxx.log
[2020/08/26 14:13:36] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/xxxx.log, inode 516596
.
.
.
[2020/08/26 14:13:37] [debug] [input:tail:tail.0] 1489 new files found on path '/var/log/containers/*.log'
[2020/08/26 14:13:37] [debug] [input:systemd:systemd.1] add filter: _SYSTEMD_UNIT=kubelet.service (or)
[2020/08/26 14:13:37] [ info] [input:systemd:systemd.1] seek_cursor=s=56e3151719034948bee60d26cd7540f6;i=1cd... OK
[2020/08/26 14:13:37] [debug] [input:systemd:systemd.1] sd_journal library may truncate values to sd_journal_get_data_threshold() bytes: 65536
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2020/08/26 14:13:37] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[engine] caught signal (SIGSEGV)
#0 0x55d5033ac151 in flb_io_net_connect() at src/flb_io.c:361
#1 0x55d5033969a9 in create_conn() at src/flb_upstream.c:256
#2 0x55d503396d6f in flb_upstream_conn_get() at src/flb_upstream.c:375
#3 0x55d5034238cd in get_api_server_info() at plugins/filter_kubernetes/kube_meta.c:213
#4 0x55d503425e84 in flb_kube_meta_init() at plugins/filter_kubernetes/kube_meta.c:1000
#5 0x55d503421370 in cb_kube_init() at plugins/filter_kubernetes/kubernetes.c:199
#6 0x55d5033853fc in flb_filter_init_all() at src/flb_filter.c:431
#7 0x55d50339098a in flb_engine_start() at src/flb_engine.c:457
#8 0x55d5033052f4 in flb_main() at src/fluent-bit.c:1035
#9 0x55d503305342 in main() at src/fluent-bit.c:1048
#10 0x7f628992209a in ???() at ???:0
#11 0x55d503302fd9 in ???() at ???:0
#12 0xffffffffffffffff in ???() at ???:0
Would you please check if you can reproduce it with v1.5.6?
@edsiper still failing with v1.5.6
[2020/09/13 06:46:59] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[engine] caught signal (SIGSEGV)
#0 0x557e17c77275 in flb_io_net_connect() at src/flb_io.c:349
@edsiper Here is the stacktrace when built with debug symbols and trace enabled.
I'm not an expert in C or Linux system calls, but looking at the code, the line "connection #1526 in process to kubernetes.default.svc:443" in the log below suggests that a select() system call is later made with fd 1526, while this page claims the maximum fd the call supports is 1024 (FD_SETSIZE). Could this be the issue? A small sketch of why that would crash follows the stack trace below.
[2020/09/14 16:11:05] [debug] [input:tail:tail.0] 1491 new files found on path '/var/log/containers/*.log'
[2020/09/14 16:11:05] [debug] [input:systemd:systemd.1] add filter: _SYSTEMD_UNIT=kubelet.service (or)
[2020/09/14 16:11:05] [ info] [input:systemd:systemd.1] seek_cursor=s=19e9e15e02524350a646c30ce43bc284;i=1c2... OK
[2020/09/14 16:11:05] [debug] [input:systemd:systemd.1] sd_journal library may truncate values to sd_journal_get_data_threshold() bytes: 65536
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2020/09/14 16:11:05] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2020/09/14 16:11:05] [trace] [upstream] get new connection for kubernetes.default.svc:443, net setup:
net.connect_timeout = 10 seconds
net.source_address = any
net.keepalive = enabled
net.keepalive_idle_timeout = 30 seconds
[2020/09/14 16:11:05] [trace] [io] connection #1526 in process to kubernetes.default.svc:443
[engine] caught signal (SIGSEGV)
#0 0x55ad96a51a00 in flb_io_net_connect() at src/flb_io.c:360
#1 0x55ad96a3bc5b in create_conn() at src/flb_upstream.c:266
#2 0x55ad96a3c130 in flb_upstream_conn_get() at src/flb_upstream.c:390
#3 0x55ad96acd8e6 in get_api_server_info() at plugins/filter_kubernetes/kube_meta.c:213
#4 0x55ad96acfe9d in flb_kube_meta_init() at plugins/filter_kubernetes/kube_meta.c:1000
#5 0x55ad96acb389 in cb_kube_init() at plugins/filter_kubernetes/kubernetes.c:199
#6 0x55ad96a2a2cf in flb_filter_init_all() at src/flb_filter.c:431
#7 0x55ad96a35a8d in flb_engine_start() at src/flb_engine.c:458
#8 0x55ad969aa130 in flb_main() at src/fluent-bit.c:1035
#9 0x55ad969aa17e in main() at src/fluent-bit.c:1048
#10 0x7faacd71e09a in ???() at ???:0
#11 0x55ad969a7e09 in ???() at ???:0
#12 0xffffffffffffffff in ???() at ???:0
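For what it's worth, here is a minimal standalone sketch of the failure mode I'm suspecting (illustrative C only, not fluent-bit's actual code): select() can only track descriptors below FD_SETSIZE (1024 with glibc), and calling FD_SET() with a larger descriptor writes outside the fd_set, which is undefined behavior and can corrupt memory much like the SIGSEGV above.

/* Illustrative sketch, not fluent-bit code: FD_SET() with a descriptor
 * numbered >= FD_SETSIZE writes past the end of the fd_set. */
#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    int fd = 1526;              /* e.g. "connection #1526" from the trace above */
    fd_set wfds;

    FD_ZERO(&wfds);

    if (fd >= FD_SETSIZE) {
        /* glibc's FD_SET() performs no bounds check; calling it here would
         * write outside 'wfds' (undefined behavior, often a crash). */
        fprintf(stderr, "fd %d >= FD_SETSIZE (%d): FD_SET() would overflow\n",
                fd, FD_SETSIZE);
        return 1;
    }

    FD_SET(fd, &wfds);          /* only safe while fd < FD_SETSIZE */
    return 0;
}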
@osela would you please enable tls.debug in your Kubernetes filter? I want to know if there is any exception happening before the crash:
[FILTER]
name kubernetes
...
tls.debug 4
@edsiper the log is the same with tls.debug enabled
There are no extra debug messages before the crash?
nope
In this stack trace, which specific Fluent Bit version did you run?
https://github.com/fluent/fluent-bit/issues/2508#issuecomment-691623682
1.5.6
@edsiper It seems like the kubernetes plugin crashes whenever there are more than 1024 containers (running or completed) on a node, so more than 1024 files in /var/log/containers. Is that possible?
hmmm I will have to test that out.
@edsiper just confirmed that.
Had a node with 1024+ files in /var/log/containers where fluent-bit was crashing. SSHed into the node and deleted all completed pod logs from /var/log/pods, which caused Kubernetes to delete the corresponding symlinks in /var/log/containers. With fewer than 1024 files in /var/log/containers, the fluent-bit pod started successfully.
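That lines up with the FD_SETSIZE theory: every tailed file keeps a descriptor open, so once more than roughly 1024 files are being tracked, the socket opened towards the API server gets an fd above 1024 and any select()-based wait on it overruns its fd_set. A rough standalone sketch of the trigger (illustrative only, not fluent-bit code; assumes the process fd limit has been raised, e.g. with ulimit -n 4096, and uses /dev/null to stand in for the log files):

/* Rough illustration of the trigger condition, not fluent-bit code. */
#include <stdio.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>

int main(void)
{
    /* Simulate ~1100 tailed log files keeping descriptors open. */
    for (int i = 0; i < 1100; i++) {
        if (open("/dev/null", O_RDONLY) < 0) {
            perror("open");
            return 1;
        }
    }

    /* The next descriptor handed out is now well above FD_SETSIZE. */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    printf("socket fd = %d, FD_SETSIZE = %d\n", sock, FD_SETSIZE);

    if (sock >= FD_SETSIZE) {
        printf("FD_SET()/select() on this socket would be undefined behavior\n");
    }
    return 0;
}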
@edsiper have you had a chance to look into it?
@edsiper We got exactly the same issue with 1.5.6: if /var/log/containers has more than 1024 files, the pod crashes and keeps crashing until we remove some pods.
@edsiper Anything we should do to get this issue bumped?