Fluent-bit never recycles TCP connections when using the forward plugin. When fluent-bit forwards logs to fluentd with keepalive enabled, it never reuses an existing connection.
I've determined that fluent-bit always marks the connection as failed when it tries to reuse it, so you never see log lines like [upstream] KA connection #22 to 127.0.0.1:24224 has been assigned (recycled), which would indicate reuse. Instead, you always see lines like [upstream] KA connection #27 is in a failed state to: 172.17.0.3:24224, cleaning up.
See the Additional Context section for my thoughts on the problem.
To Reproduce
- Steps to reproduce the problem:
1. Run fluentd (I use Docker) with docker run -v $(pwd):/fluent_conf fluent/fluentd:latest /usr/bin/fluentd -c /fluent_conf/minimal_fluentd.conf
2. Run fluent-bit with docker run -v $(pwd):/fluent_conf fluent/fluent-bit:1.6.6 /fluent-bit/src/fluentbit -c /fluent_conf/minimal_fluentbit.conf
3. Observe in the debug logs that the connection is never recycled.
(Config files below)
Expected behavior
Fluent-bit recycles connections and prints debug lines like [upstream] KA connection #22 to 127.0.0.1:24224 has been assigned (recycled) to indicate it has done so.
Your Environment
Minimal fluent-bit config:
[SERVICE]
    Log_Level debug

[INPUT]
    Name dummy
    Rate 10

[OUTPUT]
    Name forward
    Match *
    # Replace the Host value with your fluentd container's IP
    Host 172.17.0.3
    Port 24224
    net.keepalive on
    net.keepalive_idle_timeout 10s
Minimal fluentd config to receive forwarded logs:
<system>
  log_level debug
</system>

<source>
  @type forward
</source>

<match **>
  @type stdout
</match>
Additional context
In production, we are using fluent-bit to send traffic to fluentd using TLS from multiple hosts. As fluent-bit is recreating the connection every flush period, fluentd is spending a lot of time in TLS negotiation and its throughput is suffering.
The problem appears to be in src/flb_upstream.c where flb_socket_error is checked. This corresponds to errno on Linux; as far as I can tell, other function calls on the same thread can overwrite that value between the socket error occurring and the moment it is checked. In particular, some of the msgpack functions used by the forward plugin appear to set errno, which is why the problem shows up here, but it likely affects every plugin that uses an upstream. My C skills aren't good enough to figure out the best way to fix this, or I'd raise a PR.
Fluent-bit output
Fluent Bit v1.6.6
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2020/12/02 20:13:10] [ info] Configuration:
[2020/12/02 20:13:10] [ info] flush time | 5.000000 seconds
[2020/12/02 20:13:10] [ info] grace | 5 seconds
[2020/12/02 20:13:10] [ info] daemon | 0
[2020/12/02 20:13:10] [ info] ___________
[2020/12/02 20:13:10] [ info] inputs:
[2020/12/02 20:13:10] [ info] dummy
[2020/12/02 20:13:10] [ info] ___________
[2020/12/02 20:13:10] [ info] filters:
[2020/12/02 20:13:10] [ info] ___________
[2020/12/02 20:13:10] [ info] outputs:
[2020/12/02 20:13:10] [ info] forward.0
[2020/12/02 20:13:10] [ info] ___________
[2020/12/02 20:13:10] [ info] collectors:
[2020/12/02 20:13:10] [ info] [engine] started (pid=1)
[2020/12/02 20:13:10] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/12/02 20:13:10] [debug] [storage] [cio stream] new stream registered: dummy.0
[2020/12/02 20:13:10] [ info] [storage] version=1.0.6, initializing...
[2020/12/02 20:13:10] [ info] [storage] in-memory
[2020/12/02 20:13:10] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/12/02 20:13:10] [debug] [forward:forward.0] created event channels: read=20 write=21
[2020/12/02 20:13:10] [debug] [router] match rule dummy.0:forward.0
[2020/12/02 20:13:10] [ info] [sp] stream processor started
[2020/12/02 20:13:14] [debug] [task] created task=0x7f5db926c0a0 id=0 OK
[2020/12/02 20:13:14] [debug] [output:forward:forward.0] request 1118 bytes to flush
[2020/12/02 20:13:14] [debug] [upstream] KA connection #27 to 172.17.0.3:24224 is now available
[2020/12/02 20:13:14] [debug] [task] destroy task=0x7f5db926c0a0 (task_id=0)
[2020/12/02 20:13:19] [debug] [task] created task=0x7f5db926c0a0 id=0 OK
[2020/12/02 20:13:19] [debug] [output:forward:forward.0] request 1300 bytes to flush
[2020/12/02 20:13:19] [debug] [upstream] KA connection #27 is in a failed state to: 172.17.0.3:24224, cleaning up
[2020/12/02 20:13:19] [debug] [upstream] KA connection #27 to 172.17.0.3:24224 is now available
[2020/12/02 20:13:19] [debug] [task] destroy task=0x7f5db926c0a0 (task_id=0)
[2020/12/02 20:13:24] [debug] [task] created task=0x7f5db926c0a0 id=0 OK
[2020/12/02 20:13:24] [debug] [output:forward:forward.0] request 1300 bytes to flush
[2020/12/02 20:13:24] [debug] [upstream] KA connection #27 is in a failed state to: 172.17.0.3:24224, cleaning up
[2020/12/02 20:13:24] [debug] [upstream] KA connection #27 to 172.17.0.3:24224 is now available
[2020/12/02 20:13:24] [debug] [task] destroy task=0x7f5db926c0a0 (task_id=0)
I suspect this is related to the secondary issue I called out in https://github.com/fluent/fluent-bit/issues/2826
As an attempt to avoid this, I tried net.keepalive false in my config per this doc (https://docs.fluentbit.io/manual/administration/networking#concepts), but it does not seem to disable the feature, as possibly also noted in #2550. I also tried KeepAlive On as noted in #2189, but that does not seem to change the error or the connection usage either.
Thanks for reporting this issue. I did more extensive research and found the buggy code. The fix is already in place on Git master and will be part of the next v1.6.7 release: a6ccee48fca360c4f3a89f44efbe4cffa0a78471
Closing this ticket as fixed.
https://fluentbit.io/announcements/v1.6.7/
upgrade to v1.6.7