Fluent-bit: Fluent-bit never recycles TCP connections when using the forward plugin

Created on 2 Dec 2020 · 3 Comments · Source: fluent/fluent-bit

Bug Report

When using fluent-bit to forward logs to fluentd with keepalive enabled, fluent-bit never recycles a TCP connection.

I've determined that Fluent-bit always marks the connection as failed when it tries to reuse it, so you never encounter log lines like [upstream] KA connection #22 to 127.0.0.1:24224 has been assigned (recycled) indicating connection reuse. Instead, you always see log lines like [upstream] KA connection #27 is in a failed state to: 172.17.0.3:24224, cleaning up.

See the Additional Context section for my thoughts on the problem.

To Reproduce
- Steps to reproduce the problem:
1. Run fluentd (I use Docker) with docker run -v $(pwd):/fluent_conf fluent/fluentd:latest /usr/bin/fluentd -c /fluent_conf/minimal_fluentd.conf
2. Run fluent-bit with docker run -v $(pwd):/fluent_conf fluent/fluent-bit:1.6.6 /fluent-bit/bin/fluent-bit -c /fluent_conf/minimal_fluentbit.conf
3. Observe in the debug logs that the connection is never recycled.

(Config files below)

Expected behavior
Fluent-bit recycles connections and prints debug lines like [upstream] KA connection #22 to 127.0.0.1:24224 has been assigned (recycled) to indicate it has done so.

Your Environment

  • Version used: Fluent-bit 1.6.6
  • Environment name and version: Reproduced using fluent/fluent-bit:1.6.6 docker image, running Docker Desktop for Mac 2.5.0.1
  • Operating System and version: Reproduced on macOS 10.15.7, running fluent-bit and fluentd in Docker Desktop 2.5.0.1
  • Configuration:

Minimal fluent-bit config:

[SERVICE]
  Log_Level debug

[INPUT]
  Name dummy
  Rate 10

[OUTPUT]
  Name forward
  Match *
  Host 172.17.0.3   # <-- Replace this with your docker fluentd IP.
  Port 24224
  net.keepalive on
  net.keepalive_idle_timeout 10s

Minimal fluentd config to receive forwarded logs:

<system>
  log_level debug
</system>
<source>
  @type forward
</source>
<match **>
  @type stdout
</match>

Additional context


In production, we use fluent-bit to send traffic to fluentd over TLS from multiple hosts. Because fluent-bit recreates the connection on every flush period, fluentd spends a lot of time in TLS negotiation and its throughput suffers.

The problem appears to be in src/flb_upstream.c, where flb_socket_error is checked. On Linux this corresponds to errno; as far as I can tell, other function calls on the same thread can overwrite that value between a socket error occurring and the value being checked here. In particular, some of the msgpack functions used by the forward plugin appear to set errno, which is why it shows up here, but it is likely a problem for every plugin that uses an upstream. My C skills aren't good enough to figure out the best way to fix this, or I'd raise a PR.
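To make the suspected failure mode concrete, here is a minimal, self-contained C sketch (illustration only, not fluent-bit source; unrelated_work is a hypothetical stand-in for the msgpack calls mentioned above). It shows how an unrelated library call can clobber errno between a socket operation and a deferred check:

/*
 * Sketch of the suspected failure mode. errno is a single thread-wide
 * value, so any call made between a socket operation and a deferred
 * errno check can overwrite it.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for work done between the socket operation and
 * the health check, e.g. msgpack serialization; many libc functions set
 * errno on failure. */
static void unrelated_work(void)
{
    void *p = malloc((size_t) -1);  /* on glibc this fails and sets errno to ENOMEM */
    free(p);                        /* free(NULL) is a no-op */
}

int main(void)
{
    errno = 0;          /* pretend the last socket operation succeeded */
    unrelated_work();   /* ...but an unrelated call clobbers errno     */

    /* A deferred check like this now reports a stale, unrelated error,
     * so a perfectly healthy connection would be treated as failed. */
    if (errno != 0) {
        printf("stale errno: %d (%s)\n", errno, strerror(errno));
    }
    return 0;
}

Compiled and run on Linux, this prints a stale ENOMEM even though no socket error ever occurred, which matches the pattern of a healthy keepalive connection being flagged as failed.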

Fluent-bit output

Fluent Bit v1.6.6
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/12/02 20:13:10] [ info] Configuration:
[2020/12/02 20:13:10] [ info]  flush time     | 5.000000 seconds
[2020/12/02 20:13:10] [ info]  grace          | 5 seconds
[2020/12/02 20:13:10] [ info]  daemon         | 0
[2020/12/02 20:13:10] [ info] ___________
[2020/12/02 20:13:10] [ info]  inputs:
[2020/12/02 20:13:10] [ info]      dummy
[2020/12/02 20:13:10] [ info] ___________
[2020/12/02 20:13:10] [ info]  filters:
[2020/12/02 20:13:10] [ info] ___________
[2020/12/02 20:13:10] [ info]  outputs:
[2020/12/02 20:13:10] [ info]      forward.0
[2020/12/02 20:13:10] [ info] ___________
[2020/12/02 20:13:10] [ info]  collectors:
[2020/12/02 20:13:10] [ info] [engine] started (pid=1)
[2020/12/02 20:13:10] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/12/02 20:13:10] [debug] [storage] [cio stream] new stream registered: dummy.0
[2020/12/02 20:13:10] [ info] [storage] version=1.0.6, initializing...
[2020/12/02 20:13:10] [ info] [storage] in-memory
[2020/12/02 20:13:10] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/12/02 20:13:10] [debug] [forward:forward.0] created event channels: read=20 write=21
[2020/12/02 20:13:10] [debug] [router] match rule dummy.0:forward.0
[2020/12/02 20:13:10] [ info] [sp] stream processor started
[2020/12/02 20:13:14] [debug] [task] created task=0x7f5db926c0a0 id=0 OK
[2020/12/02 20:13:14] [debug] [output:forward:forward.0] request 1118 bytes to flush
[2020/12/02 20:13:14] [debug] [upstream] KA connection #27 to 172.17.0.3:24224 is now available
[2020/12/02 20:13:14] [debug] [task] destroy task=0x7f5db926c0a0 (task_id=0)
[2020/12/02 20:13:19] [debug] [task] created task=0x7f5db926c0a0 id=0 OK
[2020/12/02 20:13:19] [debug] [output:forward:forward.0] request 1300 bytes to flush
[2020/12/02 20:13:19] [debug] [upstream] KA connection #27 is in a failed state to: 172.17.0.3:24224, cleaning up
[2020/12/02 20:13:19] [debug] [upstream] KA connection #27 to 172.17.0.3:24224 is now available
[2020/12/02 20:13:19] [debug] [task] destroy task=0x7f5db926c0a0 (task_id=0)
[2020/12/02 20:13:24] [debug] [task] created task=0x7f5db926c0a0 id=0 OK
[2020/12/02 20:13:24] [debug] [output:forward:forward.0] request 1300 bytes to flush
[2020/12/02 20:13:24] [debug] [upstream] KA connection #27 is in a failed state to: 172.17.0.3:24224, cleaning up
[2020/12/02 20:13:24] [debug] [upstream] KA connection #27 to 172.17.0.3:24224 is now available
[2020/12/02 20:13:24] [debug] [task] destroy task=0x7f5db926c0a0 (task_id=0)

All 3 comments

I have a feeling this is related to the secondary issue I called out in https://github.com/fluent/fluent-bit/issues/2826

In an attempt to avoid this, I tried net.keepalive false in my config per https://docs.fluentbit.io/manual/administration/networking#concepts, but it does not seem to disable the feature, as also noted in #2550. I also tried KeepAlive On as noted in #2189, but that does not seem to change the behavior of the error or the connection usage either.

Thanks for reporting this issue. I did more extensive research and found the buggy code. The fix is already in place on Git master and will be part of the upcoming v1.6.7 release: a6ccee48fca360c4f3a89f44efbe4cffa0a78471

Closing this ticket as fixed.
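For readers stuck on older builds, the usual way to make such a check robust is to ask the socket itself rather than trust the thread-wide errno. Below is a hedged sketch of that general technique (an assumption for illustration, not necessarily what commit a6ccee48 actually does; socket_pending_error is a hypothetical helper name, not fluent-bit's API):

#include <sys/socket.h>

/* SO_ERROR returns and clears the error pending on this specific
 * socket, so the result cannot be clobbered by unrelated calls the
 * way errno can. */
static int socket_pending_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1) {
        return -1;      /* the query itself failed */
    }
    return err;         /* 0 means no pending error on the socket */
}

Saving errno immediately after the failing socket call, before any other function runs, would be another way to avoid the clobbering.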
