Fluent-bit: Error io after upgrading to fluent-bit 1.6.7 from 1.6.6: [filter:kubernetes:kubernetes.0] upstream connection error

Created on 3 Dec 2020 · 14 Comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
After upgrading from 1.6.6 to 1.6.7, Fluent Bit logs a large number of "upstream connection error" messages.

To Reproduce

  • Example log message if applicable:
[2020/12/03 10:17:36] [error] [filter:kubernetes:kubernetes.0] upstream connection error
[2020/12/03 10:17:36] [error] [io] connection #49 failed to: kubernetes.default.svc:443
[2020/12/03 10:17:36] [error] [filter:kubernetes:kubernetes.0] upstream connection error
[2020/12/03 10:17:36] [error] [io] connection #49 failed to: kubernetes.default.svc:443
[2020/12/03 10:17:36] [error] [filter:kubernetes:kubernetes.0] upstream connection error
[2020/12/03 10:17:36] [error] [io] connection #49 failed to: kubernetes.default.svc:443
[2020/12/03 10:17:36] [error] [filter:kubernetes:kubernetes.0] upstream connection error
[2020/12/03 10:17:36] [error] [io] connection #49 failed to: kubernetes.default.svc:443
[2020/12/03 10:17:36] [error] [filter:kubernetes:kubernetes.0] upstream connection error
[2020/12/03 10:17:36] [error] [io] connection #49 failed to: kubernetes.default.svc:443
[2020/12/03 10:17:36] [error] [filter:kubernetes:kubernetes.0] upstream connection error
[2020/12/03 10:17:36] [error] [io] connection #49 failed to: kubernetes.default.svc:443
[2020/12/03 10:17:36] [error] [filter:kubernetes:kubernetes.0] upstream connection error
  • Steps to reproduce the problem:

Expected behavior
No upstream connection errors are logged.

Your Environment

  • Version used: 1.6.7
  • Configuration:
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_Tag_Prefix     kube.var.log.containers.
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Refresh_Interval  5
    Mem_Buf_Limit     6MB
    Skip_Long_Lines   On
    read_from_head    on
    DB                /tail-db/tail-containers-state.db
    DB.Sync           Off
    DB.locking        true
[OUTPUT]
    Name  es
    Match *
    Host  XXXX
    Port  443
    Logstash_Format On
    Retry_Limit 5
    Type  _doc
    Trace_Error true
    Time_Key @timestamp-flb
    Replace_Dots On

    HTTP_User XXX
    HTTP_Passwd XXXX

    tls on
    tls.verify on

  • Environment name and version (e.g. Kubernetes? What version?): 1.18
  • Server type and version: Amazon AMI Linux 2
  • Operating System and version:
  • Filters and plugins: input tail, output es
Labels: bug, fixed

All 14 comments

I can confirm I am also seeing this issue on AWS EKS after upgrading to the version released a few hours ago. The URL is correct.

Environment name and version (e.g. Kubernetes? What version?): 1.17
Server type and version: Amazon AMI Linux 2
Operating System and version:
Filters and plugins: input tail, filter kubernetes, output http

The problem does not occur in 1.6.6.
The problem is present in 1.6.7.

    [FILTER]
      Name           record_modifier
      Match          *
      Record         cluster_name ${CLUSTER_NAME}

    [FILTER]
      Name                kubernetes
      Match               kube.*
      Kube_URL            https://kubernetes.default.svc.cluster.local:443
      Merge_Log           On
      Merge_Log_Trim      On
      Keep_Log            Off
      K8S-Logging.Parser  On
      K8S-Logging.Exclude Off  
      Labels              On
      Annotations         On
      Buffer_Size         1m

    [FILTER]
      Name    lua
      Match   kube.*
      script  /fluent-bit/etc/dedot.lua
      call    dedot

    [FILTER]
      Name modify
      Match kube.*
      Condition Key_exists kubernetes.labels.app
      Rename kubernetes.labels.app kubernetes.labels.app_name

I can confirm this on EKS 1.18. Release 1.6.7 is affected by this bug.

Taking a look at this.

Troubleshooting:

  • The filter_kubernetes code is the same, with no changes.
  • Socket error handling was modified, so the issue is possibly there.

Root cause of the problem:

  • A regression that did not handle errno properly for connections still in progress.
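
For readers unfamiliar with this class of bug, here is a minimal Python sketch of the general pattern. It is not the Fluent Bit code and not the actual fix in the commits listed below. The function name connect_nonblocking and its timeout parameter are hypothetical. On a non-blocking socket, connect usually reports EINPROGRESS, which means "still connecting" rather than "failed"; the real result is read from SO_ERROR once the socket becomes writable.

```python
import errno
import os
import select
import socket


def connect_nonblocking(host, port, timeout=5.0):
    """Return a connected TCP socket, or raise OSError if the connection fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex((host, port))
        # EINPROGRESS (EWOULDBLOCK on some platforms) is not a failure:
        # the handshake is simply not finished yet.
        if rc not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            raise OSError(rc, os.strerror(rc))
        if rc != 0:
            # Wait until the socket is writable, i.e. the attempt has completed.
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                raise TimeoutError("connection attempt timed out")
            # Only now does SO_ERROR hold the real outcome of the connect.
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err != 0:
                raise OSError(err, os.strerror(err))
        return sock
    except Exception:
        sock.close()
        raise
```

Treating the EINPROGRESS return value as a hard error would make every asynchronous connection attempt look like a failure, which is the kind of symptom reported in the logs above.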

Fixes:

  • 33be5a4d
  • 03daeebc

v1.6.8 is currently being released.

Container images for v1.6.8 are already available, tags:

fluent/fluent-bit:1.6.8
fluent/fluent-bit:latest
fluent/fluent-bit:1.6
fluent/fluent-bit:1.6.8-debug
fluent/fluent-bit:1.6-debug

After updating to 1.6.8, I am seeing a larger number of HTTP health check failures on /. Could there be a socket leak from this?
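
One way to check the socket leak hypothesis on Linux is to watch how many of the process's file descriptors are sockets over time. The following is a rough sketch, not something taken from this thread: the helper name count_socket_fds is hypothetical, and it relies on the /proc/<pid>/fd layout, so it only works on Linux and needs permission to inspect the target process.

```python
import os


def count_socket_fds(pid):
    """Count the file descriptors of a process that refer to sockets (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listing and reading it.
            continue
        # Socket descriptors appear as links of the form "socket:[inode]".
        if target.startswith("socket:"):
            count += 1
    return count
```

Calling this periodically with the Fluent Bit process ID and seeing the number grow without bound would support the leak theory; a stable number would point elsewhere.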

Yes, I am seeing this too. The process shuts down intermittently, with the following strace:

[pid 20022] close(207)                  = 0
[pid 20022] epoll_ctl(8, EPOLL_CTL_DEL, 203, NULL) = 0
[pid 20022] write(203, "\25\3\3\0\32\0\0\0\0\0\0/\346\214P\205{\263\263\35\245\210s\236\334\5\333K[\2\7", 31) = 31
[pid 20022] close(203)                  = 0
[pid 20022] close(165)                  = 0
[pid 20022] close(166)                  = 0
[pid 20022] write(0, "\335\335\335\335\0\0\0\0", 8) = -1 EBADF (Bad file descriptor)
[pid 20022] write(2, "write: Bad file descriptor\n", 27) = 27
[pid 20022] close(181)                  = 0
[pid 20022] close(182)                  = 0
[pid 20022] write(11, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
[pid 20023] <... epoll_wait resumed> [{EPOLLIN, {u32=3070406656, u64=140006119354368}}], 16, -1) = 1
[pid 20022] <... write resumed> )       = 8
[pid 20023] read(10,  <unfinished ...>
[pid 20022] futex(0x7f55b6fff9d0, FUTEX_WAIT, 9, NULL <unfinished ...>
[pid 20023] <... read resumed> "\1\0\0\0\0\0\0\0", 8) = 8
[pid 20023] madvise(0x7f55b67ff000, 8335360, MADV_DONTNEED) = 0
[pid 20023] exit(0)                     = ?
[pid 20023] +++ exited with 0 +++
[pid 20022] <... futex resumed> )       = 0
[pid 20022] close(9)                    = 0
[pid 20022] close(10)                   = 0
[pid 20022] close(11)                   = 0
[pid 20022] close(18)                   = 0
[pid 20022] close(3)                    = 0
[pid 20022] close(4)                    = 0
[pid 20022] close(18)                   = -1 EBADF (Bad file descriptor)
[pid 20022] close(19)                   = 0
[pid 20022] close(6)                    = 0
[pid 20022] close(7)                    = 0
[pid 20022] close(169)                  = 0
[pid 20022] close(170)                  = 0
[pid 20022] close(171)                  = 0
[pid 20022] epoll_ctl(8, EPOLL_CTL_DEL, 167, NULL) = 0
[pid 20022] close(167)                  = 0
[pid 20022] close(184)                  = 0
[pid 20022] close(8)                    = 0
[pid 20022] madvise(0x7f55b8232000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b82b6000, 131072, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594b56000, 2101248, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55893d9000, 5251072, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55886d4000, 4988928, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55880c4000, 5668864, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b6511000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55944f8000, 32768, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b650d000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b648d000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b650f000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55945eb000, 331776, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5ae2000, 36864, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5911000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b58ec000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a1f000, 20480, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5947000, 16384, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5ab3000, 86016, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b58f4000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b6495000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5c1d000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a14000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b58b2000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594d8e000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5930000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594e94000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594250000, 49152, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594aa1000, 49152, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55942a2000, 122880, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594dd7000, 73728, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55947f7000, 73728, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5aca000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5971000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b58e3000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594e83000, 49152, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594768000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5b05000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55943c2000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594a90000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594e2a000, 98304, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594ac0000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a8c000, 110592, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55943cf000, 49152, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5982000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5908000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594b35000, 73728, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594b25000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594689000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f559466a000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594a40000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594582000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b59cb000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a66000, 106496, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a85000, 20480, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55944bd000, 20480, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b59bd000, 20480, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5927000, 28672, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f559464c000, 98304, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f559419b000, 200704, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5940000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5920000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b597a000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b596f000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5900000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b58db000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5afd000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a02000, 61440, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b6487000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b6481000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5b0b000, 4096, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5942000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5c0c000, 20480, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b59e0000, 106496, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55944e0000, 94208, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f559491f000, 147456, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594865000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f559473a000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55944af000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5954000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f559452e000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55949e0000, 98304, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594408000, 540672, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a25000, 32768, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55944d4000, 32768, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594558000, 151552, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a3d000, 32768, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b59d0000, 32768, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55944c5000, 53248, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5ad3000, 57344, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f559453c000, 94208, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594507000, 90112, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b595d000, 36864, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5a49000, 61440, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b598e000, 36864, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f559458b000, 212992, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b649f000, 446464, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55945cb000, 81920, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b63e6000, 12288, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b63ea000, 593920, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5b1d000, 593920, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b58a9000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b585a000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b597e000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b586d000, 12288, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b587d000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5aee000, 36864, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b59a5000, 49152, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5915000, 40960, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b7047000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b587a000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5876000, 8192, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b7052000, 32768, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5871000, 12288, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b702e000, 40960, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f5594497000, 86016, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b585d000, 61440, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b6516000, 16384, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b582d000, 24576, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b61c0000, 753664, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b627e000, 1470464, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5b11000, 40960, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5834000, 143360, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5894000, 69632, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b58b9000, 110592, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5882000, 61440, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b7001000, 36864, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b6521000, 40960, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5baf000, 335872, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b700d000, 122880, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b55ff000, 2281472, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b705f000, 1708032, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b5c25000, 5873664, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b652f000, 2949120, MADV_DONTNEED) = 0
[pid 20022] madvise(0x7f55b75ff000, 8335360, MADV_DONTNEED) = 0
[pid 20022] exit(0)                     = ?
[pid 20022] +++ exited with 0 +++
<... futex resumed> )                   = 0
epoll_ctl(5, EPOLL_CTL_DEL, 6, NULL)    = -1 EBADF (Bad file descriptor)
close(5)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

@edsiper ^ this is causing some serious stability issues.

I am going to open a separate ticket for this.

https://github.com/fluent/fluent-bit/issues/2830 has been created for the 1.6.8 issue described above.

We were running into comparable issues with the aws-cloudwatch plugin which gave us the following error:
Dec 10 08:11:11 ipc1 td-agent-bit[483]: [2020/12/10 08:11:11] [error] [io] connection #69 failed to: logs.eu-central-1.amazonaws.com:443
Hope it helps.

Thank you!
