Fluent-bit: 1.6.5 tail.0 plugin stuck in paused state

Created on 23 Nov 2020 · 10 comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
I upgraded yesterday to 1.6.5 and Fluent Bit stopped transferring logs; the transfer just stops with "tail.0 paused (mem buf overlimit)" and never resumes. A restart did not resume the read process either.

I downgraded to 1.6.4; there it works and the tail.0 input continues reading.

To Reproduce
Start Fluent Bit with a large backlog of unread logs in the tailed files.
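A minimal sketch of a configuration that should hit the same pause; the paths and the deliberately small limit are illustrative, not from the original report:

[SERVICE]
    Flush        1
    Log_Level    info

[INPUT]
    Name          tail
    Path          /tmp/repro/*.log
    DB            /tmp/repro/tail.db
    # A small limit makes the "paused (mem buf overlimit)" state easy to trigger
    Mem_Buf_Limit 1MB

[OUTPUT]
    Name   stdout
    Match  *

With 1.6.4 the tail input pauses and resumes repeatedly until the backlog is drained; with 1.6.5 it reportedly stays paused.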

Expected behavior
Continue reading files

Screenshots

Your Environment

  • Version used: 1.6.5
  • Configuration:
[SERVICE]
    Flush        1
    Daemon       Off
    Log_Level    info
    Parsers_File /fluent-bit/etc/parsers.conf
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Parser           docker
    Tag              kubernetes.*
    Refresh_Interval 5
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    DB               /tail-db/tail-containers-state.db
    DB.Sync          Normal

[FILTER]
    Name                kubernetes
    Match               kubernetes.*
    Kube_Tag_Prefix     kubernetes.var.log.containers.
    Kube_URL            https://kubernetes.default.svc:443
    tls.debug           4
    tls.verify          Off
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    # Do not parse the inner json
    Merge_Log           Off
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On

[FILTER]
    name           lua
    match          *
    script         filters.lua
    call           transform

[OUTPUT]
    Name          forward
    Match         *
    Host          10.x.x.x
    Port          24224

    Retry_Limit   False
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.5.3
  • Server type and version:
  • Operating System and version: CentOS 7
  • Filters and plugins: Lua filter to fix app label
Labels: bug, fixed

Most helpful comment

Found the root cause of the problem and found the fix.

Thinking about how to implement a test case to avoid this kind of issue.

All 10 comments

I also see some bad, unwanted behavior in 1.6.5.

We use filesystem-based storage, and when using storage.total_limit_size (with some high value, like 30G) in the output, the buffer keeps increasing even though the backend/output is clearly reachable. This was not the case with 1.6.4.
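For context, a sketch of the kind of filesystem-buffered setup being described; the paths, sizes and output plugin are illustrative, not the commenter's actual config:

[SERVICE]
    # Enable filesystem buffering for chunks
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 5M

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    storage.type  filesystem

[OUTPUT]
    Name                      forward
    Match                     *
    Host                      10.x.x.x
    Port                      24224
    # Cap the on-disk buffer kept for this output
    storage.total_limit_size  30G

With a reachable backend the on-disk buffer should stay well below the limit as chunks are flushed and deleted; the report above is that under 1.6.5 it keeps growing instead.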

I can confirm that there is some changed behavior with 1.6.5: the logs from the fluent-bit container itself are sent to the output, but no logs from other containers arrive (tested with both the stdout and the loki output plugins).

I also saw the "tail.0 paused (mem buf overlimit)" message once while testing, but could not reproduce it.

Configuration:

  [SERVICE]
    Parsers_File      parsers.conf
    Log_Level         info

  [INPUT]
    Name              tail
    Tag               kube.*
    Path              C:\\var\\log\\containers\\*.log
    Parser            docker
    DB                C:\\fluent-bit\\tail_docker.db
    Mem_Buf_Limit     7MB
    Refresh_Interval  10

  [FILTER]
    Name              kubernetes
    Match             kube.*
    Kube_URL          https://kubernetes.default.svc.cluster.local:443        
    Labels            off
    Merge_Log         on        

  [OUTPUT]
    name                    loki
    Match                   kube.*        
    host                    loki.loki.svc.cluster.local
    port                    3100
    tenant_id               ""        
    labels                  job=containerlogs

Environment:

  • Windows Server Core 2019 Container (mcr.microsoft.com/dotnet/framework/runtime:4.8-windowsservercore-ltsc2019)
  • Kubernetes 1.18.6
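The stdout test mentioned above just swaps the output section for the built-in stdout plugin, e.g.:

  [OUTPUT]
    Name   stdout
    Match  kube.*

If records show up on stdout but never reach Loki, the problem is on the output side; if nothing shows up at all, the tail/filter stage is the one that is stuck.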

Troubleshooting...

Found the root cause of the problem and found the fix.

Thinking about how to implement a test case to avoid this kind of issue.

Could "emitter.1 paused (mem buf overlimit)" be connected? Or is it a general issue on the output side?
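For context: emitter.* inputs are created internally, typically by the rewrite_tag filter, and they have their own memory limit separate from the tail input. A sketch assuming rewrite_tag is in use; the rule, names and limit are illustrative:

[FILTER]
    Name                  rewrite_tag
    Match                 kube.*
    Rule                  $log ^.*$ rewritten.$TAG false
    Emitter_Name          re_emitted
    # Raising this only changes when the emitter pauses; it does not address the 1.6.5 regression itself
    Emitter_Mem_Buf_Limit 20M

So an "emitter.1 paused (mem buf overlimit)" message would point at whichever filter created that emitter rather than at the output plugin directly, though the not-resuming behavior in 1.6.5 looks like the same regression.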

We've got the same issue with v1.6.5 outputting to Fluentd and Loki.

It looks like rolling back to v1.6.4 gets us back up and running, but then we can't use Loki. Is there a ballpark ETA for a fix, so I can figure out whether I need to deploy Promtail again to feed Loki?

FYI: v1.6.6 will be out in a few hours.

Thank you @edsiper as always!!

Please upgrade to v1.6.6:

https://fluentbit.io/announcements/v1.6.6/
