First of all, thank you for creating this project and for diligently working to make it better. I am trying to build a log pipeline for our metrics tracking using application logs. I installed Fluent Bit on the app server; it tails the logs and forwards them (via an upstream) to a Fluentd box for aggregation.
Describe the bug
When Fluent Bit is restarted (or stopped and started), logs are lost. I believe the records still in the memory buffer are dropped when the service stops; when it starts again, forwarding resumes several documents later. By varying the Mem_Buf_Limit setting I was able to increase or decrease the number of logs lost across a restart.
To Reproduce
{ mssg: 'the value is ', i: 1000 }
{ mssg: 'the value is ', i: 1001 }
{ mssg: 'the value is ', i: 1002 }
{ mssg: 'the value is ', i: 1003 }
{ mssg: 'the value is ', i: 1004 }
{ mssg: 'the value is ', i: 1005 }
{ mssg: 'the value is ', i: 1006 }
{ mssg: 'the value is ', i: 1007 }
{ mssg: 'the value is ', i: 1008 }
Note: my script created 10M lines to test various features.
2020-06-18T16:09:30+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 391336 }","td_host":"poc-abhi"}
2020-06-18T16:10:43+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 399626 }","td_host":"poc-abhi"}
In this case 8290 lines were lost; the number grows as Mem_Buf_Limit is increased.
Expected behavior
It would be great if logs were not dropped during a restart. (Perhaps the offset stored in the DB should track the position of the last log that was actually sent out, not just the last one read into the engine.) On shutdown, records still in the buffer should be flushed to the filesystem, and on startup those files should be processed before new logs are read.
Your Environment
[SERVICE]
    Flush                     1
    Daemon                    Off
    Log_Level                 trace
    Log_File                  /var/log/td_bit.log
    Parsers_File              parsers.conf
    Plugins_File              plugins.conf
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_Port                 2020
    storage.path              /var/log/tdbit_storage/
    storage.sync              full
    storage.checksum          off
    storage.backlog.mem_limit 1M

[INPUT]
    Name              tail
    Tag               abhi_event_manager
    Path              /var/log/test_logs/test15.log
    Db                /var/log/td.db
    Mem_Buf_Limit     1M
    Parser            json
    Buffer_Chunk_Size 1k
    Buffer_Max_Size   1k
    storage.type      memory

[FILTER]
    Name   record_modifier
    Match  *
    Record td_host poc-abhi

[OUTPUT]
    Name          forward
    Match         abhi_event_manager
    Upstream      upstream.conf
    Self_Hostname poc-abhi
    Retry_Limit   False
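The upstream.conf referenced by the forward output is not included here. As a rough sketch only, such a file contains an [UPSTREAM] section plus one [NODE] entry per Fluentd endpoint; the node name, host, and port below are hypothetical placeholders, not the values actually used:

[UPSTREAM]
    name forward-balancing

[NODE]
    # hypothetical endpoint; replace with the real Fluentd host and port
    name fluentd-node-1
    host 10.0.0.10
    port 24224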
On the Fluentd side, for testing, I am just writing the logs to a file (config below):
<match abhi_event_manager>
  @type file
  path /var/log/abhi_evm
</match>
The goal is to forward logs with Fluent Bit from the application servers to a centralized Fluentd, where we aggregate the log events for metrics reporting. Losing logs therefore leads to inaccurate metrics.
Thanks for reaching out.
Note that the storage type in your tail input section is memory. If you want full reliability across restarts, you have to set storage.type to filesystem.
When filesystem storage is enabled it acts as a backup; if it is not enabled, you may run into exactly what you have described.
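For illustration only (not a verified fix), the suggested change expressed against the configuration posted above would look roughly like this; the paths are the ones already in that config:

[SERVICE]
    # chunks are persisted under this path when filesystem buffering is used
    storage.path              /var/log/tdbit_storage/
    storage.sync              full
    storage.backlog.mem_limit 1M

[INPUT]
    Name         tail
    Tag          abhi_event_manager
    Path         /var/log/test_logs/test15.log
    Db           /var/log/td.db
    Parser       json
    # buffer chunks on the filesystem instead of only in memory
    storage.type filesystem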
I tried with filesystem as well; I lost the same number of logs.
Can you share your script so I can try to reproduce locally?
test.log.gz
Sure, here you go. Also, I appreciate your super quick response. (I zipped the input log file because it was fairly large; hopefully that is not a problem.)
Let me know if you need any more info.
Hi @edsiper, did you have any luck with the issue?
Also, can you please remove the labels "fixed" and "not-an-issue"?
I also ran into this problem. I hope a configuration can be provided that does not lose logs when restarting. Thank you.
Hi, I'm experiencing the same issue. Please let me know if there is anything I can provide. Here is my configuration; the paths /tail-db and /tmp/flb-storage are volumes mounted from the host OS. My test scenario makes the Fluentd collector unreachable (via an AWS security group that restricts egress from the forwarder) and then restarts the Fluent Bit forwarder; any unflushed logs are lost.
[SERVICE]
    Flush           1
    Daemon          Off
    Log_Level       info
    Parsers_File    parsers.conf
    Parsers_File    parsers_custom.conf
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_Port       2020
    storage.path    /tmp/flb-storage
    storage.metrics on

[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Parser           docker
    Tag              kubernetes.*
    Refresh_Interval 5
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    DB               /tail-db/tail-containers-state.db
    DB.Sync          Normal
    storage.type     filesystem

[INPUT]
    Name              tail
    Path              /var/log/kube-apiserver-audit.log
    Parser            docker
    DB                /var/log/audit.db
    Tag               log.lab.kubernetes.master.audit*
    Refresh_Interval  5
    Mem_Buf_Limit     35MB
    Buffer_Chunk_Size 2MB
    Buffer_Max_Size   10MB
    Skip_Long_Lines   true
    Key               kubernetes-audit
    storage.type      filesystem
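For comparison only: unlike the config in the original report, the [SERVICE] section here sets storage.path but not the other storage keys used there. Those keys, with the values copied from the first config (shown purely as a reference point, not a confirmed fix), would look like:

[SERVICE]
    storage.path              /tmp/flb-storage
    storage.sync              full
    storage.checksum          off
    storage.backlog.mem_limit 1M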
Seems like a relatively big issue if buffers are broken. Are there any updates on this? Is fluentd recommended if we need buffers?