First of all, thank you for creating this project and for diligently working to make it better. I am trying to build a log pipeline for our metrics tracking using application logs. I installed Fluent Bit on the app server; it tails the logs and forwards them (via an upstream) to a Fluentd box for aggregation.
Describe the bug
When Fluent Bit is restarted (or stopped and started), logs are lost. I believe the records still in the memory buffer are dropped when the service stops; when it starts again, forwarding resumes several documents later. By varying the Mem_Buf_Limit setting I was able to increase or decrease the number of logs lost across a restart.
To Reproduce
{ mssg: 'the value is ', i: 1000 }
{ mssg: 'the value is ', i: 1001 }
{ mssg: 'the value is ', i: 1002 }
{ mssg: 'the value is ', i: 1003 }
{ mssg: 'the value is ', i: 1004 }
{ mssg: 'the value is ', i: 1005 }
{ mssg: 'the value is ', i: 1006 }
{ mssg: 'the value is ', i: 1007 }
{ mssg: 'the value is ', i: 1008 }
Note: my script created 10M lines to test various features.
2020-06-18T16:09:30+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 391336 }","td_host":"poc-abhi"}
2020-06-18T16:10:43+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 399626 }","td_host":"poc-abhi"}
In this case 8290 lines were lost; the number grows as Mem_Buf_Limit is increased.
Expected behavior
It would be great if logs were not dropped during a restart. (Perhaps the offset stored in the DB should track the position of the last log that was actually sent out, not just the last one read into the engine.) On shutdown, records still in the buffer should be flushed to the filesystem, and on startup those files should be processed before new logs are read.
Your Environment
[SERVICE]
    Flush                     1
    Daemon                    Off
    Log_Level                 trace
    Log_File                  /var/log/td_bit.log
    Parsers_File              parsers.conf
    Plugins_File              plugins.conf
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_Port                 2020
    storage.path              /var/log/tdbit_storage/
    storage.sync              full
    storage.checksum          off
    storage.backlog.mem_limit 1M

[INPUT]
    Name              tail
    Tag               abhi_event_manager
    Path              /var/log/test_logs/test15.log
    Db                /var/log/td.db
    Mem_Buf_Limit     1M
    Parser            json
    Buffer_Chunk_Size 1k
    Buffer_Max_Size   1k
    storage.type      memory

[FILTER]
    Name   record_modifier
    Match  *
    Record td_host poc-abhi

[OUTPUT]
    Name          forward
    Match         abhi_event_manager
    Upstream      upstream.conf
    Self_Hostname poc-abhi
    Retry_Limit   False
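The upstream.conf referenced by the forward output is not included here. As a rough sketch only, such a file contains an [UPSTREAM] section plus one [NODE] entry per Fluentd endpoint; the node name, host, and port below are hypothetical placeholders, not the values actually used:

[UPSTREAM]
    name forward-balancing

[NODE]
    # hypothetical endpoint; replace with the real Fluentd host and port
    name fluentd-node-1
    host 10.0.0.10
    port 24224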
On the Fluentd side, for testing, I am just writing the logs to a file (config below):
<match abhi_event_manager>
  @type file
  path /var/log/abhi_evm
</match>
The goal is to forward logs with Fluent Bit from the application servers to a centralized Fluentd, where we aggregate the log events for metrics reporting. Losing logs therefore leads to inaccurate metrics.
Thanks for reaching out.
Note that the storage type in your tail input section is memory. If you want full reliability across restarts, you have to set storage.type to filesystem.
When filesystem storage is enabled it acts as a backup; if it is not enabled, you may run into exactly what you have described.
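For illustration only (not a verified fix), the suggested change expressed against the configuration posted above would look roughly like this; the paths are the ones already in that config:

[SERVICE]
    # chunks are persisted under this path when filesystem buffering is used
    storage.path              /var/log/tdbit_storage/
    storage.sync              full
    storage.backlog.mem_limit 1M

[INPUT]
    Name         tail
    Tag          abhi_event_manager
    Path         /var/log/test_logs/test15.log
    Db           /var/log/td.db
    Parser       json
    # buffer chunks on the filesystem instead of only in memory
    storage.type filesystem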
I tried with filesystem as well; I lost the same number of logs.
Can you share your script so I can try to reproduce locally?
test.log.gz
Sure, here you go. Also, I appreciate your super quick response. (I zipped the input log file because it was fairly large; hopefully that is not a problem.)
Let me know if you need any more info.
Hi @edsiper, did you have any luck with the issue?
Also, can you please remove the labels "fixed" and "not-an-issue"?
I also ran into this problem. I hope a configuration can be provided that does not lose logs when restarting. Thank you.
Hi, I'm experiencing the same issue. Please let me know if there is anything I can provide. Here is my configuration; the paths /tail-db and /tmp/flb-storage are volumes mounted from the host OS. My test scenario makes the Fluentd collector unreachable (via an AWS security group that restricts egress from the forwarder) and then restarts the Fluent Bit forwarder; any unflushed logs are lost.
[SERVICE]
    Flush           1
    Daemon          Off
    Log_Level       info
    Parsers_File    parsers.conf
    Parsers_File    parsers_custom.conf
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_Port       2020
    storage.path    /tmp/flb-storage
    storage.metrics on

[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Parser           docker
    Tag              kubernetes.*
    Refresh_Interval 5
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    DB               /tail-db/tail-containers-state.db
    DB.Sync          Normal
    storage.type     filesystem

[INPUT]
    Name              tail
    Path              /var/log/kube-apiserver-audit.log
    Parser            docker
    DB                /var/log/audit.db
    Tag               log.lab.kubernetes.master.audit*
    Refresh_Interval  5
    Mem_Buf_Limit     35MB
    Buffer_Chunk_Size 2MB
    Buffer_Max_Size   10MB
    Skip_Long_Lines   true
    Key               kubernetes-audit
    storage.type      filesystem
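For comparison only: unlike the config in the original report, the [SERVICE] section here sets storage.path but not the other storage keys used there. Those keys, with the values copied from the first config (shown purely as a reference point, not a confirmed fix), would look like:

[SERVICE]
    storage.path              /tmp/flb-storage
    storage.sync              full
    storage.checksum          off
    storage.backlog.mem_limit 1M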
Seems like a relatively big issue if buffers are broken. Are there any updates on this? Is fluentd recommended if we need buffers?