Fluent-bit: [BUG] in_tail plugin isn't continue watch log file after logrotate was ran on k8s logs file.

Created on 12 Sep 2017 · 26Comments · Source: fluent/fluent-bit

I have run fluent-bit for k8s, but after run logrotate, in_tail is not watch log file, which has been rotated.
My fluentbit config:

[SERVICE]
  Flush        5
  Daemon       Off
  Log_Level    debug
  Parsers_File parsers.conf

[INPUT]
  Name          tail
  Tag           kube.*
  Path          /log/*_default_*.log
  DB            /log/kube.db
  Mem_Buf_Limit 5MB
[OUTPUT]
  Name stdout
  Match *

My logrotate config

/var/lib/docker/containers/*/*.log {
    daily
    rotate 1
    copytruncate
    compress
    delaycompress
    missingok
    notifempty
}

I thinks something was wrong after logs file has changed outside container, how I reproduce: I run a fluent-bit containers in docker, mount volume [current_folder]:/log

Append log inside containers:

# Run this command inside fluent-bit container
$ echo '{"log":"hello world","stream":"stdout","time":"2017-08-14T07:16:44.331313977Z"}' >> message_default_inside.log
# This work!

append log outside of container

# Run this command outside of fluent-bit container
$ echo '{"log":"hello world","stream":"stdout","time":"2017-08-14T07:16:44.331313977Z"}' >> message_default_outside.log
# This not work!

Trying start, restart sometime.

bug fixed

Source

duythinht

Most helpful comment

I challenge the similar behaviour. We discovered it's related to logrotate "copytruncate" option.

copytruncate

          Truncate the original log file in place after creating a copy, instead of moving the old log file and optionally creating a new one.  It can be used when some program cannot be told to
          close  its logfile and thus might continue writing (appending) to the previous log file forever.  Note that there is a very small time slice between copying the file and truncating it,
          so some logging data might be lost.  When this option is used, the create option will have no effect, as the old log file stays in place.

Fluent bit should recognize number of lines in file, and if that is < then the previous value, it should re-read the file from scratch + reset it's position (whatever to get un-blocked).

epcim on 5 Feb 2018

👍2

All 26 comments

note that when a third-party tool rotate a file Fluent Bit catch this event (which is a file rename), and what it does is to keep monitoring the rotated file for the next 5 seconds (Rotate_Wait option), after that is not longer monitored.

Now when a file is rotated, likely the original application that create the logs will re-create the file (same name), but in order to let Fluent Bit catch that file creation it needs to re-scan the path, this operation is handled by the Refresh_Interval option, by default it re-scan every 60 seconds, I suggest to keep this value low as 5 seconds.

If the issue mentioned do not address the problem explained above, please provide detailed steps to try to reproduce the problem.

edsiper on 27 Sep 2017

@duythinht is there any pending question/issue on your side ?

edsiper on 13 Oct 2017

I'm still troubleshoot this issue. Sometime tail keep working, sometime it's not working (after logrotate running)

duythinht on 14 Oct 2017

I'm also with same issue.
When my app rotates the file fluent-bit container provides this error plugins/in_tail/tail_file.c:688 errno=2
and the log stop being monitored and fluent-bit container gets frozen.
Trying today to change the refresh-interval as @edsiper mentioned and then i will provide feedback.

hdiass on 25 Oct 2017

Changed the refresh-interval didn't helped.. when file rotated fluent-bit didn't monitored it anymore, needed to restart the fluent container.

hdiass on 26 Oct 2017

@hdiass 0.12.7 has been released, please upgrade to that version and let us know if the issue persists.

edsiper on 2 Nov 2017

Hello @edsiper, i upgraded fluent-bit but even though same issue, when file rotates its read anymore by fluent-bit and stays in loop trying to read the file. i've turned on the debug log level to post here the behaviour, if it helps.

[2017/11/06 22:03:07] [debug] [task] destroy task=0x7fca0023c0e0 (task_id=0)
[2017/11/06 22:03:07] [debug] [dyntag tail.0] 0x7fca0028b120 destroy (tag=tail.0)
[2017/11/06 22:03:34] [debug] [in_tail] rotated: /some/directory/file.log -> /some/directory/file.log
[2017/11/06 22:03:34] [debug] [in_tail] removed /some/directory/file.log
[2017/11/06 22:03:36] [debug] [in_tail] append new file: /some/directory/file.log
[2017/11/06 22:03:36] [debug] [in_tail] add to scan queue /some/directory/file.log, offset=10487070
[2017/11/06 22:03:36] [debug] [in_tail] file=/some/directory/file.log promote to TAIL_EVENT
[2017/11/06 22:03:36] [debug] [in_tail] file=/some/directory/file.log cannot promote, unregistering
[2017/11/06 22:03:41] [debug] [in_tail] append new file: /some/directory/file.log
[2017/11/06 22:03:41] [debug] [in_tail] add to scan queue /some/directory/file.log, offset=10487070
[2017/11/06 22:03:41] [debug] [in_tail] file=/some/directory/file.log promote to TAIL_EVENT
[2017/11/06 22:03:41] [debug] [in_tail] file=/some/directory/file.log cannot promote, unregistering
[2017/11/06 22:03:46] [debug] [in_tail] append new file: /some/directory/file.log
[2017/11/06 22:03:46] [debug] [in_tail] add to scan queue /some/directory/file.log, offset=10487070
[2017/11/06 22:03:46] [debug] [in_tail] file=/some/directory/file.log promote to TAIL_EVENT
[2017/11/06 22:03:46] [debug] [in_tail] file=/some/directory/file.log cannot promote, unregistering

hdiass on 7 Nov 2017

@hdiass what kind of rotation mode are you using, copytruncate ?

[2017/11/06 22:03:34] [debug] [in_tail] rotated: /some/directory/file.log -> /some/directory/file.log

why the rotated file have the same name ?

[2017/11/06 22:03:36] [debug] [in_tail] file=/some/directory/file.log promote to TAIL_EVENT
[2017/11/06 22:03:36] [debug] [in_tail] file=/some/directory/file.log cannot promote, unregistering

that means that a file was promoted for inotify but then it failed, mostly because it was deleted.

Looks like your file are being rotated faster than the refresh_interval, please set a refresh_interval of 5 seconds

edsiper on 7 Nov 2017

@edsiper, the application that i want to monitor handles the log file itself, not using logrotate from the system. What the app does for what i can see is create a "backup" file with the old log file and recreates a new log file with the same name. Ok i'll set the refresh interval for that value and test again,

Thanks a lot,

hdiass on 7 Nov 2017

@edsiper I was checking and i already had refresh interval option set on 5, so that will not help

you can find the the config file i'm using below.

[SERVICE]
    Flush          1
    Daemon         off
    Log_Level      debug
    #Log_File      /var/log/fluentbit.log
    Parsers_File   parsers.conf

[INPUT]
    Name              tail
    Path              /some/directory/file.log
    Parser            json-test
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     10MB
    Buffer_Chunk_Size 8k
    Buffer_Max_Size   32k
    Refresh_Interval  5


[OUTPUT]
    Name   es
    Match  *
    Host   some.elastic.search.com
    Port   1111
    Logstash_Format On
    Retry_Limit False
    Logstash_Prefix some-index-logs
    Index some-index-logs
    Time_Key _@timestamp

also maybe good for you to know, the timestamp between old file last log is really like miliseconds difference from the first timestamp on the new log file.

old log file last line time stamp : "@timestamp":"2017-11-06T22:03:06.198+00:00"
newly created log file first line: "@timestamp":"2017-11-06T22:03:34.274+00:00"

If you can somehow tell me what is the best config here to fluent-bit correcty follow the log after the rotation

hdiass on 7 Nov 2017

I challenge the similar behaviour. We discovered it's related to logrotate "copytruncate" option.

copytruncate

          Truncate the original log file in place after creating a copy, instead of moving the old log file and optionally creating a new one.  It can be used when some program cannot be told to
          close  its logfile and thus might continue writing (appending) to the previous log file forever.  Note that there is a very small time slice between copying the file and truncating it,
          so some logging data might be lost.  When this option is used, the create option will have no effect, as the old log file stays in place.

Fluent bit should recognize number of lines in file, and if that is < then the previous value, it should re-read the file from scratch + reset it's position (whatever to get un-blocked).

epcim on 5 Feb 2018

👍2

Can confirm the issue using Fluent-Bit v0.12.13. sqlite3 db keeps the counter even when the log file itself was logrotated ans reset to 0 bytes.

Only workaround I was able to come up with is not to use the DB option. 😒

rmoriz on 16 Feb 2018

Copytruncate mode is dangerous and should be avoided in this scenario, in general it leads to data loss.

Counting the number of lines is not a solution since that will mean: for every read(2) go to the beginning of the file and count the number of line breaks (\n). The other solution would be to check for the file size on every read using stat(2), again ..it will be performance killer and a constant pain. It's times better to use a different log rotation mode than copytruncate.

Note that also copytruncate is done by a third party tool, so there is high chances that truncation is done when the application is writing data to the file, there is no "sync". A workaround would be to let Docker handle rotation. There is relevant discussion on this topic on Kubernetes repo:

https://github.com/kubernetes/kubernetes/issues/38495

edsiper on 16 Feb 2018

We're using fluent-bit outside of kubernetes/docker. Usually "logrotate" is responsible for logrotation (Debian/Ubuntu). In some cases we're still using "remote_syslog2" which claims to handle this scenario https://github.com/papertrail/remote_syslog2#log-rotation-and-the-behavior-of-remote_syslog - maybe an inspiration?

rmoriz on 16 Feb 2018

@rmoriz thanks for the hints.

Actually the papertrail client does specifically the workaround mentioned above: "stat(2) the file when some 'write' operation was done":

https://github.com/papertrail/remote_syslog2/blob/master/vendor/github.com/papertrail/go-tail/follower/follower.go#L170

Note that the workaround will __only__ work if the tool that generated the original log file did not open the file using O_APPEND mode.

I will try an enhancement

edsiper on 16 Feb 2018

@rmoriz @epcim

I pushed some improvements on GIT master to handle file truncation. Would you please re-build and test ?

edsiper on 17 Feb 2018

👍1

Confirm 0.13 Dev, tested for a while and seems it really works with logrotate and the above options. No freezes yet.

epcim on 22 Feb 2018

Deployed + tested one week. looks good so far.

rmoriz on 1 Mar 2018

thanks everyone for helping on this issue.

Closing as Fixed.

edsiper on 1 Mar 2018

Awesome @edsiper 🏆

Will this be released in the 0.12.x line? It would be very helpful!

sandstrom on 5 Mar 2018

Released :)

http://fluentbit.io/announcements/v0.12.15/

edsiper on 6 Mar 2018

@edsiper Awesome, thanks! 💯

sandstrom on 6 Mar 2018

ping @edsiper

The official documentation here https://fluentbit.io/documentation/0.13/input/tail.html states:

Rotation with truncation (e.g. logrotate's copytruncate mode) is not supported.

Is the documentation outdated or is there still an issue with logrotate and copytruncate?

suckowbiz on 19 Jul 2018

Documentation needs to be updated, in the other side the note the following requirement:

Note that the workaround will only work if the tool that generated the original log file did not open the file using O_APPEND mode.

edsiper on 20 Jul 2018

@edsiper FYI the documentation (even for 1.0: https://docs.fluentbit.io/manual/input/tail) still mentions "Rotation with truncation (e.g. logrotate's copytruncate mode) is not supported."

Also, regarding your remark that it "will only work if the tool that generated the original log file did not open the file using O_APPEND mode": does that mean we can expect logs rotated through logrotate's copytruncate to work or not?