fluent-bit tail plugin stops working on logs

Created on 30 Jan 2019 · 7 comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
I think this might be a problem with the memory buffer handling (the pause/resume logic), or with the epoll system.

To Reproduce
I cannot reproduce it reliably, but the last lines of the log are always:

[2019/01/29 18:01:33] [debug] [input] tail.0 resume (mem buf overlimit)
ce71683ae407da0c0fe2f56f.log read=32541 lines=130
[2019/01/29 18:01:33] [debug] [in_tail] file=/var/log/containers/manager-c944c6d4-hrthj_kafka_kafka-manager-c5e797fad6b42e7891c23436da697f03e0c33f1114474428d61b3c2e7cfdd9d6.log read=32625 lines=169
[2019/01/29 18:01:33] [debug] [in_tail] file=/var/log/containers/metallb-speaker-pdlhx_base_speaker-582cd5bd286b1280581d76a7a7caf12cbc16321d347e289fedd4976151fd5aa9.log read=32615 lines=131
[2019/01/29 18:01:33] [debug] [input] tail.0 paused (mem buf overlimit)
[2019/01/29 18:01:33] [debug] [in_tail] file=/var/log/containers/nginx-ingress-controller-6b8d6586c6-vqpjp_base_nginx-ingress-controller-c7ea9759d5eebcdb48f197d17bc5684b119529218024dccad4536432581711cb.log read=32593 lines=120
[2019/01/29 18:01:33] [debug] [task] destroy task=0x7fddf5e59340 (task_id=35)
[2019/01/29 18:01:33] [debug] [input] tail.0 resume (mem buf overlimit)

This is the end of the strace output on the process; it just stops there:

54248 08:04:51.360258 open("/var/log", O_RDONLY|O_CLOEXEC) = 140
54248 08:04:51.360316 fsync(140)        = 0
54248 08:04:51.372387 close(140)        = 0
54248 08:04:51.372446 lseek(137, 0, SEEK_SET) = 0
54248 08:04:51.372489 write(137, "\331\325\5\371 \241c\327\0\0\0\2", 12) = 12
54248 08:04:51.372559 fsync(137)        = 0
54248 08:04:51.377749 lseek(19, 0, SEEK_SET) = 0
54248 08:04:51.377812 write(19, "SQLite format 3\0\20\0\1\1\0@  \0\0<\356\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
54248 08:04:51.378412 lseek(19, 20480, SEEK_SET) = 20480
54248 08:04:51.378465 write(19, "\r\7\215\0\27\0\214\0\17d\16\314\16/\r\226\f\272\v\370\vD\n\220\t\342\1\30\t!\10\234\10\30\0\214\1\314\6\360\6Q\5\254\5\17\4^\3\273\2f\3\17\0\0\0\0\0\0\0\0\0\0\0\0\0\0
54248 08:04:51.379073 fsync(19)         = 0
54248 08:04:51.394646 close(137)        = 0
54248 08:04:51.394715 unlink("/var/log/flb_kube.db-journal") = 0
54248 08:04:51.394957 fcntl(19, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741826, len=510}) = 0
54248 08:04:51.395090 fcntl(19, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=1073741824, len=2}) = 0
54248 08:04:51.395302 fcntl(19, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
54248 08:04:51.395437 epoll_wait(12, {{EPOLLIN, {u32=1099070784, u64=140699932720448}}, {EPOLLIN, {u32=1081812880, u64=140699915462544}}, {EPOLLIN, {u32=1099070936, u64=140699932720600}}}, 256, -1) = 
54248 08:04:51.395501 read(13, "\0\0\10\20\2\0\0\0", 8) = 8
54248 08:04:51.395558 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fff20eba1c0) = -1 ENOTTY (Inappropriate ioctl for device)
54248 08:04:51.395621 write(7, "N\0\0\0\0\0\0\0[2019/01/30 08:04:51] [debug] [task] destroy task=0x7ff741859580 (task_id=32)\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
54280 08:04:51.395822 <... epoll_wait resumed> {{EPOLLIN, {u32=1099534336, u64=140699933184000}}}, 16, -1) = 1
54248 08:04:51.395852 <... write resumed> ) = 1024
54280 08:04:51.395890 read(6,  <unfinished ...>
54248 08:04:51.395911 epoll_ctl(12, EPOLL_CTL_ADD, 15, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1099248848, u64=140699932898512}}) = 0
54280 08:04:51.395954 <... read resumed> "N\0\0\0\0\0\0\0[2019/01/30 08:04:51] [debug] [task] destroy task=0x7ff741859580 (task_id=32)\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
54248 08:04:51.396100 epoll_ctl(12, EPOLL_CTL_ADD, 17, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1099256528, u64=140699932906192}}) = 0
54280 08:04:51.396140 write(2, "[2019/01/30 08:04:51] [debug] [task] destroy task=0x7ff741859580 (task_id=32)\n", 78 <unfinished ...>
54248 08:04:51.396170 epoll_ctl(12, EPOLL_CTL_ADD, 20, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1099245008, u64=140699932894672}}) = 0
54280 08:04:51.396212 <... write resumed> ) = 78
54248 08:04:51.396230 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS <unfinished ...>
54280 08:04:51.396266 epoll_wait(3,  <unfinished ...>
54248 08:04:51.396284 <... ioctl resumed> , 0x7fff20eba190) = -1 ENOTTY (Inappropriate ioctl for device)
54248 08:04:51.396320 write(7, "H\0\0\0\0\0\0\0[2019/01/30 08:04:51] [debug] [input] tail.0 resume (mem buf overlimit)\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
54280 08:04:51.396506 <... epoll_wait resumed> {{EPOLLIN, {u32=1099534336, u64=140699933184000}}}, 16, -1) = 1
54248 08:04:51.396533 <... write resumed> ) = 1024
54280 08:04:51.396552 read(6,  <unfinished ...>
54248 08:04:51.396569 read(136,  <unfinished ...>
54280 08:04:51.396587 <... read resumed> "H\0\0\0\0\0\0\0[2019/01/30 08:04:51] [debug] [input] tail.0 resume (mem buf overlimit)\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
54248 08:04:51.396732 <... read resumed> "\1\0\0\0\0\0\0\0", 8) = 8
54280 08:04:51.396759 write(2, "[2019/01/30 08:04:51] [debug] [input] tail.0 resume (mem buf overlimit)\n", 72 <unfinished ...>
54248 08:04:51.396787 write(145, "\206\0\0\0\0\0\0\0", 8 <unfinished ...>
54280 08:04:51.396834 <... write resumed> ) = 72
54248 08:04:51.396856 <... write resumed> ) = 8
54280 08:04:51.396877 epoll_wait(3,  <unfinished ...>
54248 08:04:51.396896 write(145, "\203\245input\201\246tail.0\202\247records\316\0\1B\363\245bytes\316\1J\321\207\246filter\200\246output\201\251forward.0\205\254proc_records\316\0\1:U\252proc_bytes\3

Expected behavior
Logs should be shipped

Your Environment
Configuration:

kubernetes cluster: v1.10.2 on coreos
Kernel: Linux k8s-node-4 4.14.48-coreos-r2
Docker container: fluent/fluent-bit:1.0.3

The host's /var/lib/docker/containers is mounted at /var/lib/docker/containers
The host's /var/log is mounted at /var/log

[INPUT]
    Name              tail
    Tag               kube.<namespace_name>.<pod_name>.<container_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     10MB
    Skip_Long_Lines   On
    Refresh_Interval  10
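As a sanity check on the Tag_Regex above, the snippet below applies it to a file name in the style of the log excerpt. This is an illustrative sketch, not part of Fluent Bit: the Tag_Regex uses Ruby/Oniguruma `(?<name>...)` group syntax, so it is translated to Python's `(?P<name>...)` form, and the file name is a made-up example modeled on the logs.

```python
import re

# The Tag_Regex from the [INPUT] section above (Ruby/Oniguruma named groups).
ONIG = (r"(?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?"
        r"(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)"
        r"_(?<namespace_name>[^_]+)_(?<container_name>.+)-")

# Python spells named groups (?P<name>...); there are no lookbehinds here,
# so a plain string replace is a safe translation.
pattern = re.compile(ONIG.replace("(?<", "(?P<"))

# Hypothetical container log file name, modeled on the log excerpt above.
fname = ("metallb-speaker-pdlhx_base_"
         "speaker-582cd5bd286b1280581d76a7a7caf12cbc16321d347e289f"
         "edd4976151fd5aa9.log")

m = pattern.match(fname)
print(m.group("pod_name"))        # metallb-speaker-pdlhx
print(m.group("namespace_name"))  # base
print(m.group("container_name"))  # speaker
```

Note that the trailing `-` in the pattern makes `container_name` stop at the last hyphen, stripping the container ID hash from the tag.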

I disabled the Kubernetes filter, but it did not help.

Additional context
I am currently evaluating Fluent Bit; I may switch back to an older (0.x) version.

Labels: fixed, question

Most helpful comment

I am also facing the same issue: it ships logs for around 24 hours and then stops, and I have to manually restart the Fluent Bit pods on EKS. Is there any update on this?

All 7 comments

Hi @gschaden, I'm having the same issue. Did you ever resolve this?

For me, making the tail memory buffer much bigger (200MB instead of 20MB) seems to have fixed it. Of course, at that size it uses much more memory than many of the applications it is collecting logs from ...
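For reference, the change this comment describes amounts to a single line in the tail input; the other keys here are copied from the reporter's configuration above:

```ini
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     200MB
    Skip_Long_Lines   On
```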

Fluent Bit respects the configuration you impose: if you set a 20MB buffer limit, it will pause data ingestion whenever that limit is reached and resume once it has flushed enough data.

A workaround is to enable file system buffering, so the input is not paused; overflowing data is written to the file system instead.
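A sketch of that workaround, using Fluent Bit's filesystem storage options (`storage.path` in `[SERVICE]`, `storage.type` per input); exact paths and limits here are illustrative, not prescribed by this thread:

```ini
[SERVICE]
    # Directory where chunks are persisted when memory fills up
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    # Cap on memory used to replay backlogged chunks at startup
    storage.backlog.mem_limit 5M

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     10MB
    # Spill to disk instead of pausing ingestion at the memory limit
    storage.type      filesystem
```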

Facing the same problem; raising Mem_Buf_Limit to 32MB just makes the pauses less frequent.

Ran into this too

I am also facing the same issue: it ships logs for around 24 hours and then stops, and I have to manually restart the Fluent Bit pods on EKS. Is there any update on this?

Any update on this? For me, Fluent Bit stops sending logs after 30 minutes. Mem_Buf_Limit is set to 5MB.
