Describe the bug
The in_tail plugin randomly fails with "too many open files" (errno=24) unless it is switched from inotify-based to stat-based tailing.
To Reproduce
```
[TIMESTAMP] [ info] [storage] initializing...
[TIMESTAMP] [ info] [storage] in-memory
[TIMESTAMP] [ info] [storage] normal synchronization mode, checksum disabled
[TIMESTAMP] [ info] [engine] started (pid=14)
[TIMESTAMP] [error] [plugins/in_tail/tail_fs.c:168 errno=24] Too many open files
[TIMESTAMP] [error] Failed initialize input tail.2
[TIMESTAMP] [error] [plugins/in_tail/tail_fs.c:168 errno=24] Too many open files
[TIMESTAMP] [error] Failed initialize input tail.3
[TIMESTAMP] [error] [plugins/in_tail/tail_fs.c:168 errno=24] Too many open files
[TIMESTAMP] [error] Failed initialize input tail.4
[TIMESTAMP] [error] [plugins/in_tail/tail_fs.c:168 errno=24] Too many open files
[TIMESTAMP] [error] Failed initialize input tail.5
[TIMESTAMP] [error] [plugins/in_tail/tail_fs.c:168 errno=24] Too many open files
[TIMESTAMP] [error] Failed initialize input tail.6
[TIMESTAMP] [error] [plugins/in_tail/tail_fs.c:168 errno=24] Too many open files
[TIMESTAMP] [error] Failed initialize input tail.7
[TIMESTAMP] [error] [plugins/in_tail/tail_fs.c:168 errno=24] Too many open files
[TIMESTAMP] [error] Failed initialize input tail.8
[TIMESTAMP] [error] [plugins/in_tail/tail_fs.c:168 errno=24] Too many open files
[TIMESTAMP] [error] Failed initialize input tail.9
[TIMESTAMP] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
```
The file descriptor limits:
```
/ # ulimit -S -n
1048576
/ # ulimit -H -n
1048576
```
and the system max:
```
/ # cat /proc/sys/fs/file-max
13181250
```
seem fine.
But looking at https://github.com/fluent/fluent-bit/blob/v1.3.2/plugins/in_tail/tail_fs.c, then at https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_fs_inotify.c#L171 and https://linux.die.net/man/2/inotify_init1 - since errno=24 in that context means EMFILE - this suggests an inotify-related issue rather than a file-descriptor one. Forcing cmake to use the _-DFLB_INOTIFY=Off_ flag, which disables inotify and uses stat-based tailing instead, is a working workaround for now. It would also be beneficial to have an "inotify-disabled flavour" of the Fluent Bit Docker images available.
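As a sanity check, here is a small sketch (Linux only, assumes a readable /proc; the `anon_inode` symlink naming is a kernel implementation detail) that compares the per-user inotify instance limit against the instances currently open system-wide:

```shell
#!/bin/sh
# inotify instances are capped separately from file descriptors:
# inotify_init1() returns EMFILE (errno=24) once the per-user limit
# fs.inotify.max_user_instances is reached, even with plenty of fds left.
limit=$(cat /proc/sys/fs/inotify/max_user_instances)
echo "max_user_instances=$limit"

# Each open inotify instance appears as an anon_inode symlink under
# /proc/<pid>/fd; count them across all visible processes.
in_use=$(find /proc/[0-9]*/fd -type l -lname 'anon_inode:*inotify*' 2>/dev/null | wc -l)
echo "inotify_instances_in_use=$in_use"
```

If `inotify_instances_in_use` is close to the limit, EMFILE from inotify_init1() is the expected failure even though ulimit and fs.file-max look huge.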
Expected behavior
No errors when using in_tail plugin.
Screenshots
n/a
Your Environment
Additional context
Logs are not tailed, so no logs are processed by Fluent Bit at all if this happens at a given time.
Might be connected with #1712 and/or #1358
PR #1778 is ready with a workaround providing an additional flavour of the Fluent Bit Docker images (with inotify disabled as well as enabled).
@edsiper You might be interested in the PR provided for that in #1778.
I think the ideal solution is to provide an option to the in_tail plugin to specify the notification mechanism, either inotify-based (default) or stat-based.
I will add this issue to our 1.4 milestone.
@edsiper If it is possible to have it just in the Fluent Bit config file, that would be even better!
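For readers landing here later: a sketch of what such a config-file switch could look like in an in_tail section. The `Inotify_Watcher` option name is taken from later in_tail documentation - verify it exists in your release before relying on it, and the path below is illustrative:

```
[INPUT]
    Name            tail
    Path            /var/log/app/*.log
    # false = use the stat-based watcher instead of inotify
    Inotify_Watcher false
```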
I'm facing a similar issue in the systemd plugin:
Is it related? I've looked at ulimit output, /proc/sys/fs/file-nr and it all seems good...
```
Dec 5 15:09:52 master systemd[1]: Started TD Agent Bit.
Dec 5 15:09:52 master td-agent-bit[3488]: Fluent Bit v1.3.3
Dec 5 15:09:52 master td-agent-bit[3488]: Copyright (C) Treasure Data
Dec 5 15:09:52 master td-agent-bit[3488]: [2019/12/05 15:09:52] [ info] [storage] initializing...
Dec 5 15:09:52 master td-agent-bit[3488]: [2019/12/05 15:09:52] [ info] [storage] in-memory
Dec 5 15:09:52 master td-agent-bit[3488]: [2019/12/05 15:09:52] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
Dec 5 15:09:52 master td-agent-bit[3488]: [2019/12/05 15:09:52] [ info] [engine] started (pid=3488)
Dec 5 15:09:52 master td-agent-bit[3488]: [2019/12/05 15:09:52] [error] [plugins/in_systemd/systemd_config.c:60 errno=24] Too many open files
```
@aderuwe It might be more related to the inotify mechanism, as by default (e.g. when using the default Docker image) that mechanism is used for tailing files in the in_tail plugin:
https://linux.die.net/man/2/inotify_init1
Instead of looking at ulimit and available files, you might also check the more specific inotify-related settings, such as:
/proc/sys/fs/inotify/max_user_watches
/proc/sys/fs/inotify/max_user_instances
Maybe that works for you.
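A minimal sketch of checking those values, reading /proc directly so it works even without the sysctl binary (Linux only):

```shell
#!/bin/sh
# Per-user inotify limits; defaults such as 128 instances are easy to
# exhaust with many tail inputs or many watched files.
echo "max_user_instances=$(cat /proc/sys/fs/inotify/max_user_instances)"
echo "max_user_watches=$(cat /proc/sys/fs/inotify/max_user_watches)"
# Raising them at runtime requires root, e.g.:
#   sysctl -w fs.inotify.max_user_instances=1024
```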
In my case, in the end I rebuilt the Docker image with the _-DFLB_INOTIFY=Off_ option, so that instead of the more performant _inotify_ mechanism the plugin uses the older _stat_ mechanism for tailing files. That works for me as a workaround for now - see https://github.com/fluent/fluent-bit/pull/1778 - although it probably has problems when used with symlinks.
The final solution, as planned by @edsiper in this ticket, is a configuration option that should land in the upcoming https://github.com/fluent/fluent-bit/milestone/7 release.
Following up, quick question:
This issue was expected to be fixed in v1.4 according to Milestone 7: https://github.com/fluent/fluent-bit/milestone/7
Version 1.4.1 is out now but it looks like the fix is not included: https://fluentbit.io/announcements/v1.4.1/
Is there an estimated date when this issue might be fixed?
I'm currently having to kill Fluentbit daily to avoid the "disk space" issue, but wanted to make sure if I should wait for this to be closed.
Thx!
I hadn't noticed that 1.4 (and 1.4.1) is already out - thanks @Helmut-Onna! @edsiper does that mean it will land in some later 1.4.x patch version instead?
Instead of looking at ulimit and available files, you might also check more specific inotify related settings such as:
/proc/sys/fs/inotify/max_user_watches /proc/sys/fs/inotify/max_user_instances
Great hint! In my case /proc/sys/fs/inotify/max_user_instances was only 128. The problem seems to be solved by setting the sysctl fs.inotify.max_user_instances to 1500.
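To make a change like that survive reboots, the usual pattern is a sysctl drop-in file, then `sysctl --system` (or a reboot) to reload it. The file name below is illustrative:

```
# /etc/sysctl.d/90-inotify.conf (hypothetical file name)
fs.inotify.max_user_instances = 1500
```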
Is there any ETA for this fix? It seems it is still not in 1.4.5 ...
We deployed Fluent Bit as a DaemonSet in our Kubernetes cluster and hit the problem after two days of running.
Not only did fluent-bit report "too many open files": all pods on that node reported the same error, and even system processes outside the containers failed.
After killing fluent-bit on the node (using a nodeSelector to exclude it from the affected node), the node came back to a normal state and the error disappeared. For now we are afraid to run it, because this error can take down the whole node.
More details: we have a file limit of 5242880, and even that value was exhausted within two days.
I am doing some re-work on in_tail internals, we will defer this for v1.6
When do you expect this to be released? I see the workaround PR (the Dockerfile one) is closed, but everyone who tries to run 1.4 will lose time chasing this issue.
I am facing this issue with v1.5.x too, and the latest documentation (presumably for 1.6.x) does not mention an option to use stat instead of inotify for in_tail, so I would like to bump this issue.
/cc @erain