Fluent-bit: Fluent-bit in K8s stops sending logs to elastic after a few hours

Created on 19 Jun 2018 · 19 comments · Source: fluent/fluent-bit

Fluent Bit stops sending logs to Elasticsearch after several hours without telling me why. The Fluent Bit Docker container is not crashing; it just stops doing what it should do. What could be the reason? I can't figure out why. It happens in each of the 3 clusters.

Situation: 3x (14 nodes k8s cluster (including 3 masters))
Version fluent-bit: 0.13.3
Kubernetes version: 1.8.4
Node OS: Red Hat 7.4 Maipo

Resource limits k8s:
"resources": { "limits": { "cpu": "100m", "memory": "1Gi" }, "requests": { "cpu": "100m", "memory": "512Mi" } }

Configuration:

[SERVICE]
Flush 1
Daemon Off
Log_Level debug

[INPUT]
Buffer_Chunk_Size 1MB
Buffer_Max_Size 25MB
DB /var/log/containers/fluent-bit.db
Exclude_Path *kube-system*.log
Mem_Buf_Limit 25MB
Name tail
Parser json
Path /var/log/containers/*.log
Refresh_Interval 5
Skip_Long_Lines On
Tag kube.services.*

[FILTER]
K8S-Logging.Parser On
K8S-Logging.exclude True
Kube_URL https://${KUBERNETES_SERVICE_HOST}:443
Match kube.*
Merge_JSON_Log On
Name kubernetes
tls.verify Off

[PARSER]
Decode_Field_As escaped log
Format json
Name json
Time_Format %Y-%m-%dT%H:%M:%S %z
Time_Key time

[OUTPUT]
Host xxx.xxx.xxx.xxx
Include_Tag_Key true
Logstash_DateFormat %G.%V
Logstash_Format On
Logstash_Prefix k8s-services-tst
Match kube.services.*
Name es
Port 9200
Retry_Limit False

Labels: bug, fixed

Most helpful comment

we have a similar issue where fluent-bit stops tracking a few applications after a few hours. we use fluent-bit in eks, to push logs to stackdriver. on further investigation this is what i found out.

our setup (fluent-bit configs attached):

  • fluent-bit is set up to tail all files matching /var/log/containers/*.log. kubernetes takes care of setting up symlinks in /var/log/containers/ and logs in /var/lib/docker/containers/ (through /var/log/pods/)
  • by default, aws eks sets up docker with this log-opts setting {"max-size":"10m","max-file":"10"} (in /etc/docker/daemon.json)

issues:

  1. gaps in container logs: when logs for a container rotate too fast (either for fluent-bit to keep up or for kubernetes to update symlinks), fluent-bit is still tailing one of the rotated files (*.{1..9}). by the time fluent-bit gets around to using the current log file symlink for that container in /var/log/containers/, some rotated files in between are never tailed (because /var/log/containers only has symlinks pointing to the current log file for each container). i think this can also be confirmed by examining the sqlite db, where it keeps a rotated flag for every log file. in cases where the rotated file is .2 or higher, i think logs in between were missed.
  2. container logs stop after few hours: when i inspected the file handles held by fluent-bit for a container where it stopped tailing after a few hours, i noticed that it was holding on to a file handle for the last rotated file (.9) of that container. i suspect docker's logging system deleted the .9 file before fluent-bit was able to finish tailing it and somehow fluent-bit is unable to recover from this for that particular container (i don't know if it's a corner case because of inode reuse).
# sudo lsof -nP -a -p $(pgrep -f fluent-bi[t]) | grep -F b6a4c1b82014
fluent-bi 16157 root   54r      REG  259,1 10000149  814921332 /var/lib/docker/containers/b6a4c1b82014891feb521f973c2823f3d291595b5a28c04d437093425e61ba7a/b6a4c1b82014891feb521f973c2823f3d291595b5a28c04d437093425e61ba7a-json.log.9 (deleted)
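The sqlite check described in point 1 can be scripted. A minimal sketch in Python, assuming the tail plugin's schema (a table named `in_tail_files` with `name`, `offset`, `inode`, and `rotated` columns; verify against your fluent-bit version). The in-memory toy DB below stands in for a copy of the real `fluent-bit.db`, which fluent-bit keeps open while running:

```python
import sqlite3

# Toy reconstruction of the tail plugin's on-disk state (assumption: the real
# fluent-bit.db uses a table named in_tail_files with name/offset/inode/rotated
# columns). Replace :memory: with the path of a *copy* of the real DB and drop
# the CREATE/INSERT setup when inspecting a live node.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE in_tail_files ("
    "  id INTEGER PRIMARY KEY,"
    "  name TEXT,"
    "  offset INTEGER,"
    "  inode INTEGER,"
    "  rotated INTEGER DEFAULT 0)"
)
db.execute(
    "INSERT INTO in_tail_files (name, offset, inode, rotated) "
    "VALUES ('/var/lib/docker/containers/abc/abc-json.log.2', 12345, 99, 1)"
)

# Files flagged rotated=1 were still being read after rotation; a suffix of
# .2 or higher hints that the files in between were never tailed.
rows = db.execute(
    "SELECT name, rotated FROM in_tail_files WHERE rotated = 1"
).fetchall()
for name, rotated in rows:
    print(name, rotated)
```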

workaround (planned):

  • change eks docker log-opts to {"max-size":"100m","max-file":"2"} to reduce the log file rotation rate.
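Concretely, the planned change amounts to a /etc/docker/daemon.json along these lines (a fragment only; this assumes the default json-file log driver, and any other existing keys in the file stay as they are):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "2"
  }
}
```

The Docker daemon needs a restart for the change to apply, and existing containers keep their old log options until recreated.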

fluent-bit-configmap.txt
fluent-bit-ds.txt

All 19 comments

@marckamerbeek have you checked the logs from fluent-bit and Elasticsearch?

Yes... nothing in the logs. Nothing in Elastic. I'm starting to build my own Fluent Bit with extra debug logging, to see if that sheds some light.

@marckamerbeek how's your investigation going?

We've encountered the same thing with our AWS Elasticsearch cluster: fluent-bit stops randomly from one node to the other. We recently upgraded to 0.13.x to fix a memory usage issue that kept restarting the fluent-bit pod, but now we just keep having to restart them manually almost daily.

Hi @jpbelanger, still no clue. There is nothing in the logs. The fluent-bit pods don't crash, they just stop logging without a warning. Today I'm starting to make my own build of fluent-bit with some extra debug logging. My time is limited, but I have a cronjob running every 2 hours to restart fluent-bit on 4 clusters, so that takes away the hurry for me. I'll keep you posted.
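For anyone wanting the same stopgap, a crontab entry along these lines does it (the namespace and label selector here are examples; adjust them to match your DaemonSet):

```shell
# delete the fluent-bit pods every 2 hours; the DaemonSet recreates them
0 */2 * * * kubectl --namespace logging delete pods --selector k8s-app=fluent-bit
```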

@marckamerbeek thanks for the update. We might just roll back to the previous release we were running (0.12.11). It was using an insane amount of memory, but at least it was sending logs and restarting by itself when it reached the kube memory limit. I wonder why more people aren't having the same problem.

That's not an option for me since version 0.13.x contains some critical functionality for the kubernetes plugin. @edsiper, could you please shine a light on this problem? I have no clue where to start. I'm able to do some work on fluent-bit now and build my own release, but my C skills are still not that good (but learning... :-))

Would you please provide some metrics snapshots using Prometheus? Meaning: when you don't see any messages ingested into Elasticsearch, what do the Fluent Bit metrics look like?

We use Prometheus in our cluster, although not in combination with fluent-bit. I was not aware of that possibility. I'll instrument fluent-bit right away with Prometheus. Give me a sec...
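For anyone else wanting to do the same: Fluent Bit 0.13 and later ship a built-in monitoring HTTP server that exposes Prometheus-format metrics. Enabling it is a matter of extending the existing [SERVICE] section:

```ini
[SERVICE]
    Flush        1
    Daemon       Off
    Log_Level    debug
    # expose the monitoring endpoint for Prometheus to scrape
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020
```

Metrics (input records, output retries, errors) are then available at http://<pod-ip>:2020/api/v1/metrics/prometheus.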

Sorry, it took a while. My Grafana skills are not optimal. This is the fluent-bit data in Grafana:
[Grafana dashboard screenshot]

@edsiper Hi, I'm still restarting the fluent-bit pods every two hours. I provided the metrics. Do you see some unusual things?

I see a high number of retries, which could mean problems reaching Elasticsearch, or Elasticsearch errors. Do you see any anomaly in the Elastic logs?


The last time I looked I saw some parsing errors, but I got rid of 90% of them, I think. I'll check it again; I'll need to wait to get access to the Elastic cluster again. I'll come back on this.

This sounds very similar to an issue I'm currently experiencing with fluent-bit 0.14.7.

We're running in AWS, on Elastic Beanstalk (shudder), starting up a fluent-bit container alongside our apps and using it as a Docker logging driver. It works for most of the day, then suddenly, around the same time each day (not even when it's any busier than usual), it just completely stops logging.

We currently send our logs to AWS Elasticsearch, and so do many of our other applications, yet only one application is having problems, and we have other applications that are noisier than the one that occasionally stops logging.

One potential reason I've seen is Docker log rotation. The first log file contains 0 errors; six lines into a new log file, I start seeing this error:

[2018/11/21 18:19:32] [  Error] timerfd_settime: Invalid argument, errno=22 at /tmp/fluent-bit-0.14.7/lib/monkey/mk_core/mk_event_epoll.c:190

This happens continuously from the point where the log rotates until the instance is rebooted or terminated. I'm a little bit tempted at the moment to explore using fluentd instead, but the reason we moved from fluentd to fluent-bit was fluentd's slow startup time causing issues with Elastic Beanstalk starting our apps (i.e. Docker complains the logging driver is not ready and the app just reboots continuously).

We had the same issue on our 1.12.2 cluster. In our case it was DNS related, so the pods were not able to look up the ES cluster, seeing the same error. Not saying that it is the same issue in your case, but there are some general DNS issues at the moment, see https://github.com/kubernetes/kubernetes/issues/56903

[2018/11/21 18:19:32] [  Error] timerfd_settime: Invalid argument, errno=22 at /tmp/fluent-bit-0.14.7/lib/monkey/mk_core/mk_event_epoll.c:190


container logs stop after few hours: when i inspected the file handles held by fluent-bit for a container where it stopped tailing after a few hours, i noticed that it was holding on to a file handle for the last rotated file (.9) of that container. i suspect docker's logging system deleted the .9 file before fluent-bit was able to finish tailing it and somehow fluent-bit is unable to recover from this for that particular container (i don't know if it's a corner case because of inode reuse).

if it helps to investigate, it does seem to recover at some point many hours later (although i missed to capture the exact state of things at the point when it recovered).


I have the same problem

FYI: I've pushed a fix for the root cause of the problem, thanks to everybody who helped to troubleshoot the problem.

ref: https://github.com/fluent/fluent-bit/commit/1ef74fb0674490f665c3139a406896453f4d4a53

