Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
Version of Helm and Kubernetes:
Kubernetes v1.9.6
Helm v2.9.1
Which chart:
stable/fluentd-elasticsearch
What happened:
The Fluentd pod stopped sending logs to Elasticsearch and stopped logging itself. It also stopped updating its file buffer. Prometheus reported almost no CPU usage for the pod. This happened shortly after the pod started up; a file buffer was already left over from the previous pod.
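To confirm the stall, something along these lines can be used (the pod name is hypothetical, and the buffer path assumes the chart's default under /var/log/fluentd-buffers):
$ kubectl top pod -n kube-system fluentd-elasticsearch-abcde
$ kubectl exec -n kube-system fluentd-elasticsearch-abcde -- ls -l /var/log/fluentd-buffers/
If the collector has really stalled, CPU is near zero and the buffer chunk mtimes stop advancing.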
What you expected to happen:
I expected the liveness probe to fail and the container to be restarted.
How to reproduce it (as minimally and precisely as possible):
Run Fluentd with file buffers over 100 MB, and delete the pod so a new one starts up.
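A rough sketch of the reproduction (the pod name is hypothetical; this assumes the buffer lives under the hostPath-mounted /var/log, as in the upstream addon config, so it survives pod deletion):
# Confirm the file buffer has grown past ~100 MB:
$ kubectl exec -n kube-system fluentd-elasticsearch-abcde -- du -sh /var/log/fluentd-buffers/
# Delete the pod; the replacement starts up with the leftover buffer:
$ kubectl delete pod -n kube-system fluentd-elasticsearch-abcde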
Anything else we need to know:
We found that the liveness probe hangs, or takes a very long time to complete.
We found this in the kubelet logs:
1fa2fc59b7030b872da3dff852b5947dd0270452d8e3e in container c8bbe9de50f77d111420addce3805d20cb03d6434944ac28f30b71d892cef876 terminated but process still running!
Running ps shows multiple bash commands still running in the container.
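A hedged way to see the piled-up probe invocations (the pod name is hypothetical, and the probe command placeholder stands for whatever the chart's exec livenessProbe actually runs):
# Count the bash processes left behind by earlier probe runs:
$ kubectl exec -n kube-system fluentd-elasticsearch-abcde -- ps aux | grep -c bash
# Time one probe invocation by hand; if it blocks on the large buffer it will exceed the probe's timeoutSeconds:
$ time kubectl exec -n kube-system fluentd-elasticsearch-abcde -- /bin/sh -c '<liveness probe command from the chart>'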
We are running a custom image based on gcr.io/google-containers/fluentd-elasticsearch:v2.3.1; the only changes are the installed fluent-plugin-kubernetes_metadata_filter 2.1.2, concat, and rewrite-tag-filter gems.
Seeing the same issue. 7 of the 15 fluentd instances in our cluster have this problem: they stopped logging and sending data to Elasticsearch, and the liveness probe is not triggering a restart. When manually restarted, they start sending logs again.
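For the manual restart, a delete along these lines works, since the DaemonSet recreates the pods (same pattern as the log command below; the grep string matches our release name):
$ kubectl get pods -n kube-system | grep fluentd-elasticsearch-fluentd-elasticsearch | awk '{print $1}' | xargs -L 1 kubectl delete pod -n kube-system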
All seven show the same last log entry. (And we are having some issues with Elasticsearch as well):
$ kubectl get pods -n kube-system |grep fluentd-elasticsearch-fluentd-elasticsearch | awk '{print $1}' | xargs -L 1 kubectl logs -n kube-system --tail=1 | sort
2018-10-24 07:41:41 +0000 [warn]: [elasticsearch] failed to write data into buffer by buffer overflow action=:block
2018-10-24 07:47:59 +0000 [warn]: [elasticsearch] failed to write data into buffer by buffer overflow action=:block
2018-10-24 07:51:59 +0000 [warn]: [elasticsearch] failed to write data into buffer by buffer overflow action=:block
2018-10-24 07:54:07 +0000 [warn]: [elasticsearch] failed to write data into buffer by buffer overflow action=:block
2018-10-24 07:58:30 +0000 [warn]: [elasticsearch] failed to write data into buffer by buffer overflow action=:block
2018-10-24 08:29:58 +0000 [warn]: [elasticsearch] failed to write data into buffer by buffer overflow action=:block
2018-10-24 09:24:20 +0000 [warn]: [elasticsearch] failed to write data into buffer by buffer overflow action=:block
...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
Is there any news on this issue? I am still experiencing it with a fresh deployment of this chart on Kubernetes v1.12.3.
Same here - the liveness probe hangs in a high-load environment.