Symptoms:
I can see statistics from the Beat (no failures, an even event rate), but the logs do not appear in Elasticsearch.
Setup:
Filebeat (+kubernetes +cloud) -> logstash -> elasticsearch
Cure:
Restart Filebeat. Once restarted, the logs fill in (even some historic ones; not sure whether everything came through or just some).
Discovery:
Filebeat is following the files (verified in the position/registry file).
I've also attached a log from the failed Filebeat (up to the point where it starts reading files):
somefilebeatlog.txt
I've found a few things:
{"level":"error","timestamp":"2018-08-06T16:42:03.411Z","caller":"kubernetes/util.go:73","message":"kubernetes: Querying for pod failed with error: %!(EXTRA string=performing request: Get https://100.64.0.1:443/api/v1/namespaces/monitoring/pods/filebeat-notprod-5c9v7: dial tcp 100.64.0.1:443: i/o timeout)"}
and
{"level":"info","timestamp":"2018-08-06T16:41:33.409Z","caller":"add_cloud_metadata/add_cloud_metadata.go:297","message":"add_cloud_metadata: hosting provider type not detected."}
I checked on the first one; parsing the logs, I found another Filebeat pod which is not sending data. I have not checked on the second message yet.
A few more messages that appear in the logs from time to time (not sure they are actually errors):
{"level":"error","timestamp":"2018-08-07T10:53:53.129Z","caller":"kubernetes/watcher.go:195","message":"kubernetes: Watching API error EOF"}
{"level":"info","timestamp":"2018-08-07T10:53:53.129Z","caller":"kubernetes/watcher.go:179","message":"kubernetes: Watching API for resource events"}
{"level":"warn","timestamp":"2018-08-06T16:43:40.418Z","caller":"transport/tcp.go:36","message":"DNS lookup failure \"elastic.xxxxxxxxxx\": lookup elastic.xxxxxxxxxx on 100.64.0.10:53: read udp 100.117.158.209:43142->100.64.0.10:53: i/o timeout"}
{"level":"warn","timestamp":"2018-08-06T16:43:44.551Z","caller":"transport/tcp.go:36","message":"DNS lookup failure \"log-ingest.xxxxxxxxxx\": lookup log-ingest.xxxxxxxxxx on 100.64.0.10:53: read udp 100.117.158.209:48035->100.64.0.10:53: i/o timeout"}
I'm not sure yet what causes the problem or how to fix it. Maybe if processors are defined but unable to detect the meta information (e.g. kubernetes, aws), that should be a fatal error and the application should exit so it can be restarted. Or maybe it should retry several times before exiting.
The problem is that there is no health check to run and no logs to look at (as they are not coming in) to proactively detect the issue and kill the pod.
Here are sample metrics. There are two restarts: one I did to check whether a restart helps, the other was part of a scripted restart. After the first restart, logs started coming in.
Please post your question on Discuss: https://discuss.elastic.co/
GH issues are reserved for bugs and enhancement requests.
This is a bug.
It is not a misconfiguration. I clearly stated the problem I faced and provided some evidence and some ideas about why it is happening.
I hoped it would help to identify the problem and fix it.
Sorry. Reopened the issue.
@evilezh Can you add the version of Filebeat you are using? This will help us narrow down the problem.
The latest version is installed -> 6.3.2.
We did some node reboots, and it seems the error is most likely to happen then.
Ok, I had time to drill down into the problem.
So, taking all the information into account, the logs are sent properly, just without the meta information, and due to the lack of meta information the pipelines are not processing the logs properly.
So: if the meta information is not gathered at startup, it never gets gathered at all, which leads to improperly tagged logs that are not processed accordingly.
As the meta information is very important for sorting out the logs (e.g. which server they come from, which application, etc.), I would rather have Filebeat fail (or have such an option) if the meta information can't be retrieved. Without it, those logs mean nothing. I also can't re-ingest those logs later, as the cursor has already moved and the logs are acked.
It does not explain one thing: how some historic logs re-appeared after I did the restart. But at least it's something.
Btw, it seems the problem really occurs after a server restart/start. Probably the Filebeat pod gets started before the network is fully available (calico, DNS, etc.).
As a solution, it would probably be great to have an option to fail Filebeat if some processors are not loaded properly, or to have a health check for this, some API to ping.
I think I have this problem too. And the last time only a reboot of the node fixed it.
On machine startup the host meta-data can change if some service is adding/updating/removing host meta-data during startup. E.g. some cloud Linux distributions (on AWS, for example) run a service named cloud-init. By default an image still has the host meta-data used while building the image; only after cloud-init has finished are these meta-data updated.
Beats can not tell how accurate the host meta-data are. The correct solution is to properly configure the service startup order. If you can not control the startup order, you could enhance the docker image with a small script to wait for a known 'signal' on startup; a sketch of what that could look like follows below.
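Just to illustrate the idea (a rough sketch, not an official recommendation: the image tag, the endpoint, and the final start command are placeholders, and it assumes curl and a shell are available in the image), the container command could be wrapped in a small wait loop, e.g. polling the EC2 instance metadata endpoint before starting Filebeat:
containers:
  - name: filebeat
    image: docker.elastic.co/beats/filebeat:6.3.2   # placeholder tag
    command: ['sh', '-c']
    args:
      - |
        # wait until the EC2 instance metadata endpoint answers, so add_cloud_metadata
        # has a chance to detect the hosting provider before events start flowing
        until curl -s -o /dev/null --max-time 2 http://169.254.169.254/latest/meta-data/; do
          echo "waiting for instance metadata"; sleep 5
        done
        # placeholder: start the Beat the way your image/manifest normally does
        exec filebeat -e -c /etc/filebeat.yml
The same pattern works for any other 'ready' signal, e.g. waiting until DNS can resolve the Logstash/Elasticsearch hosts.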
I don't know if it's a bug or not. I don't think the problem lies with Filebeat, but with the k8s pod startup order.
I agree with @kvch's original answer of first posting questions/problems on Discuss. And so does the greeting text when opening a new issue:
Please post all questions and issues on https://discuss.elastic.co/c/beats
before opening a Github Issue. Your questions will reach a wider audience there,
and if we confirm that there is a bug, then you can open a new issue.
Let me explain one more time.
This is a bug. Or at least behaviour that would not usually be expected (compared to other software products, or to what I would expect by default).
We all operate under the assumption that something might and will fail. In such cases, for production-grade products (assuming Beats are considered production grade), I expect a defined behaviour and possible ways to recover.
In the scheduler world (kubernetes, mesos) we have extra components like CNI, and we are effectively running software on a software-defined network, DNS services, etc. So there are many variables and many ways in which some component might fail or not work exactly as expected.
In this particular case, the Beat starts up and fails to collect the meta-data. Log files without meta information (container, application, etc.) are completely useless, as it is impossible to tell which application wrote a given log line. I would expect Filebeat to exit on such errors by default. All logs shipped without proper meta information have to be considered lost, which is not acceptable in production environments.
As I explained, I run on a k8s cluster. I have a temporary solution in place which checks a few URLs used by the Filebeat processors and fails if it can't connect; a sketch of that idea is below.
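For illustration, a minimal sketch of that kind of check as an exec liveness probe on the Filebeat DaemonSet (assuming a shell and curl are available in the image; the API address comes from the standard in-cluster env vars, and the timings are placeholders). It only proves the Kubernetes API is reachable, not that the processor actually obtained its metadata:
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # fail the probe (and eventually restart the pod) when the API used by
      # add_kubernetes_metadata is not reachable from this pod
      - 'curl -k -s -o /dev/null --max-time 5 https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}'
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 3
If the libbeat HTTP endpoint is enabled (http.enabled: true, default port 5066), the probe could ping that instead, but as far as I can tell it only shows that the process is alive, not whether the processors got their metadata.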
So, what I'm asking for: please fail the Beat by default if any of the processors fails to work, or add an option to exit the Beat if a processor failed.
Let's put it this way: if I do not expect a processor to run on a system, I will not configure it. If I do configure it, then I expect it to work.
Hi everyone, and sorry for the late response.
The add_kubernetes_metadata processor doesn't block the output, so if we haven't retrieved the right metadata by the moment the logs are read, we ship them without it. This is not uncommon during startup.
To avoid this, you can use autodiscover as an alternative to a static config. As it starts reading logs based on Kubernetes events, metadata is guaranteed to be present in all cases. A simple config like this would do:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      hints.enabled: true
Check this blog post for more info on hints based autodiscover: https://www.elastic.co/blog/docker-and-kubernetes-hints-based-autodiscover-with-beats
I agree that since 6.3 this can be used as a workaround. But it covers just this single case; there is also the cloud processor, and there are the other Beats.
I'm just asking for an option to quit the Beat in case of a processor failure (e.g. a network I/O or DNS failure) and not process logs. That would be consistent across all Beats, and I would have a predictable behaviour.
@evilezh Were you able to solve this problem?
I have made some progress and identified that this may be related to performance issues with etcd.
After upgrading from etcd2 to etcd3, we noticed that these errors disappeared from our Filebeat logs.
Unfortunately I also upgraded from Filebeat 6.7.1 to 6.8.3, so it's not totally clear whether that may have had an impact.
Can anyone else with this problem share their etcd version, to see if there is a relationship?
I'm running filebeat 7.9.0-SNAPSHOT (to work around another issue https://github.com/elastic/beats/issues/11834)
The config I use is:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/lib/docker/containers/${data.container.id}/*.log
        tags: ["my_cluster_name"]
        ignore_older: 4h
processors:
  - add_fields:
      target: ''
      fields:
        k8s_cluster_name: "my_cluster_name"
And I still got:
kubernetes: Querying for pod failed with error: Get "https://100.64.0.1:443/api/v1/namespaces/kube-system/pods/filebeat-4k8sd": dial tcp 100.64.0.1:443: i/o timeout
Only in my case it caused the input from at least one pod to stop completely. A restart of Filebeat seemed to help.
To anyone who may be interested: I ended up making sure the API is available in an init container before launching the actual Filebeat:
initContainers:
  - name: init-k8s-api-checker
    image: curlimages/curl:7.72.0
    command: ['sh', '-c', 'until $(curl https://100.64.0.1:443 -k -s -o /dev/null); do sleep 10; echo \"retrying\"; done; echo \"k8s api responded\"']
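One small follow-up on that approach (just a sketch, not tested here): the API server address does not have to be hard-coded, because every pod (including its init containers) gets the standard KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables injected, so the same wait loop can be written without the cluster-specific 100.64.0.1:
initContainers:
  - name: init-k8s-api-checker
    image: curlimages/curl:7.72.0
    # same wait loop, but using the in-cluster service env vars instead of a fixed IP
    command: ['sh', '-c', 'until curl -k -s -o /dev/null "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"; do sleep 10; echo retrying; done; echo "k8s api responded"']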
I'm still having this problem with Elasticsearch 7.9.3 (in docker) and Filebeat 7.9.3 in Kubernetes. I am also running Filebeat 7.9.3 on docker servers and I do get logs from those, but not from my Kubernetes nodes. I've followed the configuration recommended by @ChrsMark but it is not working.
Here is my yaml:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          templates:
            - config:
                - type: container
                  paths:
                    - /var/log/containers/*${data.kubernetes.container.id}.log
    processors:
      - add_host_metadata:
      - add_kubernetes_metadata:
    monitoring:
      enabled: true
    output.elasticsearch:
      hosts: '${ELASTICSEARCH_HOST}'
      username: ${ELASTICSEARCH_USERNAME}
      password: ${ELASTICSEARCH_PASSWORD}
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.9.3
          args: [
            "-c", "/etc/filebeat.yml",
            "-e",
            "-d", "autodiscover",
            "-d", "kubernetes",
          ]
          env:
            - name: ELASTICSEARCH_HOST
              value: "hostname"
            - name: KIBANA_HOST
              value: "hostname"
            - name: ELASTICSEARCH_USERNAME
              value: elastic
            - name: ELASTICSEARCH_PASSWORD
              value: changeme
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            #privileged: true
          resources:
            limits:
              memory: 400Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: config
              mountPath: /etc/filebeat.yml
              readOnly: true
              subPath: filebeat.yml
            - name: data
              mountPath: /usr/share/filebeat/data
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: config
          configMap:
            defaultMode: 0640
            name: filebeat-config
        - name: varlog
          hostPath:
            path: /var/log
        # data folder stores a registry of read status for all files, so we don't send everything again on a Filebeat pod restart
        - name: data
          hostPath:
            # When filebeat runs as non-root user, this directory needs to be writable by group (g+w).
            path: /var/lib/filebeat-data
            type: DirectoryOrCreate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
  - kind: ServiceAccount
    name: filebeat
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - namespaces
      - pods
    verbs:
      - get
      - watch
      - list
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
---
It seems to be finding the container/pod logfiles according to the yaml config, but I do see a strange line in the logs (Configuration template cannot be resolved: field 'data.kubernetes.container.id' not available in event or environment accessing 'paths' (source:'/etc/filebeat.yml')). Here is an excerpt:
2020-10-27T13:02:09.145Z DEBUG [autodiscover] autodiscover/autodiscover.go:195 Generated config: {
"paths": [
"/var/log/containers/*894a21c98d8cee4cd61e4dc2c4a281221ae9e915adead904f236185b2f7f5468.log"
],
"type": "container"
}
2020-10-27T13:02:09.145Z DEBUG [autodiscover] autodiscover/autodiscover.go:259 Got a meta field in the event
2020-10-27T13:02:09.145Z DEBUG [autodiscover] template/config.go:156 Configuration template cannot be resolved: field 'data.kubernetes.container.id' not available in event or environment accessing 'paths' (source:'/etc/filebeat.yml')
2020-10-27T13:02:09.146Z INFO log/input.go:157 Configured paths: [/var/log/containers/*894a21c98d8cee4cd61e4dc2c4a281221ae9e915adead904f236185b2f7f5468.log]
2020-10-27T13:02:09.146Z DEBUG [autodiscover] cfgfile/list.go:63 Starting reload procedure, current runners: 0
2020-10-27T13:02:09.146Z DEBUG [autodiscover] cfgfile/list.go:81 Start list: 1, Stop list: 0
2020-10-27T13:02:09.146Z DEBUG [autodiscover] template/config.go:156 Configuration template cannot be resolved: field 'data.kubernetes.container.id' not available in event or environment accessing 'paths' (source:'/etc/filebeat.yml')
2020-10-27T13:02:09.146Z INFO log/input.go:157 Configured paths: [/var/log/containers/*894a21c98d8cee4cd61e4dc2c4a281221ae9e915adead904f236185b2f7f5468.log]
2020-10-27T13:02:09.146Z DEBUG [autodiscover] cfgfile/list.go:100 Starting runner: input [type=container]
2020-10-27T13:02:09.146Z DEBUG [autodiscover] autodiscover/autodiscover.go:174 Got a start event: map[config:[] host:10.209.72.50 id:20a7e6f1-df54-4c72-84de-d1c67895ba1b kubernetes:{"annotations":{"kubernetes":{"io/psp":"speaker"},"linkerd":{"io/inject":"enabled"},"prometheus":{"io/port":"7472","io/scrape":"true"}},"labels":{"app":"metallb","component":"speaker","controller-revision-hash":"787547c99f","pod-template-generation":"1"},"namespace":"metallb-system","node":{"name":"k8s-bdlk-001"},"pod":{"name":"speaker-q2kjr","uid":"20a7e6f1-df54-4c72-84de-d1c67895ba1b"}} meta:{"kubernetes":{"labels":{"app":"metallb","component":"speaker","controller-revision-hash":"787547c99f","pod-template-generation":"1"},"namespace":"metallb-system","node":{"name":"k8s-bdlk-001"},"pod":{"name":"speaker-q2kjr","uid":"20a7e6f1-df54-4c72-84de-d1c67895ba1b"}}} ports:{"monitoring":7472} provider:1e5d4196-e211-4f96-af59-29cee8fd164b start:true]
2020-10-27T13:02:09.146Z DEBUG [autodiscover] autodiscover/autodiscover.go:259 Got a meta field in the event
2020-10-27T13:02:09.146Z DEBUG [autodiscover] autodiscover/autodiscover.go:174 Got a start event: map[config:[0xc0006328a0] host:10.209.72.50 id:20a7e6f1-df54-4c72-84de-d1c67895ba1b.speaker kubernetes:{"annotations":{"kubernetes":{"io/psp":"speaker"},"linkerd":{"io/inject":"enabled"},"prometheus":{"io/port":"7472","io/scrape":"true"}},"container":{"id":"e8a250df60c9334ba2c09917a586b1d3a0655f4799bade7cf0359485270151c1","image":"metallb/speaker:v0.8.2","name":"speaker","runtime":"docker"},"labels":{"app":"metallb","component":"speaker","controller-revision-hash":"787547c99f","pod-template-generation":"1"},"namespace":"metallb-system","node":{"name":"k8s-bdlk-001"},"pod":{"name":"speaker-q2kjr","uid":"20a7e6f1-df54-4c72-84de-d1c67895ba1b"}} meta:{"container":{"id":"e8a250df60c9334ba2c09917a586b1d3a0655f4799bade7cf0359485270151c1","image":{"name":"metallb/speaker:v0.8.2"},"runtime":"docker"},"kubernetes":{"container":{"image":"metallb/speaker:v0.8.2","name":"speaker"},"labels":{"app":"metallb","component":"speaker","controller-revision-hash":"787547c99f","pod-template-generation":"1"},"namespace":"metallb-system","node":{"name":"k8s-bdlk-001"},"pod":{"name":"speaker-q2kjr","uid":"20a7e6f1-df54-4c72-84de-d1c67895ba1b"}}} port:7472 provider:1e5d4196-e211-4f96-af59-29cee8fd164b start:true]
@sgreszcz answered at https://github.com/elastic/beats/issues/11834#issuecomment-717292632
Confirmed that the new manifest supplied works. Looking at what the differences were: https://github.com/elastic/beats/issues/11834#issuecomment-717434316