Symptoms:
I can see statistics from the Beat (no failures, an even event rate), but the logs do not appear in Elasticsearch.
Setup:
Filebeat (+kubernetes +cloud) -> logstash -> elasticsearch
Cure:
Restart Filebeat. Once restarted, the logs fill in (even some historic ones; not sure whether everything came through or just some).
Discovery:
Filebeat is following the files (verified in the position/registry file).
I've also attached a log from the failed Filebeat (up to the point where it starts reading files):
somefilebeatlog.txt
I've found a few things:
{"level":"error","timestamp":"2018-08-06T16:42:03.411Z","caller":"kubernetes/util.go:73","message":"kubernetes: Querying for pod failed with error: %!(EXTRA string=performing request: Get https://100.64.0.1:443/api/v1/namespaces/monitoring/pods/filebeat-notprod-5c9v7: dial tcp 100.64.0.1:443: i/o timeout)"}
and
{"level":"info","timestamp":"2018-08-06T16:41:33.409Z","caller":"add_cloud_metadata/add_cloud_metadata.go:297","message":"add_cloud_metadata: hosting provider type not detected."}
I checked on the first one; parsing the logs, I found another Filebeat pod which is not sending data. I have not checked on the second message yet.
A few more messages that appear in the logs from time to time (not sure they are actually errors):
{"level":"error","timestamp":"2018-08-07T10:53:53.129Z","caller":"kubernetes/watcher.go:195","message":"kubernetes: Watching API error EOF"}
{"level":"info","timestamp":"2018-08-07T10:53:53.129Z","caller":"kubernetes/watcher.go:179","message":"kubernetes: Watching API for resource events"}
{"level":"warn","timestamp":"2018-08-06T16:43:40.418Z","caller":"transport/tcp.go:36","message":"DNS lookup failure \"elastic.xxxxxxxxxx\": lookup elastic.xxxxxxxxxx on 100.64.0.10:53: read udp 100.117.158.209:43142->100.64.0.10:53: i/o timeout"}
{"level":"warn","timestamp":"2018-08-06T16:43:44.551Z","caller":"transport/tcp.go:36","message":"DNS lookup failure \"log-ingest.xxxxxxxxxx\": lookup log-ingest.xxxxxxxxxx on 100.64.0.10:53: read udp 100.117.158.209:48035->100.64.0.10:53: i/o timeout"}
I'm not sure yet what causes the problem or how to fix it. Maybe if processors are defined but unable to detect the meta information (e.g. kubernetes, aws), that should be a fatal error and the application should exit so it can be restarted. Or maybe it should retry several times before exiting.
The problem is that there is no health check to run and no logs to look at (as they are not coming in) to proactively detect the issue and kill the pod.
Here are sample metrics. There are two restarts: one I did to check whether a restart helps, the other was part of a scripted restart. After the first restart, logs started coming in.
Please post your question on Discuss: https://discuss.elastic.co/
GH issues are reserved for bugs and enhancement requests.
This is a bug.
It is not a misconfiguration. I clearly stated the problem I faced and provided some evidence and some ideas about why it is happening.
I hoped it would help to identify the problem and fix it.
Sorry. Reopened the issue.
@evilezh Can you add the version of Filebeat you are using? This will help us narrow down the problem.
The latest version is installed -> 6.3.2.
We did some node reboots, and it seems the error is most likely to happen then.
Ok, I had time to drill down into the problem.
So, taking all the information into account, the logs are sent properly, just without the meta information, and due to the lack of meta information the pipelines are not processing the logs properly.
So: if the meta information is not gathered at startup, it never gets gathered at all, which leads to improperly tagged logs that are not processed accordingly.
As the meta information is very important for sorting out the logs (e.g. which server they come from, which application, etc.), I would rather have Filebeat fail (or have such an option) if the meta information can't be retrieved. Without it, those logs mean nothing. I also can't re-ingest those logs later, as the cursor has already moved and the logs are acked.
It does not explain one thing: how some historic logs re-appeared after I did the restart. But at least it's something.
Btw, it seems the problem really occurs after a server restart/start. Probably the Filebeat pod gets started before the network is fully available (calico, DNS, etc.).
As a solution, it would probably be great to have an option to fail Filebeat if some processors are not loaded properly, or to have a health check for this, some API to ping.
I think I have this problem too. And the last time only a reboot of the node fixed it.
On machine startup the host meta-data can change if some service is adding/updating/removing host meta-data during startup. E.g. some cloud Linux distributions (on AWS, for example) run a service named cloud-init. By default an image still has the host meta-data used while building the image; only after cloud-init has finished are these meta-data updated.
Beats can not tell how accurate the host meta-data are. The correct solution is to properly configure the service startup order. If you can not control the startup order, you could enhance the docker image with a small script to wait for a known 'signal' on startup; a sketch of what that could look like follows below.
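Just to illustrate the idea (a rough sketch, not an official recommendation: the image tag, the endpoint, and the final start command are placeholders, and it assumes curl and a shell are available in the image), the container command could be wrapped in a small wait loop, e.g. polling the EC2 instance metadata endpoint before starting Filebeat:
containers:
  - name: filebeat
    image: docker.elastic.co/beats/filebeat:6.3.2   # placeholder tag
    command: ['sh', '-c']
    args:
      - |
        # wait until the EC2 instance metadata endpoint answers, so add_cloud_metadata
        # has a chance to detect the hosting provider before events start flowing
        until curl -s -o /dev/null --max-time 2 http://169.254.169.254/latest/meta-data/; do
          echo "waiting for instance metadata"; sleep 5
        done
        # placeholder: start the Beat the way your image/manifest normally does
        exec filebeat -e -c /etc/filebeat.yml
The same pattern works for any other 'ready' signal, e.g. waiting until DNS can resolve the Logstash/Elasticsearch hosts.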
I don't know if it's a bug or not. I don't think the problem lies with Filebeat, but with the k8s pod startup order.
I agree with @kvch's original answer of first posting questions/problems on Discuss. And so does the greeting text when opening a new issue:
Please post all questions and issues on https://discuss.elastic.co/c/beats
before opening a Github Issue. Your questions will reach a wider audience there,
and if we confirm that there is a bug, then you can open a new issue.
Let me explain one more time.
This is a bug. Or at least behaviour that would not usually be expected (compared to other software products, or to what I would expect by default).
We all operate under the assumption that something might and will fail. In such cases, for production-grade products (assuming Beats are considered production grade), I expect a defined behaviour and possible ways to recover.
In the scheduler world (kubernetes, mesos) we have extra components like CNI, and we are effectively running software on a software-defined network, DNS services, etc. So there are many variables and many ways in which some component might fail or not work exactly as expected.
In this particular case, the Beat starts up and fails to collect the meta-data. Log files without meta information (container, application, etc.) are completely useless, as it is impossible to tell which application wrote a given log line. I would expect Filebeat to exit on such errors by default. All logs shipped without proper meta information have to be considered lost, which is not acceptable in production environments.
As I explained, I run on a k8s cluster. I have a temporary solution in place which checks a few URLs used by the Filebeat processors and fails if it can't connect; a sketch of that idea is below.
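For illustration, a minimal sketch of that kind of check as an exec liveness probe on the Filebeat DaemonSet (assuming a shell and curl are available in the image; the API address comes from the standard in-cluster env vars, and the timings are placeholders). It only proves the Kubernetes API is reachable, not that the processor actually obtained its metadata:
livenessProbe:
  exec:
    command:
      - sh
      - -c
      # fail the probe (and eventually restart the pod) when the API used by
      # add_kubernetes_metadata is not reachable from this pod
      - 'curl -k -s -o /dev/null --max-time 5 https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}'
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 3
If the libbeat HTTP endpoint is enabled (http.enabled: true, default port 5066), the probe could ping that instead, but as far as I can tell it only shows that the process is alive, not whether the processors got their metadata.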
So, what I'm asking for: please fail the Beat by default if any of the processors fails to work, or add an option to exit the Beat if a processor failed.
Let's put it this way: if I do not expect a processor to run on a system, I will not configure it. If I do configure it, then I expect it to work.
Hi everyone, and sorry for the late response.
The add_kubernetes_metadata processor doesn't block the output, so if we haven't retrieved the right metadata by the moment the logs are read, we ship them without it. This is not uncommon during startup.
To avoid this, you can use autodiscover as an alternative to a static config. As it starts reading logs based on Kubernetes events, metadata is guaranteed to be present in all cases. A simple config like this would do:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      hints.enabled: true
Check this blog post for more info on hints based autodiscover: https://www.elastic.co/blog/docker-and-kubernetes-hints-based-autodiscover-with-beats
I agree that since 6.3 this can be used as a workaround. But it covers just this single case; there is also the cloud processor, and there are the other Beats.
I'm just asking for an option to quit the Beat in case of a processor failure (e.g. a network I/O or DNS failure) and not process logs. That would be consistent across all Beats, and I would have a predictable behaviour.
@evilezh Were you able to solve this problem?
I have made some progress and identified that this may be related to performance issues with etcd.
After upgrading from etcd2 to etcd3, we noticed that these errors disappeared from our Filebeat logs.
Unfortunately I also upgraded from Filebeat 6.7.1 to 6.8.3, so it's not totally clear whether that may have had an impact.
Can anyone else with this problem share their etcd version, to see if there is a relationship?
I'm running filebeat 7.9.0-SNAPSHOT (to work around another issue https://github.com/elastic/beats/issues/11834)
The config I use is:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/lib/docker/containers/${data.container.id}/*.log
        tags: ["my_cluster_name"]
        ignore_older: 4h
processors:
  - add_fields:
      target: ''
      fields:
        k8s_cluster_name: "my_cluster_name"
And I still got:
kubernetes: Querying for pod failed with error: Get "https://100.64.0.1:443/api/v1/namespaces/kube-system/pods/filebeat-4k8sd": dial tcp 100.64.0.1:443: i/o timeout
Only in my case it caused the input from at least one pod to stop completely. A restart of Filebeat seemed to help.
To anyone who may be interested: I ended up making sure the API is available in an init container before launching the actual Filebeat:
initContainers:
  - name: init-k8s-api-checker
    image: curlimages/curl:7.72.0
    command: ['sh', '-c', 'until $(curl https://100.64.0.1:443 -k -s -o /dev/null); do sleep 10; echo \"retrying\"; done; echo \"k8s api responded\"']
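One small follow-up on that approach (just a sketch, not tested here): the API server address does not have to be hard-coded, because every pod (including its init containers) gets the standard KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables injected, so the same wait loop can be written without the cluster-specific 100.64.0.1:
initContainers:
  - name: init-k8s-api-checker
    image: curlimages/curl:7.72.0
    # same wait loop, but using the in-cluster service env vars instead of a fixed IP
    command: ['sh', '-c', 'until curl -k -s -o /dev/null "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"; do sleep 10; echo retrying; done; echo "k8s api responded"']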
I'm still having this problem with Elasticsearch 7.9.3 (in docker) and Filebeat 7.9.3 in Kubernetes. I am also running Filebeat 7.9.3 on docker servers and I do get logs from those, but not from my Kubernetes nodes. I've followed the configuration recommended by @ChrsMark but it is not working.
Here is my yaml:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          templates:
            - config:
                - type: container
                  paths:
                    - /var/log/containers/*${data.kubernetes.container.id}.log
    processors:
      - add_host_metadata:
      - add_kubernetes_metadata:
    monitoring:
      enabled: true
    output.elasticsearch:
      hosts: '${ELASTICSEARCH_HOST}'
      username: ${ELASTICSEARCH_USERNAME}
      password: ${ELASTICSEARCH_PASSWORD}
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.9.3
          args: [
            "-c", "/etc/filebeat.yml",
            "-e",
            "-d", "autodiscover",
            "-d", "kubernetes",
          ]
          env:
            - name: ELASTICSEARCH_HOST
              value: "hostname"
            - name: KIBANA_HOST
              value: "hostname"
            - name: ELASTICSEARCH_USERNAME
              value: elastic
            - name: ELASTICSEARCH_PASSWORD
              value: changeme
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            #privileged: true
          resources:
            limits:
              memory: 400Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: config
              mountPath: /etc/filebeat.yml
              readOnly: true
              subPath: filebeat.yml
            - name: data
              mountPath: /usr/share/filebeat/data
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: config
          configMap:
            defaultMode: 0640
            name: filebeat-config
        - name: varlog
          hostPath:
            path: /var/log
        # data folder stores a registry of read status for all files, so we don't send everything again on a Filebeat pod restart
        - name: data
          hostPath:
            # When filebeat runs as non-root user, this directory needs to be writable by group (g+w).
            path: /var/lib/filebeat-data
            type: DirectoryOrCreate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
  - kind: ServiceAccount
    name: filebeat
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - namespaces
      - pods
    verbs:
      - get
      - watch
      - list
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
---
It seems to be finding the container/pod logfiles according to the yaml config, but I do see a strange line in the logs (Configuration template cannot be resolved: field 'data.kubernetes.container.id' not available in event or environment accessing 'paths' (source:'/etc/filebeat.yml')). Here is an excerpt:
2020-10-27T13:02:09.145Z DEBUG [autodiscover] autodiscover/autodiscover.go:195 Generated config: {
"paths": [
"/var/log/containers/*894a21c98d8cee4cd61e4dc2c4a281221ae9e915adead904f236185b2f7f5468.log"
],
"type": "container"
}
2020-10-27T13:02:09.145Z DEBUG [autodiscover] autodiscover/autodiscover.go:259 Got a meta field in the event
2020-10-27T13:02:09.145Z DEBUG [autodiscover] template/config.go:156 Configuration template cannot be resolved: field 'data.kubernetes.container.id' not available in event or environment accessing 'paths' (source:'/etc/filebeat.yml')
2020-10-27T13:02:09.146Z INFO log/input.go:157 Configured paths: [/var/log/containers/*894a21c98d8cee4cd61e4dc2c4a281221ae9e915adead904f236185b2f7f5468.log]
2020-10-27T13:02:09.146Z DEBUG [autodiscover] cfgfile/list.go:63 Starting reload procedure, current runners: 0
2020-10-27T13:02:09.146Z DEBUG [autodiscover] cfgfile/list.go:81 Start list: 1, Stop list: 0
2020-10-27T13:02:09.146Z DEBUG [autodiscover] template/config.go:156 Configuration template cannot be resolved: field 'data.kubernetes.container.id' not available in event or environment accessing 'paths' (source:'/etc/filebeat.yml')
2020-10-27T13:02:09.146Z INFO log/input.go:157 Configured paths: [/var/log/containers/*894a21c98d8cee4cd61e4dc2c4a281221ae9e915adead904f236185b2f7f5468.log]
2020-10-27T13:02:09.146Z DEBUG [autodiscover] cfgfile/list.go:100 Starting runner: input [type=container]
2020-10-27T13:02:09.146Z DEBUG [autodiscover] autodiscover/autodiscover.go:174 Got a start event: map[config:[] host:10.209.72.50 id:20a7e6f1-df54-4c72-84de-d1c67895ba1b kubernetes:{"annotations":{"kubernetes":{"io/psp":"speaker"},"linkerd":{"io/inject":"enabled"},"prometheus":{"io/port":"7472","io/scrape":"true"}},"labels":{"app":"metallb","component":"speaker","controller-revision-hash":"787547c99f","pod-template-generation":"1"},"namespace":"metallb-system","node":{"name":"k8s-bdlk-001"},"pod":{"name":"speaker-q2kjr","uid":"20a7e6f1-df54-4c72-84de-d1c67895ba1b"}} meta:{"kubernetes":{"labels":{"app":"metallb","component":"speaker","controller-revision-hash":"787547c99f","pod-template-generation":"1"},"namespace":"metallb-system","node":{"name":"k8s-bdlk-001"},"pod":{"name":"speaker-q2kjr","uid":"20a7e6f1-df54-4c72-84de-d1c67895ba1b"}}} ports:{"monitoring":7472} provider:1e5d4196-e211-4f96-af59-29cee8fd164b start:true]
2020-10-27T13:02:09.146Z DEBUG [autodiscover] autodiscover/autodiscover.go:259 Got a meta field in the event
2020-10-27T13:02:09.146Z DEBUG [autodiscover] autodiscover/autodiscover.go:174 Got a start event: map[config:[0xc0006328a0] host:10.209.72.50 id:20a7e6f1-df54-4c72-84de-d1c67895ba1b.speaker kubernetes:{"annotations":{"kubernetes":{"io/psp":"speaker"},"linkerd":{"io/inject":"enabled"},"prometheus":{"io/port":"7472","io/scrape":"true"}},"container":{"id":"e8a250df60c9334ba2c09917a586b1d3a0655f4799bade7cf0359485270151c1","image":"metallb/speaker:v0.8.2","name":"speaker","runtime":"docker"},"labels":{"app":"metallb","component":"speaker","controller-revision-hash":"787547c99f","pod-template-generation":"1"},"namespace":"metallb-system","node":{"name":"k8s-bdlk-001"},"pod":{"name":"speaker-q2kjr","uid":"20a7e6f1-df54-4c72-84de-d1c67895ba1b"}} meta:{"container":{"id":"e8a250df60c9334ba2c09917a586b1d3a0655f4799bade7cf0359485270151c1","image":{"name":"metallb/speaker:v0.8.2"},"runtime":"docker"},"kubernetes":{"container":{"image":"metallb/speaker:v0.8.2","name":"speaker"},"labels":{"app":"metallb","component":"speaker","controller-revision-hash":"787547c99f","pod-template-generation":"1"},"namespace":"metallb-system","node":{"name":"k8s-bdlk-001"},"pod":{"name":"speaker-q2kjr","uid":"20a7e6f1-df54-4c72-84de-d1c67895ba1b"}}} port:7472 provider:1e5d4196-e211-4f96-af59-29cee8fd164b start:true]
@sgreszcz answered at https://github.com/elastic/beats/issues/11834#issuecomment-717292632
Confirmed that the new manifest supplied works. Looking at what the differences were: https://github.com/elastic/beats/issues/11834#issuecomment-717434316