While investigating why some metrics stopped appearing in ES, I found that the affected metrics rely on the Kubernetes autodiscover provider.
The reason seems to be that the k8s API fails to respond in time (this is an Azure AKS cluster, and the underlying cause is probably AKS itself).
When this API call fails, metricbeat reports errors like
2019-07-26T08:11:35.600Z ERROR kubernetes/watcher.go:185 kubernetes: Performing a resource sync err performing request: Get https://10.0.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Daks-nodepool1-36691180-7&resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout for *v1.PodList
and then continues running without recovering. A retry behaviour would make sense here, I believe. It would even be better if metricbeat crashed in this case, so that Kubernetes could restart it. As it is, we have to notice the problem ourselves, go in, and kill the pod manually.
Found in metricbeat version 7.2.0 using the official Docker image.
Hi @rickardp
when you say that metricbeat continues running, I guess you mean that it keeps running as a process, but that the watcher stops and does not retry. Is that the case?
If so, I agree that's very far from the expected behaviour.
Can you please confirm and share your configuration yaml files?
Yes, that is what happened. I will dig out some config files, but in the meantime I can mention that I used almost an exact copy of the Helm stable charts (adding only the ES writer configuration and some processor steps).
I can also mention that I had to kill the metricbeat pod manually to recover.
Edit: Btw, forget about my previous comment about disk space. We had two issues that I mixed up.
Thanks for confirmation.
I'm not familiar with those Helm charts or with the autodiscover code, but I guess that with some basic configuration, making the apiserver unavailable for a minute or so will make it fail and stop reporting metrics.
Let me try that
Here's the config YAML that was in use at the time it failed:
cloud.auth: ${ELASTIC_CLOUD_AUTH}
cloud.id: ${ELASTIC_CLOUD_ID}
logging.level: info
metricbeat.autodiscover:
  providers:
    - host: ${NODE_NAME}
      include_annotations:
        - prometheus.io/scrape
      templates:
        - condition:
            contains:
              kubernetes.annotations.prometheus.io/scrape: "true"
          config:
            - hosts: ${data.host}:${data.kubernetes.annotations.prometheus.io/port}
              metricsets:
                - collector
              module: prometheus
      type: kubernetes
    - host: ${NODE_NAME}
      include_annotations:
        - metricbeat.elasticsearch/scrape
      templates:
        - condition:
            contains:
              kubernetes.annotations.metricbeat.elasticsearch/scrape: "true"
          config:
            - hosts: ${data.host}:9200
              metricsets:
                - node
                - node_stats
                - index
                - index_recovery
                - index_summary
                - shard
                - ml_job
              module: elasticsearch
              period: 30s
      type: kubernetes
metricbeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false
output.elasticsearch:
  hosts:
    - ${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}
  password: ${ELASTICSEARCH_PASSWORD}
  username: ${ELASTICSEARCH_USERNAME}
processors:
  - add_cloud_metadata: {}
  - add_fields:
      fields:
        cluster: ${CLUSTER_NAME}
      target: ""
  - drop_fields:
      fields:
        - kubernetes.labels.app
        - kubernetes.pod._module.labels.app
        - kubernetes.container._module.labels.app
setup.ilm.enabled: auto
setup.ilm.overwrite: true
setup.ilm.pattern: '{now/d}-000001'
setup.ilm.rollover_alias: metricbeat
setup.template.overwrite: true
@rickardp
I wasn't able to reproduce the issue, and looking at the code, when an error occurs the loop continues by creating a new watcher for the resource at the apiserver.
However, the behaviour you mention can happen if reaching the apiserver fails on the first try, which seems to be what you are hitting.
I'm not sure what the stance of Beats or metricbeat is in this case.
It looks like metricbeat tries not to exit as long as it can cope with an issue, unless it is a badly formatted config or Elasticsearch/Kibana being unreachable.
With Kubernetes autodiscover, it looks like if the initial sync fails on the first attempt it gives up and won't retry. Once the initial sync is done, the watch loop is error tolerant, as it respawns a new watcher on failure (roughly the flow sketched below).
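For illustration, a minimal Go sketch of that flow; syncPods and watchPods are hypothetical stand-ins, not the actual beats watcher code. The point is that the initial sync error is returned once and nothing restarts the provider, while errors in the watch loop only respawn the watcher.

    package main

    import (
        "errors"
        "fmt"
        "log"
    )

    // Hypothetical stand-ins for the real apiserver calls; not the actual beats code.
    func syncPods() error  { return errors.New("dial tcp 10.0.0.1:443: i/o timeout") }
    func watchPods() error { return errors.New("watch closed") }

    func runProvider() error {
        // Initial sync: a failure here is terminal for the provider. The beat
        // keeps running, but these metrics silently stop being collected.
        if err := syncPods(); err != nil {
            return fmt.Errorf("performing a resource sync: %v", err)
        }
        // Watch loop: errors after the initial sync only cause a new watcher
        // to be spawned, so transient apiserver failures are tolerated here.
        for {
            if err := watchPods(); err != nil {
                log.Printf("watch failed, respawning watcher: %v", err)
            }
        }
    }

    func main() {
        if err := runProvider(); err != nil {
            log.Printf("kubernetes autodiscover stopped: %v", err)
        }
    }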
@exekias do you know, or know who to ask, whether this is the intended behaviour?
I would say that's the right behavior? If we cannot connect to the API server on the first try, this may be a misconfiguration. We could keep retrying, at the risk of the user not noticing the issue.
On the other hand, if the config is good but the API server is down for whatever reason, the beat will die. But it will probably be rescheduled until the connection works?
not sure if I follow.
If current behaviour is the right one, be aware that #13051 is changing it (I was testing it when it got merged)
In any case, @rickardp, as of the referenced merge your problem should be solved; it should ship in the next release.
You are right! I think we can close this then
Yes, this happened when it couldn't contact the API server the first time. I did not observe what happened when it was successful and later failed, but it is indeed possible it recovers in that case.
Good to hear it's solved then! A full restart of the nodes took care of the i/o timeout issue. The joy of Azure AKS :/
I don't think this is the correct behavior. For example, if you are using Istio, the sidecar may not be fully loaded for a few moments. An immediate request to the Kubernetes API might fail, but the same request succeeds after a short delay.
Can I suggest an exponential backoff, something like the sketch below? That seems to be the strategy for connection failures to Elasticsearch.
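To make the suggestion concrete, a minimal sketch of what I mean, with a hypothetical syncOnce standing in for the initial resource sync (not the actual beats code): retry with an exponentially growing, capped delay instead of giving up after the first failure.

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // retryWithBackoff keeps retrying syncOnce with an exponentially growing,
    // capped delay until it succeeds.
    func retryWithBackoff(syncOnce func() error, maxDelay time.Duration) {
        delay := time.Second
        for attempt := 1; ; attempt++ {
            err := syncOnce()
            if err == nil {
                return
            }
            log.Printf("resource sync attempt %d failed: %v, retrying in %s", attempt, err, delay)
            time.Sleep(delay)
            delay *= 2
            if delay > maxDelay {
                delay = maxDelay
            }
        }
    }

    func main() {
        attempts := 0
        retryWithBackoff(func() error {
            attempts++
            if attempts < 3 {
                // Simulates the apiserver (or an Istio sidecar) not being ready yet.
                return errors.New("dial tcp 10.0.0.1:443: i/o timeout")
            }
            return nil
        }, 30*time.Second)
        log.Printf("initial sync succeeded after %d attempts", attempts)
    }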
Hi @prestonvanloon,
Thanks for the input. We updated the behavior by moving to the official k8s client here: https://github.com/elastic/beats/pull/13051. We do a lazy start now, so any temporary error should make the client retry, with backoff.
@exekias I am seeing the same behavior as @prestonvanloon when using filebeat 7.4 with Istio. Filebeat starts before the sidecar and the k8s client never recovers. As a result, I never get any of the Kubernetes metadata with the container logs. I overrode the entrypoint of the filebeat DaemonSet, prefixing a short sleep, which allowed the client to start successfully:
- command: ["/bin/sh", "-c"]
  args:
    - sleep 10s;
      /usr/local/bin/docker-entrypoint -e -E http.enabled=true;
Do you have any better suggestions to work around this issue? Is the fix described above included in version 7.4.1?
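One option I am considering instead of a fixed sleep is to poll the sidecar readiness endpoint before starting the beat, something like the snippet below. This assumes curl is available in the filebeat image and that the istio-agent readiness endpoint is exposed on localhost:15020/healthz/ready, which may differ per Istio version, so verify both for your setup:

- command: ["/bin/sh", "-c"]
  args:
    - until curl -fsI http://localhost:15020/healthz/ready; do
        echo "waiting for the Istio sidecar"; sleep 1;
      done;
      /usr/local/bin/docker-entrypoint -e -E http.enabled=true;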