While investigating why some metrics stopped appearing in ES, I found that the affected metrics rely on the Kubernetes autodiscover provider.
The reason seems to be that the k8s API fails to respond in time (this is an Azure AKS cluster, and the underlying cause is probably AKS itself).
When this API call fails, metricbeat reports errors like
2019-07-26T08:11:35.600Z ERROR kubernetes/watcher.go:185 kubernetes: Performing a resource sync err performing request: Get https://10.0.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Daks-nodepool1-36691180-7&resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout for *v1.PodList
and then continues running without recovering. A retry behaviour would make sense here, I believe. It would even be better if metricbeat crashed in this case, so that Kubernetes could restart it. As it is, we have to notice the problem ourselves, go in, and kill the pod manually.
Found in metricbeat version 7.2.0 using the official Docker image.
Hi @rickardp
when you say that metricbeat continues running, I guess you mean that it keeps running as a process, but that the watcher stops and does not retry. Is that the case?
If so, I agree that's very far from the expected behaviour.
Can you please confirm and share your configuration yaml files?
Yes, that is what happened. I will dig out some config files, but in the meantime I can mention that I used almost an exact copy of the Helm stable charts (adding only the ES writer configuration and some processor steps).
I can also mention that I had to kill the metricbeat pod manually to recover.
Edit: Btw, forget about my previous comment about disk space. We had two issues that I mixed up.
Thanks for confirmation.
I'm not familiar with those Helm charts or with the autodiscover code, but I guess that with some basic configuration, making the apiserver unavailable for a minute or so will make it fail and stop reporting metrics.
Let me try that
Here's the config YAML that was in use at the time it failed:
cloud.auth: ${ELASTIC_CLOUD_AUTH}
cloud.id: ${ELASTIC_CLOUD_ID}
logging.level: info
metricbeat.autodiscover:
  providers:
    - host: ${NODE_NAME}
      include_annotations:
        - prometheus.io/scrape
      templates:
        - condition:
            contains:
              kubernetes.annotations.prometheus.io/scrape: "true"
          config:
            - hosts: ${data.host}:${data.kubernetes.annotations.prometheus.io/port}
              metricsets:
                - collector
              module: prometheus
      type: kubernetes
    - host: ${NODE_NAME}
      include_annotations:
        - metricbeat.elasticsearch/scrape
      templates:
        - condition:
            contains:
              kubernetes.annotations.metricbeat.elasticsearch/scrape: "true"
          config:
            - hosts: ${data.host}:9200
              metricsets:
                - node
                - node_stats
                - index
                - index_recovery
                - index_summary
                - shard
                - ml_job
              module: elasticsearch
              period: 30s
      type: kubernetes
metricbeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false
output.elasticsearch:
  hosts:
    - ${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}
  password: ${ELASTICSEARCH_PASSWORD}
  username: ${ELASTICSEARCH_USERNAME}
processors:
  - add_cloud_metadata: {}
  - add_fields:
      fields:
        cluster: ${CLUSTER_NAME}
      target: ""
  - drop_fields:
      fields:
        - kubernetes.labels.app
        - kubernetes.pod._module.labels.app
        - kubernetes.container._module.labels.app
setup.ilm.enabled: auto
setup.ilm.overwrite: true
setup.ilm.pattern: '{now/d}-000001'
setup.ilm.rollover_alias: metricbeat
setup.template.overwrite: true
@rickardp
I wasn't able to reproduce the issue, and looking at the code, when an error occurs the loop continues by creating a new watcher for the resource at the apiserver.
However, the behaviour you mention can happen if reaching the apiserver fails on the first try, which seems to be what you are hitting.
I'm not sure what the stance of Beats or metricbeat is in this case.
It looks like metricbeat tries not to exit as long as it can cope with an issue, unless it is a badly formatted config or Elasticsearch/Kibana being unreachable.
With Kubernetes autodiscover, it looks like if the initial sync fails on the first attempt it gives up and won't retry. Once the initial sync is done, the watch loop is error tolerant, as it respawns a new watcher on failure (roughly the flow sketched below).
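For illustration, a minimal Go sketch of that flow; syncPods and watchPods are hypothetical stand-ins, not the actual beats watcher code. The point is that the initial sync error is returned once and nothing restarts the provider, while errors in the watch loop only respawn the watcher.

    package main

    import (
        "errors"
        "fmt"
        "log"
    )

    // Hypothetical stand-ins for the real apiserver calls; not the actual beats code.
    func syncPods() error  { return errors.New("dial tcp 10.0.0.1:443: i/o timeout") }
    func watchPods() error { return errors.New("watch closed") }

    func runProvider() error {
        // Initial sync: a failure here is terminal for the provider. The beat
        // keeps running, but these metrics silently stop being collected.
        if err := syncPods(); err != nil {
            return fmt.Errorf("performing a resource sync: %v", err)
        }
        // Watch loop: errors after the initial sync only cause a new watcher
        // to be spawned, so transient apiserver failures are tolerated here.
        for {
            if err := watchPods(); err != nil {
                log.Printf("watch failed, respawning watcher: %v", err)
            }
        }
    }

    func main() {
        if err := runProvider(); err != nil {
            log.Printf("kubernetes autodiscover stopped: %v", err)
        }
    }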
@exekias do you know, or know who to ask, whether this is the intended behaviour?
I would say that's the right behavior? If we cannot connect to the API server on the first try, this may be a misconfiguration. We could keep retrying, at the risk of the user not noticing the issue.
On the other hand, if the config is good but the API server is down for whatever reason, the beat will die. But it will probably be rescheduled until the connection works?
not sure if I follow.
If current behaviour is the right one, be aware that #13051 is changing it (I was testing it when it got merged)
In any case, @rickardp, as of the referenced merge your problem should be solved; it should ship in the next release.
You are right! I think we can close this then
Yes, this happened when it couldn't contact the API server the first time. I did not observe what happened when it was successful and later failed, but it is indeed possible it recovers in that case.
Good to hear it's solved then! A full restart of the nodes took care of the i/o timeout issue. The joy of Azure AKS :/
I don't think this is the correct behavior. For example, if you are using Istio, the sidecar may not be fully loaded for a few moments. An immediate request to the Kubernetes API might fail, but the same request succeeds after a short delay.
Can I suggest an exponential backoff, something like the sketch below? That seems to be the strategy for connection failures to Elasticsearch.
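To make the suggestion concrete, a minimal sketch of what I mean, with a hypothetical syncOnce standing in for the initial resource sync (not the actual beats code): retry with an exponentially growing, capped delay instead of giving up after the first failure.

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // retryWithBackoff keeps retrying syncOnce with an exponentially growing,
    // capped delay until it succeeds.
    func retryWithBackoff(syncOnce func() error, maxDelay time.Duration) {
        delay := time.Second
        for attempt := 1; ; attempt++ {
            err := syncOnce()
            if err == nil {
                return
            }
            log.Printf("resource sync attempt %d failed: %v, retrying in %s", attempt, err, delay)
            time.Sleep(delay)
            delay *= 2
            if delay > maxDelay {
                delay = maxDelay
            }
        }
    }

    func main() {
        attempts := 0
        retryWithBackoff(func() error {
            attempts++
            if attempts < 3 {
                // Simulates the apiserver (or an Istio sidecar) not being ready yet.
                return errors.New("dial tcp 10.0.0.1:443: i/o timeout")
            }
            return nil
        }, 30*time.Second)
        log.Printf("initial sync succeeded after %d attempts", attempts)
    }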
Hi @prestonvanloon,
Thanks for the input. We updated the behavior by moving to the official k8s client here: https://github.com/elastic/beats/pull/13051. We do a lazy start now, so any temporary error should make the client retry, with backoff.
@exekias I am seeing the same behavior as @prestonvanloon when using filebeat 7.4 with Istio. Filebeat starts before the sidecar and the k8s client never recovers. As a result, I never get any of the Kubernetes metadata with the container logs. I overrode the entrypoint of the filebeat DaemonSet, prefixing a short sleep, which allowed the client to start successfully:
- command: ["/bin/sh", "-c"]
  args:
    - sleep 10s;
      /usr/local/bin/docker-entrypoint -e -E http.enabled=true;
Do you have any better suggestions to work around this issue? Is the fix described above included in version 7.4.1?
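One option I am considering instead of a fixed sleep is to poll the sidecar readiness endpoint before starting the beat, something like the snippet below. This assumes curl is available in the filebeat image and that the istio-agent readiness endpoint is exposed on localhost:15020/healthz/ready, which may differ per Istio version, so verify both for your setup:

- command: ["/bin/sh", "-c"]
  args:
    - until curl -fsI http://localhost:15020/healthz/ready; do
        echo "waiting for the Istio sidecar"; sleep 1;
      done;
      /usr/local/bin/docker-entrypoint -e -E http.enabled=true;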