Ref: https://discuss.elastic.co/t/filebeats-errors-and-100-cpu-with-stand-alone-kafka/64657
To observe the high CPU load, shut down Kafka while Filebeat is running with the configuration below.
The configuration I used sets up Filebeat's Kafka output with:
metadata:
  retry.max: 3
  retry.backoff: 250ms
  refresh_frequency: 10m
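For context, this is roughly how those settings sit inside the Kafka output section of filebeat.yml; the broker address and topic name here are placeholders, not the ones from my setup:

    output.kafka:
      hosts: ["kafka-host:9092"]
      topic: "filebeat"
      metadata:
        retry.max: 3
        retry.backoff: 250ms
        refresh_frequency: 10m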
The retry.max did count down, but then it looped and started counting down again from 3, at the retry.backoff interval. Increasing retry.backoff slowed the loop, but it still kept looping.
What I think it should do is realise that the leader value is going to be 0, because replicas are 0, and either stop trying or retry only at the refresh_frequency.
I suggest the bug may be around this area:
https://github.com/elastic/beats/blob/3baa352e6fb68cb5ff8abb25e84bb0557c1a5e28/vendor/github.com/Shopify/sarama/client.go#L593
I observed similar error output with Winlogbeat v5.1.2 sending to Kafka 0.10.0.1.
2017-01-25T09:39:39+02:00 WARN kafka message: Successfully initialized new client
2017-01-25T09:39:39+02:00 WARN client/metadata fetching metadata for [wineventlog] from broker <broker>:9092
2017-01-25T09:39:39+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-01-25T09:39:39+02:00 WARN client/metadata retrying after 250ms... (3 attempts remaining)
Observed similar behavior: 100% CPU usage on the Filebeat host.
I use Filebeat 5.1.1 sending to Kafka 0.10.0.2.5 (HDP 2.5).
Here is an extract of the logs:
2017-04-12T11:46:50+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-04-12T11:47:01+02:00 WARN client/metadata fetching metadata for [logstash] from broker mybroker:6667
2017-04-12T11:47:01+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-04-12T11:47:01+02:00 WARN client/metadata retrying after 250ms... (3 attempts remaining)
2017-04-12T11:47:01+02:00 WARN client/metadata fetching metadata for [logstash] from broker mybroker:6667
2017-04-12T11:47:01+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-04-12T11:47:01+02:00 WARN client/metadata retrying after 250ms... (2 attempts remaining)
2017-04-12T11:47:01+02:00 WARN client/metadata fetching metadata for [logstash] from broker mybroker:6667
2017-04-12T11:47:01+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-04-12T11:47:01+02:00 WARN client/metadata retrying after 250ms... (1 attempts remaining)
2017-04-12T11:47:01+02:00 WARN client/metadata fetching metadata for [logstash] from broker mybroker:6667
It's easy to reproduce: you just have to configure your Filebeat to send logs to a topic which has not been created yet. You don't even need to produce logs continuously on the read path; it starts using 100% CPU as soon as it has a single message to send.
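A minimal sketch of a configuration that reproduces this, assuming automatic topic creation is disabled on the broker (auto.create.topics.enable=false in server.properties) and the topic named below does not exist; the host and topic name are placeholders:

    output.kafka:
      hosts: ["mybroker:6667"]
      topic: "topic-that-does-not-exist"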
Yeah, if Kafka is down, the library used by Beats tries to update its internal metadata every so often. As Kafka is normally assumed to be unavailable for only a very short period of time, the timeouts are quite low as well. The Logstash output, for example, retries with exponential backoff.
Have you tried increasing the retry_timeout to 1s?
Thank you for the answer, @urso.
I assume that by _retry_timeout_ you mean _retry_backoff_?
We have tried to play a bit with this parameter and this is what happens:
It does not seem to be linked to the server size: we have tried it on several servers and the behaviour is the same. The CPU usage blows up at 3350 ms and I don't get why.
I was having a similar issue with both Metricbeat and Filebeat, where I had only one Kafka server.
The cause in my setup was discovered by a coworker: when Filebeat and Metricbeat create their topics in Kafka, the Beats most likely assume a cluster of at least 2 brokers (i.e. more than 1 replica), and you cannot configure them to change their default topic creation settings.
So I set up Kafka first, created my topics manually, all configured for a single Kafka server, and then installed Filebeat and Metricbeat.
Worked like a charm.
Try creating your topics first, before installing Filebeat, and see if that works. Hope this helps.
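For reference, on a single-broker Kafka 0.10 install, manually creating a topic looks roughly like this; the ZooKeeper address and topic name are placeholders for your own values:

    bin/kafka-topics.sh --create --zookeeper localhost:2181 \
      --replication-factor 1 --partitions 1 --topic filebeat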
Thanks @joeythelantern
The main problem is that our Filebeats are automatically deployed on several production servers.
Since the error is human, it's possible to introduce a configuration issue during the deployment of the Filebeat configuration file.
For example, if the Kafka topic name is misspelled and the "auto-create topic" setting is set to false, it can lead to 100% CPU usage, which is not really acceptable.
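The broker-side switch referred to here is, I believe, Kafka's automatic topic creation setting in server.properties:

    auto.create.topics.enable=false

With it disabled, a misspelled topic name means the topic never exists, so the leaderless-partition metadata retries shown above never resolve.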
Is this bug fixed?
One more detail. The circuit breaker might be open when 100% CPU is reached:
2018-03-05T08:48:02+01:00 INFO Error publishing events (retrying): circuit breaker is open
2018-03-05T08:48:02+01:00 INFO Error publishing events (retrying): circuit breaker is open
2018-03-05T08:48:02+01:00 INFO Error publishing events (retrying): circuit breaker is open
There are 2 circuit breakers in the client: one on metadata updates and one on forwarding events to a partition worker. The metadata update breaker should be subject to metadata.retry.backoff. The other one might result in a "tight loop" that is not subject to any backoff setting.
No solution yet?