Beats: Filebeat errors and 100% CPU with standalone Kafka

Created on 6 Nov 2016 · 9 comments · Source: elastic/beats

Ref: https://discuss.elastic.co/t/filebeats-errors-and-100-cpu-with-stand-alone-kafka/64657

  • Version: filebeat-5.0.0-1 rpm, kafka 0.10.1.0, zookeeper 3.4.9
  • Operating System: CentOS 7
  • Steps to Reproduce:
  • Set up a standalone (single-instance) ZooKeeper and Kafka.
  • Set up Filebeat to write to a Kafka topic, using the defaults in the metadata section.
  • Observe the Filebeat log. It will loop every 250 ms with warnings about the topic being leaderless, repeating more times than the retry.max value allows. The code does not appear to expect a single-node instance, which will have a replication factor of 0 and so will never have a leader value of 1.

To observe the high CPU load, shut down Kafka while Filebeat is running in the above configuration.

The configuration I used sets up the Filebeat Kafka output with:

    metadata:
      retry.max: 3
      retry.backoff: 250ms
      refresh_frequency: 10m

The retry.max counter did count down, but then it looped and started counting down again from 3, at the retry.backoff interval. Increasing retry.backoff slowed the loop, but it still kept looping.

What I think it should do is realise that the leader value is going to be 0 because replicas are 0, and either stop trying or retry at the refresh_frequency interval.
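For reference, the leader and replica assignment the client sees can be checked with the stock Kafka tooling (topic name and ZooKeeper address here are examples):

    bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic filebeat

This prints, per partition, the leader broker, the replica list, and the ISR, so you can confirm whether the partition the client is complaining about actually has a leader.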

Suggest the bug may be around this area:
https://github.com/elastic/beats/blob/3baa352e6fb68cb5ff8abb25e84bb0557c1a5e28/vendor/github.com/Shopify/sarama/client.go#L593
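For illustration, here is a much-simplified sketch of the retry pattern at that spot (my reconstruction, not the vendored sarama source): each failed fetch sleeps for the configured backoff and recurses with one fewer attempt, but nothing stops the caller from re-entering the whole cycle, which would explain the counter resetting to 3.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // tryRefreshMetadata is a simplified stand-in for the sarama function
    // linked above: on failure it logs, sleeps for the backoff, and recurses
    // with one fewer attempt until the budget is exhausted.
    func tryRefreshMetadata(fetch func() error, backoff time.Duration, attemptsRemaining int) error {
        err := fetch()
        if err == nil {
            return nil
        }
        if attemptsRemaining > 0 {
            fmt.Printf("client/metadata retrying after %v... (%d attempts remaining)\n",
                backoff, attemptsRemaining)
            time.Sleep(backoff)
            return tryRefreshMetadata(fetch, backoff, attemptsRemaining-1)
        }
        return err
    }

    func main() {
        leaderless := func() error { return errors.New("found some partitions to be leaderless") }
        // Each new send attempt re-enters the refresh cycle, so the counter
        // appears to reset: 3, 2, 1, then 3, 2, 1 again.
        for i := 0; i < 2; i++ {
            _ = tryRefreshMetadata(leaderless, 250*time.Millisecond, 3)
        }
    }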

Labels: :Outputs, Integrations, bug, libbeat

Most helpful comment

Thanks @joeythelantern

The main problem is that our Filebeat instances are automatically deployed on several production servers.
Since human error is always possible, a configuration mistake can slip into the Filebeat configuration file during deployment.

For example, if the Kafka topic name is misspelled and the "auto-create topic" setting is set to false, it can lead to 100% CPU usage, which is not really acceptable.

All 9 comments

Observed similar error output for winlogbeat v5.1.2 sending to kafka 0.10.0.1.

2017-01-25T09:39:39+02:00 WARN kafka message: Successfully initialized new client
2017-01-25T09:39:39+02:00 WARN client/metadata fetching metadata for [wineventlog] from broker <broker>:9092

2017-01-25T09:39:39+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-01-25T09:39:39+02:00 WARN client/metadata retrying after 250ms... (3 attempts remaining)

Observed similar behavior: 100% CPU usage on the filebeat host.
I use Filebeat 5.1.1 sending to Kafka 0.10.0.2.5 (HDP 2.5).

Here is an extract of the logs:

2017-04-12T11:46:50+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-04-12T11:47:01+02:00 WARN client/metadata fetching metadata for [logstash] from broker mybroker:6667
2017-04-12T11:47:01+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-04-12T11:47:01+02:00 WARN client/metadata retrying after 250ms... (3 attempts remaining)
2017-04-12T11:47:01+02:00 WARN client/metadata fetching metadata for [logstash] from broker mybroker:6667
2017-04-12T11:47:01+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-04-12T11:47:01+02:00 WARN client/metadata retrying after 250ms... (2 attempts remaining)
2017-04-12T11:47:01+02:00 WARN client/metadata fetching metadata for [logstash] from broker mybroker:6667
2017-04-12T11:47:01+02:00 WARN kafka message: client/metadata found some partitions to be leaderless
2017-04-12T11:47:01+02:00 WARN client/metadata retrying after 250ms... (1 attempts remaining)
2017-04-12T11:47:01+02:00 WARN client/metadata fetching metadata for [logstash] from broker mybroker:6667

It's easy to reproduce: just configure your Filebeat to send logs to a topic that has not been created yet. You don't even need to produce logs continuously on the read path; it starts using 100% CPU as soon as it has a single message to send.
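For concreteness, a minimal Filebeat 5.x configuration along those lines (broker address, topic name, and log path are placeholders; the topic must not yet exist on the broker):

    filebeat.prospectors:
    - input_type: log
      paths:
        - /var/log/messages

    output.kafka:
      hosts: ["mybroker:6667"]
      topic: "logstash"
      metadata:
        retry.max: 3
        retry.backoff: 250ms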

Yeah, if Kafka is down, the library used by Beats tries to update its internal metadata every so often. Since Kafka is normally assumed to be unavailable for only a very short period of time, the timeouts are quite low as well. The logstash output, by comparison, retries with exponential backoff.

Have you tried increasing the retry_timeout to 1s?

Thank you for the answer @urso .

I assume that by _retry_timeout_ you mean _retry_backoff_?

We have tried playing with this parameter a bit, and this is what happens:

  • retry_backoff < 3350 ms: one CPU core at ~100% usage
  • retry_backoff >= 3350 ms: no particular CPU usage

It does not seem to be linked to the server size: we have tried it on several servers and the behaviour is the same. The CPU usage blows up below 3350 ms and I don't get why.

I was having a similar issue with both Metricbeat and Filebeat, where I had only one kafka server.

A coworker discovered the cause in my setup: when Filebeat and Metricbeat auto-create their topics in Kafka, they most likely assume a cluster of at least 2 brokers (or a replication factor greater than 1), and you cannot configure the Beats to change their default topic creation settings.

So I set up Kafka first, created my topics manually, all configured for a single Kafka server, and then installed Filebeat and Metricbeat.

Worked like a charm.

Try creating your topics first, before installing Filebeat, and see if that works. Hope this helps.
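For a single-broker setup, manual topic creation with the stock Kafka tooling would look roughly like this (topic names and ZooKeeper address are examples):

    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic filebeat
    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic metricbeat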

Thanks @joeythelantern

The main problem is that our Filebeat instances are automatically deployed on several production servers.
Since human error is always possible, a configuration mistake can slip into the Filebeat configuration file during deployment.

For example, if the Kafka topic name is misspelled and the "auto-create topic" setting is set to false, it can lead to 100% CPU usage, which is not really acceptable.

Is this bug fixed?

One more detail. The circuit breaker might be open when 100% CPU is reached:

2018-03-05T08:48:02+01:00 INFO Error publishing events (retrying): circuit breaker is open
2018-03-05T08:48:02+01:00 INFO Error publishing events (retrying): circuit breaker is open
2018-03-05T08:48:02+01:00 INFO Error publishing events (retrying): circuit breaker is open

There are two circuit breakers in the client: one on metadata updates and one on forwarding events to a partition worker. The metadata-update breaker should be subject to metadata.retry.backoff. The other one might result in a "tight loop" that is not subject to any backoff setting.
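As a toy illustration (not the Beats source) of the difference: a retry loop that sleeps for a backoff on failure stays cheap, while one that retries immediately spins a core at 100%.

    package main

    import (
        "errors"
        "time"
    )

    var errBreakerOpen = errors.New("circuit breaker is open")

    // publishWithRetry keeps retrying until the deadline. With backoff > 0
    // it behaves like the metadata path (bounded by retry.backoff); with
    // backoff == 0 it behaves like the partition-worker path and busy-spins.
    func publishWithRetry(publish func() error, backoff time.Duration, deadline time.Time) {
        for time.Now().Before(deadline) {
            if err := publish(); err != nil {
                if backoff > 0 {
                    time.Sleep(backoff)
                }
                continue // backoff == 0: tight loop, 100% of one core
            }
            return
        }
    }

    func main() {
        failing := func() error { return errBreakerOpen }
        // Cheap: roughly four iterations over one second.
        publishWithRetry(failing, 250*time.Millisecond, time.Now().Add(time.Second))
        // Hot: millions of iterations over the same second.
        publishWithRetry(failing, 0, time.Now().Add(time.Second))
    }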

No solution yet?
