Telegraf: Add Monitoring for Kafka Topics

Created on 20 Jul 2018 · 8 comments · Source: influxdata/telegraf

Feature Request

Opening a feature request kicks off a discussion.

Proposal:

Add an input plugin to monitor Kafka topics and their respective partitions. The user configures the broker addresses and topics to be monitored. For each partition of the topic the following metrics should be gathered:

  • replica count
  • in-sync replica count
  • out-of-sync replica count (difference of the two above)
  • offset and timestamp of the oldest retained message
  • offset and timestamp of the newest retained message
  • offset difference and duration between oldest and newest message

Additionally, consumer groups could be configured to allow monitoring the following metrics per partition:

  • consumer offset
  • offset lag (difference between the newest offset and the consumer offset)

Each measurement should be tagged with topic, partition and consumer group, if applicable.
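To make the proposal concrete, a configuration for the plugin might look like the sketch below. The plugin name and all option names are purely illustrative; no such input exists in Telegraf yet.

```toml
# Hypothetical configuration for the proposed input plugin.
[[inputs.kafka_topics]]
  ## Broker addresses to query for topic metadata and offsets.
  brokers = ["broker-1:9092", "broker-2:9092"]
  ## Topics whose partitions should be monitored.
  topics = ["orders", "payments"]
  ## Optional: also gather consumer offset and offset lag for these groups.
  consumer_groups = ["billing"]
```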

Current behavior:

Using the scripts and classes shipped with Kafka, these measurements can be created with the exec plugin (perhaps by leveraging jq). However, this solution starts Java processes, and the associated JVMs, on every gathering run. With more knowledge about the clients, the Jolokia plugin might be used to extract similar metrics via JMX.
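As a rough illustration of this workaround, the exec input could invoke a wrapper script around Kafka's GetOffsetShell tool. The script path and its output handling are hypothetical; the tool itself prints "topic:partition:offset" lines, which the wrapper would have to reformat (e.g. with jq or awk) into a format Telegraf understands.

```toml
[[inputs.exec]]
  ## Hypothetical wrapper script that runs e.g.
  ##   kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic my-topic --time -1
  ## and converts its "topic:partition:offset" output into InfluxDB line protocol.
  commands = ["/usr/local/bin/kafka_topic_offsets.sh"]
  ## Generous timeout, since each run starts a fresh JVM.
  timeout = "15s"
  data_format = "influx"
```

This shows the drawback noted above: every gather interval pays the cost of spawning a JVM.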

Desired behavior:

Have a dedicated input plugin as part of Telegraf to acquire the described measurements.

Use case:

The proposed plugin allows monitoring Kafka topics in a producer- and consumer-agnostic way. It helps with analysing data flow through a Kafka cluster.

Labels: area/kafka, feature request

All 8 comments

Is that something that could be added to the current kafka_consumer plugin, or is this something different?

You may be interested in trying the burrow input. I believe Burrow gathers most, if not all, of these metrics. Let us know if this seems like a good solution for you.
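For reference, enabling the burrow input is a small configuration change; the sketch below assumes a Burrow instance already running on its default HTTP port, which may differ in your setup.

```toml
# Minimal sketch of Telegraf's burrow input, assuming Burrow's HTTP
# endpoint is reachable at localhost:8000 (adjust to your deployment).
[[inputs.burrow]]
  servers = ["http://localhost:8000"]
```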

I will give Burrow a try. And come back here.

We had a closer look at Burrow. We really like the effort it puts into providing meaningful measurements for consumer groups, especially the offset lag. Of the metrics mentioned in the initial comment, we found the following support:

  • replica count -> not supported, but partition status reported
  • in-sync replica count -> not supported, but partition status reported
  • out-of-sync replica count -> not supported, but partition status reported
  • offset and timestamp of the oldest retained message -> supported for interval stored in Burrow
  • offset and timestamp of the newest retained message -> supported for interval stored in Burrow
  • offset difference and duration between oldest and newest message -> unsupported

  • consumer offset -> supported

  • offset lag -> full support via custom model

In short, Burrow provides a powerful model for monitoring Kafka consumer groups. This use case is well covered by Burrow and the respective plugin.

For monitoring the overall topic state, we see some deficits. Basically we are interested in three things:

  1. Is the partition properly replicated?
  2. What is the number of messages posted to the partition?
  3. What is the actual retention period of the partition?

The first question (partition replication) allows malfunctions to be detected and aids mitigation of issues. Knowing this number on a partition level also helps to find affected data or customers more easily. As far as we understand, Burrow supports this use case with a partition state, but not with the concrete numbers.

The second question (number of posted messages) can be answered by the derivative of the partition end offset. This number can be used to repartition or reassign topics according to their load. Burrow reports the end offset, so it serves this use case. The originally stated aim of also knowing the start offsets and timestamps is unsupported.

The third question (retention period) requires the timestamps of the earliest and latest message in each partition. This is useful when a size-based retention configuration is used but the resulting retention period is still of interest. Burrow has no support for this: getting the earliest offset or timestamp of a partition requires polling that message, which Burrow does not do.

The last part of our investigation was a comparison of architectures. Burrow is an external program that queries the cluster; it is deployed independently of the Kafka cluster and has its own downtime. We, on the other hand, monitor our Kafka deployment with Telegraf co-deployed on each Kafka broker. Having a dedicated plugin for topic monitoring allows for the same deployment cycle as the Kafka cluster nodes.

Running a dedicated Burrow node adds to the deployment footprint. For us this would be unproblematic, since we already run a Kafka management node with Kafka Manager and Cruise Control; adding Burrow means little overhead. But this may be different for other users.

Maybe in a co-located monitoring use case it would be nice to limit topic monitoring to the partitions for which the current broker is the leader. This would also show the leader distribution.

Do you know if any of these remaining pieces are exposed as JMX metrics? The example jolokia2 config for kafka might be helpful for getting started with this plugin.

We do monitor Kafka with JMX and Jolokia. I did not find these metrics there, but I will check again whether I overlooked something. I will get back and report my findings.

+1

We took a deep dive into the JMX metrics exposed by the Kafka broker. We found MBeans exposing the state of partitions at "kafka.cluster:type=Partition,name=BEAN,topic=TOPIC_NAME,partition=PARTITION_ID". The provided values for BEAN are:

  • InSyncReplicasCount
  • LastStableOffsetLag
  • ReplicasCount
  • UnderMinIsr
  • UnderReplicated

Furthermore, metrics on the stored data are available at "kafka.log:type=Log,name=BEAN,topic=TOPIC_NAME,partition=PARTITION_ID". The provided values for BEAN are:

  • LogEndOffset
  • LogStartOffset
  • NumLogSegments
  • Size

Finally, the broker provides additional metrics on the topics at "kafka.server:type=BrokerTopicMetrics,name=BEAN,topic=TOPIC_NAME". The available metrics include rates for incoming bytes and messages.
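These MBeans could be scraped with the jolokia2_agent input mentioned earlier. The sketch below assumes a Jolokia agent is attached to the broker and reachable at the given URL; the measurement name is our own choice.

```toml
# Sketch: collecting per-partition log offsets via Jolokia, assuming a
# Jolokia agent on the broker at localhost:8080.
[[inputs.jolokia2_agent]]
  urls = ["http://localhost:8080/jolokia"]

  [[inputs.jolokia2_agent.metric]]
    name     = "kafka_partition"
    mbean    = "kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*"
    tag_keys = ["topic", "partition"]
```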

Using this information we can answer the first two questions posted above:

  1. Is the partition properly replicated?
  2. What is the number of messages posted to the partition?

However, the third question:

  3. What is the actual retention period of the partition?

cannot be answered by either Burrow or JMX.

On a further note: while monitoring Kafka via JMX is possible, it requires a lot of configuration and knowledge about the MBeans. That may not be so bad, since some effort is needed to understand the acquired information anyway, but it makes for a steep learning curve when setting up Kafka monitoring with Telegraf.
