Feature:
For the Metricbeat Kafka module, it would be nice to:
The issue should probably be a part of https://github.com/elastic/beats/issues/3005
Regarding Metricbeat and the Kafka module: Kafka lag per topic/partition would be the most useful metric for monitoring processing.
I was equally interested in having the consumer lag per partition_id/topic, and I worked to make it happen in my repository --> https://github.com/antonking/GOLANG
that's awesome @antonking, could you open a PR with your changes applied to metricbeat?
@exekias how do you do that? I've no idea what PR means to be honest.
Sorry @antonking PR == Pull Request, I can assist in the process if needed :)
@exekias please do assist and if you need anything from me in the process, kindly let me know. Thx lots
@antonking @exekias I would like this to be a feature of the Kafka Metricbeat module; for us it's the most useful metric. Did you have any joy with the PR?
@Mrclark05 please feel free to turn the feature into a PR, because I've never done that before. Thx
A kafka dashboard has been merged to master (#8457) it includes a visualization for consumer lag. This value is still not available as a single value, but it is calculated using offset values from consumergroup and partition metricsets.
It'd be great if you could give it a try; we plan to release this first dashboard in 6.5.0.
Let us know if the dashboards are not enough.
I've been working with the new Kafka dashboard, and if I'm understanding it correctly, it's not quite right.
The current lag visualization takes the highest offset value from the partition stats and the highest offset value from the consumer group stats. The lag value is generated by taking the difference between the two (or returning zero if the consumergroup offset is higher). We can then group the results by another field, for example, the Kafka topic name. There are still a few issues with this approach, though.
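The calculation just described can be sketched as follows. This is a simplified, illustrative model (plain Python lists stand in for the Elasticsearch Max aggregations over the `partition.offset.newest` and `consumergroup.offset` fields):

```python
# Sketch of the lag value the dashboard derives: the difference between
# the two Max aggregations, floored at zero when the consumer group's
# committed offset is the higher of the two.
def dashboard_lag(partition_newest_offsets, consumergroup_offsets):
    newest = max(partition_newest_offsets)      # Max of partition.offset.newest
    consumed = max(consumergroup_offsets)       # Max of consumergroup.offset
    return max(0, newest - consumed)

# Illustrative sample: newest offset seen is 1000, highest committed
# offset is 940, so the visualization reports a lag of 60.
print(dashboard_lag([980, 1000], [940, 925]))  # 60
```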
If we have multiple clients in a single consumer group, the lag value doesn't represent the true lag of the consumer group. The Max aggregations mean that we may be comparing the consumergroup.offset value from one client against the partition.offset.newest value of a partition which that client is not connecting to.
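A small hypothetical example makes the problem concrete. With two partitions consumed by two different clients (all offsets invented for illustration), the Max-based calculation can compare offsets that belong to different partitions:

```python
# Hypothetical two-partition consumer group. Each partition has its own
# newest offset and committed consumer-group offset.
partitions = {
    0: {"newest": 1000, "committed": 990},  # client A: per-partition lag 10
    1: {"newest": 500,  "committed": 100},  # client B: per-partition lag 400
}

# True lag of the group: sum of the per-partition differences.
true_lag = sum(p["newest"] - p["committed"] for p in partitions.values())

# The Max-aggregation approach compares the highest newest offset
# (partition 0) with the highest committed offset (also partition 0),
# completely hiding partition 1's backlog.
max_lag = (max(p["newest"] for p in partitions.values())
           - max(p["committed"] for p in partitions.values()))

print(true_lag)  # 410
print(max_lag)   # 10
```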
If we have more than one Kafka consumer group accessing the same topic, then there's no meaningful way that I see to associate a lag value to one consumer group or the other without hard-coding in each consumer group. In a large deployment, that can be problematic.
The root of the issue is that we have to compare across data types to get the lag value (the consumergroup metricset vs. the partition metricset). The easiest solution looks to be including the lag value in the consumergroup metricset. Lag can then be associated with any combination of client ID, consumergroup ID, topic, and partition.
@tkinkead thanks a lot for your feedback, I am going to reopen the issue.
> If we have multiple clients in a single consumer group, the lag value doesn't represent the true lag of the consumer group. The Max aggregations mean that we may be comparing the consumergroup.offset value from one client against the partition.offset.newest value of a partition which that client is not connecting to.
Yes, you are right about that. This can serve as a rough overview value if all partitions grow at the same rate, but I am not sure that is always the case.
If you select topic and partition, is the value shown correct?
> If we have more than one Kafka consumer group accessing the same topic, then there's no meaningful way that I see to associate a lag value to one consumer group or the other without hard-coding in each consumer group. In a large deployment, that can be problematic.
Yes, you are right, I am afraid this case won't work. We should probably aggregate per consumer group, not only per topic.
> The root of the issue is that we have to compare across data types to get the lag value (the consumergroup metricset vs. the partition metricset). The easiest solution looks to be including the lag value in the consumergroup metricset. Lag can then be associated with any combination of client ID, consumergroup ID, topic, and partition.
Metricbeat aims to collect mainly metrics of local services. It can also be configured to monitor services on other hosts or in cloud providers, but the intended premise is that Metricbeat doesn't have to connect to other hosts unless needed.
For this case, what we do is collect all the offsets we can from the local brokers, and then calculate the lag at query/visualization time. This is more complicated, but it should work, and it respects the intended premise.
That said, maybe it's worth it in this case to (at least optionally) connect to the partition leader from the consumergroup metricset and calculate the lag directly, gaining simplicity for this important value.
> If you select topic and partition, is the value shown correct?
No, because we have consumers in multiple consumer groups connected to the same topic and partition. If I select topic and partition in the dashboard, I will still only see the consumer group with the highest lag value for that topic and partition, not the lag values for all consumer groups.
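An illustrative sketch of the multiple-consumer-group problem (offsets invented): two groups reading the same topic/partition have genuinely different lags, but any single aggregation over the committed offsets can only yield one number for that topic/partition:

```python
# Two hypothetical consumer groups committing offsets on the same
# topic/partition. Each group has its own real lag.
newest = 1000
group_offsets = {"billing": 990, "analytics": 400}

per_group_lag = {g: newest - off for g, off in group_offsets.items()}
print(per_group_lag)  # {'billing': 10, 'analytics': 600}

# A single Max over the committed offsets collapses both groups into one
# lag value, so one of the two groups' lags is invisible in the dashboard.
print(newest - max(group_offsets.values()))  # 10
```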
> Metricbeat aims to collect mainly metrics of local services. It can also be configured to monitor services on other hosts or in cloud providers, but the intended premise is that Metricbeat doesn't have to connect to other hosts unless needed.
That's understandable. The best way to get the lag value is to pull it from the ZooKeeper API directly, rather than connecting to partition leaders, collecting offsets, and then doing the calculations. It's possible that this simply isn't the correct way to do this. Perhaps lag and consumer group data should be built into a kafka metricset on the zookeeper module instead?
Hey @tkinkead !
We have introduced consumer_lag metric in #14822 (the difference between the partition offset and consumer offset) and we updated the dashboard so as to make use of this new metric in #14863.
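The essence of the change is that the subtraction now happens at collection time, so lag arrives as a single value keyed by group, topic, and partition, and no cross-metricset join is needed at query time. A minimal sketch (function name and sample data are illustrative, not the actual Beats implementation):

```python
# consumer_lag as described above: partition offset minus the consumer
# group's committed offset, reported per (group, topic, partition).
def consumer_lag(partition_offset, consumer_offset):
    return partition_offset - consumer_offset

# Hypothetical samples; note the two groups on the same topic/partition
# now each get their own lag value.
samples = [
    # (group,     topic,    partition, partition_offset, consumer_offset)
    ("billing",   "orders", 0,         1000,             990),
    ("analytics", "orders", 0,         1000,             400),
]
for group, topic, part, p_off, c_off in samples:
    print(group, topic, part, consumer_lag(p_off, c_off))
```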
Let us know what you think about this, and if you agree we can close this issue.
Thanks!
Hi, can I ask how Metricbeat gets its offsets, since from Kafka 2.1.x onwards offsets are stored in the Kafka brokers? Does it use ZooKeeper or the individual brokers?
Hi @Lumotheninja, consumer offsets are retrieved from the brokers.
Right, just wanted to raise that question since @tkinkead suggested ZooKeeper. +1 on the PR.
Closing for now. Please reopen if the issue is not solved.