Micrometer: Register Kafka "records-lag" metric per topic-partition

Created on 26 Sep 2018  ·  17 Comments  ·  Source: micrometer-metrics/micrometer

Affects: master version of micrometer-core (SNAPSHOT-1.1.0-20180921.212031-73).

This is in a way related to #877.

I've been trying to get the "records-lag" metric (or "records-lag-max" would have been fine too) per topic-partition to show up in /actuator/prometheus/.
I'm not sure whether the current KafkaConsumerMetrics is expected to register it. KafkaConsumerMetrics.java#L236 hints that it is interested in per-topic metrics, but currently it does not register them. I think this is a very valuable metric for applications consuming more than 1 topic.

After digging around in the code, I believe there are some issues with how the current implementation of KafkaConsumerMetrics registers the metrics.

  1. It tries to register all metrics with all combinations of tags, instead of only using the combinations that actually exist.
    At KafkaConsumerMetrics.java#L201 the JMX NotificationListener is called first with only client-id (many times actually, but maybe that's a kafka-clients issue), then with [client-id, topic], then with [client-id, topic, partition], and each time it goes through the whole list at KafkaConsumerMetrics.java#L76.

For the list of metric-tags combinations actually coming from Kafka via JMX, see:
kafka FetcherMetricsRegistry.java#L66
kafka Fetcher.java#L1326

  2. It does add the "topic" tag, but not the "partition" tag, whereas e.g. "records-lag-max" is only exported by Kafka per topic-partition, not per topic. So it's not clear how meaningful "records-lag-max" per topic will be even if it does get added to the Prometheus registry.
    nameTag(), KafkaConsumerMetrics.java#L240

I believe the way to address this is for registerMetricsEventually to have 3 callbacks, one for each combination of tags coming from Kafka:

  • client-id
  • client-id, topic
  • client-id, topic, partition

Each callback will then only register metrics defined for that tags combination.
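As a rough illustration of the dispatch step this proposal needs, the sketch below classifies a JMX ObjectName by which of the three tag combinations it carries. It assumes, as described above, that the tags arrive as ObjectName key properties named client-id, topic and partition. `TagLevels` and `levelOf` are hypothetical names for illustration only, not part of Micrometer or kafka-clients.

```java
import java.util.Set;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper for illustration; not part of Micrometer or kafka-clients.
final class TagLevels {
    enum Level { CLIENT, TOPIC, PARTITION, UNKNOWN }

    // Decides which tag combination an MBean name carries, so that a caller
    // could register only the metrics defined for that combination.
    static Level levelOf(String objectName) {
        final ObjectName name;
        try {
            name = new ObjectName(objectName);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Not a valid JMX ObjectName: " + objectName, e);
        }
        // Assumption: tags show up as key properties with exactly these names.
        Set<String> keys = name.getKeyPropertyList().keySet();
        boolean client = keys.contains("client-id");
        boolean topic = keys.contains("topic");
        boolean partition = keys.contains("partition");
        if (client && topic && partition) return Level.PARTITION;
        if (client && topic) return Level.TOPIC;
        if (client && !partition) return Level.CLIENT;
        return Level.UNKNOWN; // anything else is not one of the three combinations
    }

    public static void main(String[] args) {
        String base = "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1";
        System.out.println(levelOf(base));
        System.out.println(levelOf(base + ",topic=orders"));
        System.out.println(levelOf(base + ",topic=orders,partition=0"));
    }
}
```

A listener could switch on the returned level and walk only the metric list that belongs to it, rather than the whole list on every notification.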

If KafkaConsumerMetrics wants to support "records-lag-max" on 2 levels, then #877 needs to be solved too.

Another approach is to add it (or "records-lag") on topic-partition level only, not on client-id level, and then aggregate in Prometheus if desired. Personally I've implemented a MeterBinder to do just this.

bug

All 17 comments

I'm no Kafka expert, so I could really use some help in the form of additional tests that demonstrate what you're talking about.

On (1), when I test it with KafkaConsumerMetricsTest, I only ever see the records-lag-max meter created with the client tag. I don't see a second iteration adding topic or a third with partition. What am I doing wrong?

On (2), again, I only see client-id metadata in JMX, and nothing about partition. What am I doing wrong?

image

@jkschneider I don't see any per-partition consumer metrics exposed by Kafka via the MBeans. Please refer to https://docs.confluent.io/current/kafka/monitoring.html
All the consumer-fetch-manager-metrics are per consumer thread.
Fetch Metrics
```
MBean: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
```

@wardhapk I think we are saying the same thing. I also don't see per-partition consumer metrics, and Micrometer isn't attempting to tag on partition.

@jkschneider I just figured out that newer versions of kafka-clients expose a per-partition metric, records-lag, in addition to records-lag-avg and records-lag-max (tested with version 1.0.2).

image

    /***** Partition level *****/
    this.partitionRecordsLag = new MetricNameTemplate("{topic}-{partition}.records-lag", groupName,
            "The latest lag of the partition", tags);
    this.partitionRecordsLagMax = new MetricNameTemplate("{topic}-{partition}.records-lag-max", groupName,
            "The max lag of the partition", tags);
    this.partitionRecordsLagAvg = new MetricNameTemplate("{topic}-{partition}.records-lag-avg", groupName,
            "The average lag of the partition", tags);

Ah, so you are saying the partition is embedded in the MBean _name_. Now I think I understand.
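To make the "partition embedded in the name" point concrete, here is a minimal sketch that splits a name of the form `{topic}-{partition}.records-lag` back into topic, partition and metric. It assumes the metric suffix itself contains no dot (true for the three templates quoted above) and that the partition is the text after the last hyphen before that dot, since topic names may themselves contain hyphens and dots. `PartitionMetricName` is a hypothetical class, not something Kafka or Micrometer provides.

```java
// Hypothetical parser for illustration; not part of Kafka or Micrometer.
final class PartitionMetricName {
    final String topic;
    final int partition;
    final String metric;

    private PartitionMetricName(String topic, int partition, String metric) {
        this.topic = topic;
        this.partition = partition;
        this.metric = metric;
    }

    // Parses names shaped like "{topic}-{partition}.{metric}".
    static PartitionMetricName parse(String attributeName) {
        // The metric suffix is assumed to contain no '.', so the last dot splits it off.
        int dot = attributeName.lastIndexOf('.');
        if (dot <= 0 || dot == attributeName.length() - 1) {
            throw new IllegalArgumentException("Expected {topic}-{partition}.{metric}: " + attributeName);
        }
        String topicAndPartition = attributeName.substring(0, dot);
        // Topics may contain '-', so the partition is whatever follows the last one.
        int dash = topicAndPartition.lastIndexOf('-');
        if (dash <= 0 || dash == topicAndPartition.length() - 1) {
            throw new IllegalArgumentException("Expected {topic}-{partition}.{metric}: " + attributeName);
        }
        // A non-numeric partition raises NumberFormatException (an IllegalArgumentException).
        int partition = Integer.parseInt(topicAndPartition.substring(dash + 1));
        return new PartitionMetricName(
                topicAndPartition.substring(0, dash), partition, attributeName.substring(dot + 1));
    }

    public static void main(String[] args) {
        PartitionMetricName parsed = parse("my-topic-0.records-lag");
        System.out.println(parsed.topic + " / " + parsed.partition + " / " + parsed.metric);
    }
}
```

Client-level names such as `records-lag-max` have no `{topic}-{partition}.` prefix and are rejected, which gives a caller a simple way to tell the two levels apart.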

If possible it would also be great to have the consumer group as an additional tag in the metrics.

I took a stab at correcting the reason why some Kafka metrics were reporting NaN in e9b3efe, but am still pretty confused about exactly what is coming out of Kafka. I get MBean notifications for partition-level metrics but they always seem to be NaN and I don't see them in jvisualvm. Totally confused about what is going on. Anybody else willing to take a shot at this?

I can give it a shot

Hi @jkschneider

We updated to version 1.1.1 of Micrometer. It is better than before but still not correct. We now get metrics per consumer, but since a consumer can have multiple topics/partitions assigned, the figures are still not correct. It only takes the first assigned topic/partition of a consumer.

Example:

A consumer works on 2 topic partitions. The lag on the first partition is 10 and the lag on the second partition is 20. The reported value for that consumer is 10 and not 30.

Actually, the values should also not be aggregated but reported independently. To be able to do this, the topic name and partition number must be added as additional labels. Then the reporting works as expected.

Regards, Lars
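The two-partition example above (lag 10 and lag 20) can be reproduced with a toy model. This is a simplified sketch, not Micrometer code: it assumes a registry behaves like a map from "name plus labels" to a value, where the first registration for a given key wins. `LagSeriesDemo` and `Reading` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model for illustration only; this is not how Micrometer is implemented.
final class LagSeriesDemo {
    static final class Reading {
        final String topic;
        final int partition;
        final double lag;

        Reading(String topic, int partition, double lag) {
            this.topic = topic;
            this.partition = partition;
            this.lag = lag;
        }
    }

    // Registers one series per distinct label set; the first value for a key wins.
    static Map<String, Double> register(String clientId, List<Reading> readings, boolean tagTopicPartition) {
        Map<String, Double> series = new LinkedHashMap<>();
        for (Reading r : readings) {
            String labels = "client-id=" + clientId;
            if (tagTopicPartition) {
                labels += ",topic=" + r.topic + ",partition=" + r.partition;
            }
            series.putIfAbsent("records-lag{" + labels + "}", r.lag);
        }
        return series;
    }

    public static void main(String[] args) {
        List<Reading> readings = Arrays.asList(
                new Reading("orders", 0, 10), new Reading("orders", 1, 20));
        System.out.println(register("c1", readings, false)); // one series, second partition lost
        System.out.println(register("c1", readings, true));  // one series per topic-partition
    }
}
```

With only the client-id label, the second partition collides with the first and its value never appears. With topic and partition labels, both values are reported independently and can be summed or maxed downstream.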

@jkschneider
Here is a snapshot of jconsole showing an MBean while a consumer application is running:

screenshot_jconsole

As you can see the consumer-fetch-manager-metrics has metrics per partition of a topic in case the consumer is assigned to more than one partition. These should be reported with micrometer.
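For anyone who wants to check the same thing without jconsole, the following sketch lists the attribute names of every MBean matching a pattern, using only the JDK's JMX API. It assumes it runs inside the same JVM as the consumer, since it reads the platform MBean server. `MBeanAttributeLister` is a hypothetical helper name; the `kafka.consumer:type=consumer-fetch-manager-metrics` domain and type are taken from the MBean name quoted earlier in this thread.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical debugging helper; uses only standard JDK JMX calls.
final class MBeanAttributeLister {
    // Returns attribute names per MBean for every MBean matching the ObjectName pattern.
    static Map<String, List<String>> attributesByMBean(String pattern) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, List<String>> result = new TreeMap<>();
        try {
            for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
                List<String> attributes = new ArrayList<>();
                for (MBeanAttributeInfo info : server.getMBeanInfo(name).getAttributes()) {
                    attributes.add(info.getName());
                }
                result.put(name.getCanonicalName(), attributes);
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX lookup failed for " + pattern, e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints nothing unless a Kafka consumer is running in this JVM.
        attributesByMBean("kafka.consumer:type=consumer-fetch-manager-metrics,*")
                .forEach((mbean, attributes) -> System.out.println(mbean + " -> " + attributes));
    }
}
```

Calling this after the consumer has been assigned its partitions should show whether the per-partition attributes are present on the MBean at that point.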

One more comment: The metric "records-lag" that represents the latest lag of a consumer for a certain topic and partition is the relevant metric. The "records-lag-max" can also be calculated downstream by the corresponding monitoring solution (like prometheus), although it does not hurt if it is also exposed.

> If possible it would also be great to have the consumer group as an additional tag in the metrics.

The consumer group is not exposed via the kafka MBeans metrics. One option might be to add the consumer group to your client id maybe?
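A minimal sketch of that workaround, assuming the consumer is configured through a plain `Properties` object: `group.id` and `client.id` are the standard Kafka consumer configuration keys, while the helper name and the `group-instance` naming scheme are only assumptions for illustration.

```java
import java.util.Properties;

// Hypothetical helper; the "<group>-<instance>" naming scheme is an assumption.
final class ClientIdWithGroup {
    // Builds consumer properties whose client.id carries the consumer group name,
    // so the group ends up inside the client-id tag of the exported metrics.
    static Properties consumerProps(String groupId, String instance) {
        Properties props = new Properties();
        props.setProperty("group.id", groupId);
        props.setProperty("client.id", groupId + "-" + instance);
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps("billing", "consumer-1").getProperty("client.id"));
    }
}
```

The group then has to be recovered from the client-id downstream (for example by relabeling), so a delimiter that cannot occur in the group name is the safer choice.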

As mentioned, this is addressed in #1136. However, I left a comment there, since we still see the same error as before and could not get this working even with 1.1.2, which should have fixed this issue.

See #1136: We will provide a pull request to fix this.

This should be fixed by #1166. Can someone try the 1.1.3 snapshots and confirm?

@shakuzen: Since I provided the mentioned pull request because of this ticket here (which I also faced, as you can see in the discussion further up) I can confirm that it works. If you need somebody else, maybe @fkoehler can confirm this as well.

Thanks @larsduelfer for the fix and the confirmation. I'll close this then. We're planning on releasing 1.1.3 this week with the fix. If anyone finds any additional issue, let us know.
