I was attempting to use the new Kafka meter binders introduced in 1.4.0 and ran into a problem. The application has multiple Kafka producers and consumers, and uses a MeterFilter to apply common tags to all meters. The behavior I'm seeing is that meters only get registered for _one_ consumer and _one_ producer.
I traced the problem to checkAndBindMetrics, specifically where the code removes/ignores meters with "less tags". The first Kafka client works fine, but all meters for any subsequent client will not be registered because they won't yet have the common tags applied by the MeterFilter, thus will have "less tags".
When I comment out this code, all client meters are registered as expected:
//Kafka has metrics with lower number of tags (e.g. with/without topic or partition tag)
//Remove meters with lower number of tags
// boolean hasLessTags = false;
// for (Meter other : registry.find(meterName).meters()) {
// List<Tag> tags = other.getId().getTags();
// if (tags.size() < meterTags.size()) registry.remove(other);
// // Check if already exists
// else if (tags.size() == meterTags.size())
// if (tags.equals(meterTags)) return;
// else break;
// else hasLessTags = true;
// }
// if (hasLessTags) return;
I'm sure this code was added for good reason, I'm just not clear on why meters with less tags need to be removed, so I'm hesitant to submit this in a PR.
/cc @jeqo for his thoughts.
Slight tangent: in the course of this investigation I read through several issues & PRs and just wanted to say that I recognize that Kafka client metrics are complicated, and that I appreciate the effort being put forth in this project to try to get it as right as possible -- adding support for producer and streams clients, and getting away from JMX. I'm also following PR #1173 re: Kafka client binders using a Kafka metric reporter.
Thank you for trying out the new Kafka metrics implementation and sorry you're having an issue with it.
The reason for avoiding registering meters with less tags is because the Kafka client will return metrics e.g. per partition and overall. This does not play well with PrometheusMeterRegistry due to having the same name but a different set of tags. If we register the fine-grained meters and not the summary one, we do not lose anything as a summary can be made by aggregating the more fine-grained meters. Hence we opt to register the more fine-grained meters (those with more tags, e.g. partition).
@snagafritz thanks! @shakuzen is right on why we are removing metrics with less tags, but I think you're right about the issue with MeterFilter. There are a couple of issues as I can see:
If registry injects common tags that are not observed by KafkaMetrics then validating less tags will fail, as registry is shared between binders. @shakuzen is there a way to get this tags available via registry?
As registry is shared between clients, one client meters are compared with others. One way to solve this would be to filter out meters based on client-id when comparing, this way validation only happens between meters from the same client.
@snagafritz I created #1978 as an attempt to fix the second issue, only considering meters from same client when filtering out based on client-id and kafka-version.
Still not sure how to approach the first one, when tags registered via MeterFilter tho
Let me know if it fixes partially your issue.
is there a way to get this tags available via registry?
Just to answer the question: Common tags are in the end a MeterFilter, so we don't keep track of them or make a way to retrieve the configured common tags. I think there was a proposal at some point to make that available.
It could be hacked around by registering a dummy meter without any tags, checking what tags the Id ends up with, and then removing the dummy meter. If we knew a meter that would be registered and its (non-common) tags, we could use that too, but I guess we don't have a guarantee of that.
@snagafritz Please try out the 1.4.2-SNAPSHOT version if you have a chance, to check that this is fixed for you before we release it.
Will do, asap.
Confirmed this has addressed the original issue. Thank you @shakuzen and @jeqo !
Thanks for the confirmation