I've an application with a simple Kafka Stream.
I've added latest implementation of KafkaStreamsMetrics.
The loading time of my /prometheus endpoint is very long and consuming a lot of cpu.
I try to troubleshot this and I found this to be really slow :
If I removed the double check, loading time is fine again :
//Double-check if new metrics are registered
checkAndBindMetrics(registry);
With checkAndBindMetrics :
time curl http://kestra:2553/prometheus
real 0m6,209s
user 0m0,015s
sys 0m0,001s
Without checkAndBindMetrics :
time curl http://kestra:2553/prometheus
real 0m0,079s
user 0m0,013s
sys 0m0,013s
Since in Kafka metrics, there is a lot of gauge with a lot of tags (metrics x topic x partition x thread, 300 for my application), I think we must not do this check or used a cache if necessary (didn't see any impact removing this check), this check will fetch all the metrics and use a synchronized methods and will be really slow, since call for every metrics.
Thank you for trying out the new KafkaStreamsMetrics, and sorry for the performance issue. We'll take a look at improving this. /cc @jeqo
Thanks @tchiotludo. Yes, keeping Kafka metrics registry updated has been one of the most challenging parts when implementing this instrumentation. This double check is in place to keep the registry up to date when metrics change (e.g. when sending data to/consuming data from a new topic).
This check is under the assumption that the number of metrics will stabilize at some point (e.g. an application sending data to some topics will register all metrics when sending at least one message to all topics it's suppose to, same for consumer). Although this is not optimal, with current API this is the strategy to keep metrics updated.
Could you share a bit more about your use-case? e.g. how many stream-processors/producers/consumers are you instrumenting? how many topics are been used by clients and often this changes? are you getting this performance penalty on every call or only once?
Hello.
It's a simple stream application with this topology (no other process involve except the stream).
Topologies:
Sub-topology: 0
Source: KSTREAM-SOURCE-0000000000 (topics: [RAW-ESB-RECEIPTSYNC-V1])
--> KSTREAM-PEEK-0000000001
Processor: KSTREAM-PEEK-0000000001 (stores: [])
--> KSTREAM-MAPVALUES-0000000002
<-- KSTREAM-SOURCE-0000000000
Processor: KSTREAM-MAPVALUES-0000000002 (stores: [])
--> KSTREAM-MAPVALUES-0000000003
<-- KSTREAM-PEEK-0000000001
Processor: KSTREAM-MAPVALUES-0000000003 (stores: [])
--> KSTREAM-FILTER-0000000004
<-- KSTREAM-MAPVALUES-0000000002
Processor: KSTREAM-FILTER-0000000004 (stores: [])
--> KSTREAM-TRANSFORM-0000000005
<-- KSTREAM-MAPVALUES-0000000003
Processor: KSTREAM-TRANSFORM-0000000005 (stores: [])
--> KSTREAM-PEEK-0000000006
<-- KSTREAM-FILTER-0000000004
Processor: KSTREAM-PEEK-0000000006 (stores: [])
--> KSTREAM-MAPVALUES-0000000010, KSTREAM-SINK-0000000007
<-- KSTREAM-TRANSFORM-0000000005
Processor: KSTREAM-MAPVALUES-0000000010 (stores: [])
--> KSTREAM-TRANSFORMVALUES-0000000011
<-- KSTREAM-PEEK-0000000006
Processor: KSTREAM-TRANSFORMVALUES-0000000011 (stores: [])
--> KSTREAM-SINK-0000000012
<-- KSTREAM-MAPVALUES-0000000010
Sink: KSTREAM-SINK-0000000007 (topic: RAW-SALES-TICKET-V1)
<-- KSTREAM-PEEK-0000000006
Sink: KSTREAM-SINK-0000000012 (topic: EDM-SALES-CUSTOMERRECEIPT-V1)
<-- KSTREAM-TRANSFORMVALUES-0000000011
Sub-topology: 1 for global store (will not generate tasks)
Source: KSTREAM-SOURCE-0000000008 (topics: [EDM-OFFERS-PRODUCT-V1])
--> KTABLE-SOURCE-0000000009
Processor: KTABLE-SOURCE-0000000009 (stores: [product_join])
--> none
<-- KSTREAM-SOURCE-0000000008
So : 5 topics with 24 partitions.
It's always the same topics, no changed at all on this.
The performance penalty is on all call, not only on first call.
In my app I've 258 gauge generated (x number of tags) = 8910 generated metrics from this kafka stream.
After analyse the checkAndBindMetrics(registry); is called on every collection of the gauge for every gauge, on all hit to the prometheus page (so 8910 called in my case)
since checkAndBindMetrics will collect all metrics and forEach all of them in a synchronized way, it mean that lead to foreach millions of entry at the end of the prometheus call.
thanks @tchiotludo ! this sounds like a bug, will look into it..
@tchiotludo calling checkAndBindMetrics is not the issue but passing by the first validation on every call is:
Map<MetricName, ? extends Metric> metrics = metricsSupplier.get();
if (!currentMeters.equals(metrics.keySet())) { // if this is true on every call then it's a bug
If the set of meter names is different, then it attempts to synchronize values.
I've created a branch with a flag to test this doesn't happen on every call: https://github.com/micrometer-metrics/micrometer/compare/master...jeqo:test-kstreams
Would be possible for you to test against this branch or create a test-case to validate your scenario?
Thanks @jeqo I was just starting looking to find a way to reproduce the issue with a unit test.
(don't have find for now)
There is no changed in performance with that branch.
To be honest, I can understand why changing a simple bool value would change the behavior, or I don't understand the change ?
I've look deeper the checkAndBindMetrics and you are true with this part : // if this is true on every call then it's a bug
At the start of the application, there is many case where where metrics changed and after it more stable and don't enter the loop.
I try to have a profile :

Here we can see that the application if take at least 75% of the time on method :
Map<MetricName, ? extends Metric> metrics = metricsSupplier.get();
since this method is called for every gauge, it's here the bottleneck IMO.
If you look inside this one, for every gauge collect (a lot for a kafka stream and a prometheus integration that displayed all gauge), Kafka client create an hashmap that seems to take a lot of time:

The flamegraph :

IMO, you should find a way to do not call this method for every metric collect, maybe in a background thread ?
There is no changed in performance with that branch.
To be honest, I can understand why changing a simple bool value would change the behavior, or I don't understand the change ?
No change in performance intended, just added a flag and an IT for Kafka Streams to validate things :)
Thanks for the profiling!
IMO, you should find a way to do not call this method for every metric collect, maybe in a background thread ?
Agree, this will be a better way to deal with this.
I made a simple draft PR about that.
Tell me what to you think about.
In my case, it fix performance issue and I seems to update tags properly.
In the unit test, I need to add an extra thread sleep, since new metrics are available only when the background task was running (as I understand).
I've a bigger Kafka Application that have some overhead with this metrics, but seems startup & warmup is very long (many producer / consumer / stream on the same application).
Maybe we have to deal that this no concurrent running of the background tasks ?
This allow us to remove the synchronized (this) on checkAndBindMetrics and simply skip a concurrent access on scheduler.scheduleAtFixedRate runnable ?
Wait for your opinion :thinking:
Forgot the concurrent part, I discovered that no concurrent call is possible :
https://stackoverflow.com/a/35498230/1590168
This does not mean that there will not no concurrency over all scheduled tasks. Rather, for each task (created by invocation of scheduleAtFixedRate), the Runnable only executes on one thread at a time - even if the execution time overruns the interval.
So we can remove the synchonized without risk
@tchiotludo thanks! we can follow up on the PR then.