We are using Spring Boot 2.0.3 actuators.
We are using MetricAutoConfiguration to use the Micrometer global CompositeMeterRegistry and we have StatsdMeterRegistry and JmxMeterRegistry both in the classpath. We are seeing a count error in the Spring Boot actuator MetricsEndpoint where the value for a single meter counter is being doubled at: https://github.com/spring-projects/spring-boot/blob/80da9cf5ebdc256c0b743fd0c34c451a368530aa/spring-boot-project/spring-boot-actuator/src/main/java/org/springframework/boot/actuate/metrics/MetricsEndpoint.java#L135
For example, we have a metric for keeping track of 200 HTTP responses and the metric is being added to both StepMeterRegisty objects. We would expect this as the value has been added to the global CompositeMeterRegistry which would be sent to each StepMeterRegistry for publishing. But why is the default behavior in the MetricsEndpoint to sum these values and essentially double the count metric when the metric is the same?
The summing was introduced in https://github.com/spring-projects/spring-boot/commit/bc05352290dccdbd069c9df2a31d24a4ca3b7b18, and is intentional judging by the following portion of the commit message:
Selecting a metric by name (and optionally a set of tags) reports statistics that are the sum of the statistics on all time series containing the name (and tags).
I'll defer to @jkschneider now as he knows way more about metrics than I do.
@wilkinsona That is correct!
Is it a common use case for applications to have multiple StepMeterRegistries? This seems to be the source of the issue.
For example, using CompositeMeterRegistry we are synching the same metric into two different measurement platform systems say, Datadog and JMX. So essentially, these StepMeterRegistry will have the same metric and statistics.
So, what /metric endpoint is doing is getting all the internal registries from CompositeMeterRegistry and querying the same metric name in each internal registry (here its Dtatadog and JMX), and get the same stats from them and summing them up.
The way we understand, the CompositeMeterRegistry aggregates other meter registry and provides a single interface to collect metrics. So, most of the time we will find the same metric and stats in an internal registry.
if (registry instanceof CompositeMeterRegistry) {
((CompositeMeterRegistry) registry).getRegistries()
.forEach((member) -> collectMeters(meters, member, name, tags));
}
else {
meters.addAll(registry.find(name).tags(tags).meters());
}
Thank you, @jkschneider.
I guess I'm missing something here as summing the values from all of the registries doesn't make sense to me. To give a concrete example, if I have an app with two meter registries running in a JVM that has 36 live threads, the metrics endpoint tells me that there are 72 live threads:
$ http :8080/actuator/metrics/jvm.threads.live
HTTP/1.1 200
Content-Disposition: inline;filename=f.txt
Content-Type: application/vnd.spring-boot.actuator.v2+json;charset=UTF-8
Date: Wed, 15 Aug 2018 17:09:16 GMT
Transfer-Encoding: chunked
{
"availableTags": [],
"baseUnit": null,
"description": "The current number of live threads including both daemon and non-daemon threads",
"measurements": [
{
"statistic": "VALUE",
"value": 72.0
}
],
"name": "jvm.threads.live"
}
There's no indication that this is a multiple of the actual number of live threads. I guess there are also scenarios where it may not be an exact multiple. For example, if a thread is started in between querying two registries.
@jkschneider Can you please help us to understand why summing, rather than, say, a first-wins strategy is the right thing to do?
A concrete example to show this usecase
SpringBoot App
@SpringBootApplication
@RestController
public class SampleApplication {
@Autowired
private MeterRegistry meterRegistry;
Counter success;
@PostConstruct
public void init(){
success = Counter.builder("Success").register(meterRegistry);
}
@RequestMapping("/")
public String index() {
success.increment();
return "success";
}
public static void main(String[] args) {
SpringApplication.run(SampleApplication.class, args);
}
}
application.properties
management.metrics.use-global-registry=true
management.endpoints.web.exposure.include=*
management.endpoints.web.base-path=/actuator
management.endpoint.health.show-details=always
management.server.servlet.context-path=/management
# StatsD Micrometer Config (Datadog)
management.metrics.export.statsd.enabled=true
management.metrics.export.statsd.flavor=datadog
management.metrics.export.statsd.host=localhost
management.metrics.export.statsd.max-packet-length=1400
management.metrics.export.statsd.polling-frequency=PT10S
management.metrics.export.statsd.port=8125
management.metrics.export.statsd.publish-unchanged-meters=true
# graphite Micrometer Config
management.metrics.export.graphite.duration-units=seconds
management.metrics.export.graphite.enabled=true
management.metrics.export.graphite.host=localhost
management.metrics.export.graphite.port=2004
management.metrics.export.graphite.protocol=pickled
management.metrics.export.graphite.rate-units=seconds
management.metrics.export.graphite.step=1m
management.metrics.export.graphite.tags-as-prefix=test
management.metrics.binders.files.enabled=true
management.metrics.binders.integration.enabled=true
management.metrics.binders.jvm.enabled=true
management.metrics.binders.logback.enabled=true
management.metrics.binders.processor.enabled=true
management.metrics.binders.uptime.enabled=true
pom.xml
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-statsd</artifactId>
<version>1.0.6</version>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-graphite</artifactId>
<version>1.0.6</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
Now when you hit / endpoint the counter should be incremented to 1 and /actuator/metrics/Success should show it as 1 but its showing as
{
"name": "Success",
"description": null,
"baseUnit": null,
"measurements": [
{
"statistic": "COUNT",
"value": 2
}
],
"availableTags": []
}
Sorry, the intention was to sum across dimensions, not across registries. I would say this is a bug.
@jkschneider @wilkinsona , thank you for the analysis.
We would like to take this opportunity and fix this bug. We just want to discuss the approach for this fix. In my opinion, a first-wins strategy is a right thing to do here(please confirm). I think if we change the map merge function to ignore the second update will fix it (v1, v2) -> v1.
Also, is MAX necessary here?
private BiFunction<Double, Double, Double> mergeFunction(Statistic statistic) {
return Statistic.MAX.equals(statistic) ? Double::max : Double::sum;
}
@jkschneider Could you take a look at the comment above please?
I think if you eliminate summing altogether, you'll also inadvertently eliminate dimensional aggregation. The max is necessary for dimensional aggregation. In other words, we shouldn't sum/max metrics across registries, but should sum/max them across dimensions.
Hi, @jkschneider , thank you for the reply.
A quick follow up question, what do we mean by dimensional aggregation here? A small example will help us getting the logic right.
In other words, we shouldn't sum/max metrics across registries, but should sum/max them across dimensions.
Dimensions: Do you mean same metric name but different tags?
Dimensions: Do you mean same metric name but different tags?
Yes.
Thank @wilkinsona ,
This test-case explains it well. Just to vet my thoughts, lets say we have two internal registeries inside CompositeMeterRegistry with the following metrics
name:: "cache.result.hit" value: 10 tag: "host", "1"
name:: "cache.result.hit" value: 12 tag: "host", "2"
name:: "cache.result.miss", value:5 tag: "host", "1"
name:: "cache.result.miss", value:5 tag: "host", "2"
name:: "cache.result.hit" value: 10 tag: "host", "1"
name:: "cache.result.hit" value: 15 tag: "host", "2" (differs in value)
name:: "cache.result.miss", value:5 tag: "host", "1"
name:: "cache.result.miss", value:5 tag: "host", "2"
name:: "cache.result.hit" value: 25 tag: ["host", "1"] , [ "host", "2"] (Registry2 wins)
name:: "cache.result.miss" value: 10 tag: ["host", "1"] , [ "host", "2"]
Please confirm whether this logic is ok to implement?
CC: @jkschneider
Everything looks good except that I wouldn't bother trying to determine which registry has the greater summed value. Just pick the first registry in the composite every time and forget the other one.
Remember, unlike in Spring Boot 1.x, /actuator/metrics is _just_ a diagnostic tool. It's output shouldn't be used to actually drive decisions. You really should be relying rather on how each registry exposes metrics to its target backend.
@jkschneider , please review.
Closing in favor of PR #14497
Most helpful comment
@jkschneider @wilkinsona , thank you for the analysis.
We would like to take this opportunity and fix this bug. We just want to discuss the approach for this fix. In my opinion, a first-wins strategy is a right thing to do here(please confirm). I think if we change the map merge function to ignore the second update will fix it
(v1, v2) -> v1.Also, is MAX necessary here?