Spring-boot: Metric endpoint is summing up same metric

Created on 15 Aug 2018 · 16Comments · Source: spring-projects/spring-boot

We are using Spring Boot 2.0.3 actuators.

We are using MetricAutoConfiguration to use the Micrometer global CompositeMeterRegistry and we have StatsdMeterRegistry and JmxMeterRegistry both in the classpath. We are seeing a count error in the Spring Boot actuator MetricsEndpoint where the value for a single meter counter is being doubled at: https://github.com/spring-projects/spring-boot/blob/80da9cf5ebdc256c0b743fd0c34c451a368530aa/spring-boot-project/spring-boot-actuator/src/main/java/org/springframework/boot/actuate/metrics/MetricsEndpoint.java#L135

For example, we have a metric for keeping track of 200 HTTP responses and the metric is being added to both StepMeterRegisty objects. We would expect this as the value has been added to the global CompositeMeterRegistry which would be sent to each StepMeterRegistry for publishing. But why is the default behavior in the MetricsEndpoint to sum these values and essentially double the count metric when the metric is the same?

superseded

Source

pulkitmehra

👍1

Most helpful comment

@jkschneider @wilkinsona , thank you for the analysis.

We would like to take this opportunity and fix this bug. We just want to discuss the approach for this fix. In my opinion, a first-wins strategy is a right thing to do here(please confirm). I think if we change the map merge function to ignore the second update will fix it (v1, v2) -> v1.

Also, is MAX necessary here?

private BiFunction<Double, Double, Double> mergeFunction(Statistic statistic) {
        return Statistic.MAX.equals(statistic) ? Double::max : Double::sum;
    }

pulkitmehra on 16 Aug 2018

👍4

All 16 comments

The summing was introduced in https://github.com/spring-projects/spring-boot/commit/bc05352290dccdbd069c9df2a31d24a4ca3b7b18, and is intentional judging by the following portion of the commit message:

Selecting a metric by name (and optionally a set of tags) reports statistics that are the sum of the statistics on all time series containing the name (and tags).

I'll defer to @jkschneider now as he knows way more about metrics than I do.

wilkinsona on 15 Aug 2018

@wilkinsona That is correct!

jkschneider on 15 Aug 2018

Is it a common use case for applications to have multiple StepMeterRegistries? This seems to be the source of the issue.

ryanpconnors on 15 Aug 2018

For example, using CompositeMeterRegistry we are synching the same metric into two different measurement platform systems say, Datadog and JMX. So essentially, these StepMeterRegistry will have the same metric and statistics.

So, what /metric endpoint is doing is getting all the internal registries from CompositeMeterRegistry and querying the same metric name in each internal registry (here its Dtatadog and JMX), and get the same stats from them and summing them up.

The way we understand, the CompositeMeterRegistry aggregates other meter registry and provides a single interface to collect metrics. So, most of the time we will find the same metric and stats in an internal registry.

https://github.com/spring-projects/spring-boot/blob/80da9cf5ebdc256c0b743fd0c34c451a368530aa/spring-boot-project/spring-boot-actuator/src/main/java/org/springframework/boot/actuate/metrics/MetricsEndpoint.java#L112

if (registry instanceof CompositeMeterRegistry) {
            ((CompositeMeterRegistry) registry).getRegistries()
                    .forEach((member) -> collectMeters(meters, member, name, tags));
        }
        else {
            meters.addAll(registry.find(name).tags(tags).meters());
        }

pulkitmehra on 15 Aug 2018

Thank you, @jkschneider.

I guess I'm missing something here as summing the values from all of the registries doesn't make sense to me. To give a concrete example, if I have an app with two meter registries running in a JVM that has 36 live threads, the metrics endpoint tells me that there are 72 live threads:

$ http :8080/actuator/metrics/jvm.threads.live
HTTP/1.1 200
Content-Disposition: inline;filename=f.txt
Content-Type: application/vnd.spring-boot.actuator.v2+json;charset=UTF-8
Date: Wed, 15 Aug 2018 17:09:16 GMT
Transfer-Encoding: chunked

{
    "availableTags": [],
    "baseUnit": null,
    "description": "The current number of live threads including both daemon and non-daemon threads",
    "measurements": [
        {
            "statistic": "VALUE",
            "value": 72.0
        }
    ],
    "name": "jvm.threads.live"
}

There's no indication that this is a multiple of the actual number of live threads. I guess there are also scenarios where it may not be an exact multiple. For example, if a thread is started in between querying two registries.

@jkschneider Can you please help us to understand why summing, rather than, say, a first-wins strategy is the right thing to do?

wilkinsona on 15 Aug 2018

👍1

A concrete example to show this usecase

SpringBoot App

@SpringBootApplication
@RestController
public class SampleApplication {

    @Autowired
    private MeterRegistry meterRegistry;

    Counter success;

    @PostConstruct
    public void init(){
         success = Counter.builder("Success").register(meterRegistry);
    }


    @RequestMapping("/")
    public String index() {
        success.increment();
        return "success";
    }

    public static void main(String[] args) {
        SpringApplication.run(SampleApplication.class, args);
    }

}

application.properties

management.metrics.use-global-registry=true
management.endpoints.web.exposure.include=*
management.endpoints.web.base-path=/actuator
management.endpoint.health.show-details=always
management.server.servlet.context-path=/management

# StatsD Micrometer Config (Datadog)
management.metrics.export.statsd.enabled=true
management.metrics.export.statsd.flavor=datadog
management.metrics.export.statsd.host=localhost
management.metrics.export.statsd.max-packet-length=1400
management.metrics.export.statsd.polling-frequency=PT10S
management.metrics.export.statsd.port=8125
management.metrics.export.statsd.publish-unchanged-meters=true 

# graphite Micrometer Config
management.metrics.export.graphite.duration-units=seconds
management.metrics.export.graphite.enabled=true
management.metrics.export.graphite.host=localhost
management.metrics.export.graphite.port=2004
management.metrics.export.graphite.protocol=pickled
management.metrics.export.graphite.rate-units=seconds
management.metrics.export.graphite.step=1m
management.metrics.export.graphite.tags-as-prefix=test

management.metrics.binders.files.enabled=true
management.metrics.binders.integration.enabled=true
management.metrics.binders.jvm.enabled=true
management.metrics.binders.logback.enabled=true
management.metrics.binders.processor.enabled=true
management.metrics.binders.uptime.enabled=true

pom.xml

<dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>

        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-statsd</artifactId>
            <version>1.0.6</version>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-graphite</artifactId>
            <version>1.0.6</version>
        </dependency>



        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

Now when you hit / endpoint the counter should be incremented to 1 and /actuator/metrics/Success should show it as 1 but its showing as

{
    "name": "Success",
    "description": null,
    "baseUnit": null,
    "measurements": [
        {
            "statistic": "COUNT",
            "value": 2
        }
    ],
    "availableTags": []
}

pulkitmehra on 15 Aug 2018

👍1

Sorry, the intention was to sum across dimensions, not across registries. I would say this is a bug.

jkschneider on 16 Aug 2018

@jkschneider @wilkinsona , thank you for the analysis.

Also, is MAX necessary here?

private BiFunction<Double, Double, Double> mergeFunction(Statistic statistic) {
        return Statistic.MAX.equals(statistic) ? Double::max : Double::sum;
    }

pulkitmehra on 16 Aug 2018

👍4

@jkschneider Could you take a look at the comment above please?

wilkinsona on 11 Sep 2018

I think if you eliminate summing altogether, you'll also inadvertently eliminate dimensional aggregation. The max is necessary for dimensional aggregation. In other words, we shouldn't sum/max metrics across registries, but should sum/max them across dimensions.

jkschneider on 11 Sep 2018

Hi, @jkschneider , thank you for the reply.

A quick follow up question, what do we mean by dimensional aggregation here? A small example will help us getting the logic right.

In other words, we shouldn't sum/max metrics across registries, but should sum/max them across dimensions.

Dimensions: Do you mean same metric name but different tags?

pulkitmehra on 13 Sep 2018

Dimensions: Do you mean same metric name but different tags?

Yes.

wilkinsona on 13 Sep 2018

Thank @wilkinsona ,

This test-case explains it well. Just to vet my thoughts, lets say we have two internal registeries inside CompositeMeterRegistry with the following metrics

Registry1

name:: "cache.result.hit" value: 10 tag: "host", "1"
name:: "cache.result.hit" value: 12 tag: "host", "2"
name:: "cache.result.miss", value:5 tag: "host", "1"
name:: "cache.result.miss", value:5 tag: "host", "2"

Registry2

name:: "cache.result.hit" value: 10 tag: "host", "1"
name:: "cache.result.hit" value: 15 tag: "host", "2" (differs in value)
name:: "cache.result.miss", value:5 tag: "host", "1"
name:: "cache.result.miss", value:5 tag: "host", "2"

Output

name:: "cache.result.hit" value: 25 tag: ["host", "1"] , [ "host", "2"] (Registry2 wins)
name:: "cache.result.miss" value: 10 tag: ["host", "1"] , [ "host", "2"]

Logic

For each internal registry inside CompositeRegistry
sum/max across dimensions.
Update the dimension value only if its greater in other registry

Please confirm whether this logic is ok to implement?

CC: @jkschneider

pulkitmehra on 13 Sep 2018

Everything looks good except that I wouldn't bother trying to determine which registry has the greater summed value. Just pick the first registry in the composite every time and forget the other one.

Remember, unlike in Spring Boot 1.x, /actuator/metrics is _just_ a diagnostic tool. It's output shouldn't be used to actually drive decisions. You really should be relying rather on how each registry exposes metrics to its target backend.

jkschneider on 14 Sep 2018

@jkschneider , please review.

pulkitmehra on 18 Sep 2018

Closing in favor of PR #14497