To help you better imagine the use case for using filters to control distribution statistics, consider timers instrumented in low-level core libraries (like the instrumentation that now exists in HikariCP and the Rabbit client). These authors can't make a general-purpose decision about whether it is useful for users of their library to see certain summary statistics, like client-side percentiles. For example:
Billy is using InfluxDB, which does not have a rich enough query language to compute aggregable percentiles from histograms. Billy might choose to ship a Micrometer-computed P95, knowing that its usefulness is limited by the tag cardinality of the underlying metric. For example, if the timer is tagged with "host" and "result", and his organization runs only 10 service instances with 2 possible results, he could plausibly plot a non-aggregable P95 by simply plotting all the individual time series independently and monitoring for outliers.
Sue is using Prometheus and never wants to ship Micrometer-computed percentiles because she has access to histogram_quantile and so the usefulness of her plots doesn't need to be constrained by tag cardinality. She chooses to ship the histogram buckets.
Jane is using Wavefront and doesn't care about percentile distributions at all. She is comfortable monitoring the range between the distribution average (a measure of centrality that is neither better nor worse than P50 in the general case) and max. When Wavefront publishes a blog post about its new aggregable percentiles feature, Jane gets P99 on key latency metrics via an additional property.
An organization has adopted an "instrument all the things" approach but is worried about the scalability of its on-prem Prometheus cluster. If the cluster ever gets stressed, they want to be able to drop summary stats on some less-critical latency metrics and perhaps disable whole swaths of other metrics. They want to accomplish both in a property-driven way, so that they could ease pressure on their monitoring system across their whole stack with something like Archaius or Config Server.
There is no way for the Rabbit client author to satisfy all of these needs without shipping wasteful extra time series to each of our imaginary users. Our recommendation for library authors is to always instrument with plain timers and let application developers decide at the last minute, via filters, which summary statistics they want.
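As a rough sketch, that last-minute decision could look something like this in application configuration (the property names here are hypothetical, just to illustrate the shape of filter-driven configuration; the final namespace is what this issue is designing):

```properties
# Hypothetical property names -- the final namespace is still being designed.
# Billy ships a Micrometer-computed P95 for a specific timer:
management.metrics.filter.http.server.requests.percentiles=0.95
# Sue ships histogram buckets instead, to be queried with histogram_quantile:
management.metrics.filter.http.server.requests.percentiles-histogram=true
```

The library ships only the plain timer; each user opts into the summary statistics their backend can make use of.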
@jkschneider does it have to be property-based though? Couldn't an application developer craft the stats they want, based on what is available, using a DSL or something? The number of options is dense, and the fact that we had to restructure the namespace is a sign that perhaps a DSL would be a better fit.
Ideally it's property-based, yes. In the long term we want folks to be able to leverage Config Server to dynamically influence the enablement and fidelity of certain metrics.
the fact that we had to restructure the namespace is a sign that perhaps a DSL would be a better fit?
Perhaps it is just a sign that the original structure was a poor design. After all, I came up with it ;)
See also this comment when looking at the implementation.
@jkschneider I've been thinking some more about this and I'm wondering if something like what we do with OAuth2ClientProperties might work. What if we split the properties into two parts: a named definition, and then a reference to it?
Something like:
metrics.config.light.percentiles=1,2,3
metrics.config.light.somethingelse=forsure
metrics.config.heavy.percentiles=10,20,30
metrics.config.heavy.somethingelse=noway
metrics.apply.net.foo.bar=light
metrics.apply.net.foo.baz=light
metrics.apply.net.foo=heavy
Would that extra level of indirection be useful?
We could even provide built in configs for common scenarios.
@philwebb Oh interesting. That seems quite useful.
@philwebb I like this very much. Please keep in mind, however, that the number of options "vary" (so you don't have a single POJO as in the provider case; it depends on your meter's type, if I understand properly).
As such, the reference to a config can be a bit fragile, since you can link any metric to any configuration type, including ones that aren't legitimate for it. That's also the rationale behind my DSL idea.
The 2-part config helps with the more complex percentile settings, but it would be nice not to have to do that just to enable/disable metrics.
@checketts the enablement can be a separate setting from that apply thing.
management.metrics.enable.net.foo.bar=false
the number of options "vary"
For certain properties like minimumExpectedValue and sla that are dimensioned according to the meter's base unit, the configuration will only ever have two variants: those that express a unit of time and those that don't.
We treat time specially because there are so many different expectations around what the base unit of time should be (driven by the fact that the "real" base unit of time, seconds, is such a coarse unit for what we are measuring).
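To illustrate the two variants (property names hypothetical, following the shape discussed in this thread): for a timer, the dimensioned values are durations, while for a plain distribution summary they are dimensionless numbers in the meter's base unit:

```properties
# Hypothetical property names.
# Timer: minimumExpectedValue/sla are expressed as durations
management.metrics.filter.http.server.requests.minimum-expected-value=100ms
management.metrics.filter.http.server.requests.sla=250ms,500ms,1s
# DistributionSummary: the same properties are plain numbers (e.g. bytes)
management.metrics.filter.payload.size.minimum-expected-value=1024
```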
@checketts
2-part config helps with the more complex percentile settings, but it would be nice to not have to do that just to enable/disable metrics
I'm wondering if we can have 'disable' as a built-in config. That way you can do:
management.metrics.filter.net.foo.bar=disable
Hi, I wanted to mention our use case here to see if this issue will cover it.
We are currently using 3 meter registries: JMX (accessed via Spring Boot Admin), Prometheus (for Grafana dashboards), and CloudWatch (mainly for alarms).
In order to reduce costs and focus on the most crucial alarms (such as 500 status codes) we apply a lot of meter filters to the CloudWatch registry ONLY. These include:
(We also set a single common tag on the CloudWatchMeterRegistry which is the application name - we don't need this on Prometheus or JMX)
Currently we implement this in a shared library with some Spring Boot auto-configuration that creates and configures the CloudWatch meter registry. We have this as a dependency in all our services, although ideally we would just rely on the standard CloudWatch auto-configuration and use properties from Spring Cloud Config Server to configure the meter filters.
Will what's proposed here cover most, if not all, of our needs?
@djgeary What I've seen is that the complexity of setting property values for a single registry via properties becomes its own DSL at some point. For example, I made an attempt at the replaceTagValues config, but wanted it to depend on the meter name. I came up with something like ...filters.meter.name.combine: tag1|allowedValue1,defaultNonMatchingValue;tag2|otherAllowedValue (etc). All the bar/semicolon/comma junk was so specific to my use case that it didn't make sense to release it upstream.
From what I've followed so far, the default config applies across all registries, and only supports configuring summaries (percentiles/histograms) and excluding meters.
@checketts Yes, I realise it may not be possible to do all of this via properties, and it can easily become very use-case specific.
I just wanted to add what we were doing in case it was useful for what's being designed here.
Specifically the fact that we are only filtering on one MeterRegistry in a CompositeMeterRegistry setup, and the examples of how and why we are filtering.
I agree the replaceTagValues functionality is probably best suited to doing in code (although what we do is copied directly from an example, so possibly it's quite common and could be a built-in option?), but our 'whitelist' mechanism could possibly be done in properties. Our current implementation looks like this in YAML:
management:
  metrics.export.cloudwatch:
    namespace: microservice-testing
    enabled: true
    whitelist:
      - name: http.server.requests
        tags:
          uri: /lookup,/actions
      - name: custom1
      - name: custom2
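For the name-only part of that whitelist, a deny-by-default setup might be expressible with the hierarchical enable properties discussed above (the `all` root key is an assumption here, and the per-tag uri filtering would still need code):

```properties
# Assumed 'all' root key; exact naming depends on the final design.
management.metrics.enable.all=false
management.metrics.enable.http.server.requests=true
management.metrics.enable.custom1=true
management.metrics.enable.custom2=true
```

Though as noted above, these would apply across every registry, not just CloudWatch, so per-registry filtering would still be a gap.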
We're going to try to reduce the scope of this one to just enablement, percentiles, percentilesHistogram, and SLA.
Woo! Thanks for the heavy lifting @philwebb
@djgeary We'll be thinking more about whitelisting, especially as it relates to CloudWatch, when we get to https://github.com/micrometer-metrics/micrometer/issues/303.