Micrometer: Memory leak in histograms in Timers

Created on 2 Aug 2018 · 11 Comments · Source: micrometer-metrics/micrometer

We use client-side percentiles in timers, and after the apps have been running for some time we see considerable memory usage and finally an OOM attributed to timers. Attached is the leak-suspects view from the heap dump:
[image: heap dump leak suspects]
The dump shows about 4 GB of accumulated timers.

This is the code we use to create timers:
[image: timer creation code]

Once created, the timers are not recreated, but recording times leaks memory over a few hours of use. We use Micrometer v1.0.4 along with Spring Boot 2.0.RC6.


All 11 comments

Please try with the latest versions of Boot and Micrometer, 2.0.4.RELEASE and 1.0.6 respectively.

I don't have that option at hand right now. The app is in pre-production, and changing the Boot BOM would require changes to a lot of other things as well, including Spring Cloud. Is there any solution that I can use with Micrometer 1.0.4 and Boot 2.0.RC6?

There's no such release as Spring Boot 2.0.0.RC6: https://repo.spring.io/libs-milestone/org/springframework/boot/spring-boot-parent/

Going into production with a release candidate where there is a release seems pretty dangerous.

@vireshwali There isn't really a memory leak here; you are just creating a _lot_ of unique tag values. Each unique combination of tags requires its own data structure to compute percentiles. If you can't figure out why you are creating so many tags, you can clamp the max tag cardinality with MeterFilter#maximumAllowableTags.

It really never makes sense to use client side percentiles on a metric with more unique tags than can be visualized in a single chart, since there is no way to aggregate percentiles across dimensions.
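
For illustration, clamping tag cardinality with `MeterFilter#maximumAllowableTags` might look roughly like the sketch below. The meter name `job.endpoint.latency`, the tag key `run`, and the limit of 50 are hypothetical, since the original timer code was only posted as an image. It assumes micrometer-core is on the classpath.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;

public class TagCardinalityClamp {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Cap the number of distinct values of the "run" tag on meters whose
        // name starts with "job.endpoint"; once the cap is reached, further
        // ids are handed to MeterFilter.deny() and are not registered.
        registry.config().meterFilter(
                MeterFilter.maximumAllowableTags("job.endpoint", "run", 50, MeterFilter.deny()));

        for (int run = 0; run < 1_000; run++) {
            Timer timer = Timer.builder("job.endpoint.latency") // hypothetical meter name
                    .tag("run", Integer.toString(run))          // unbounded tag, as in the report
                    .publishPercentiles(0.5, 0.95)              // client-side percentiles
                    .register(registry);
            timer.record(Duration.ofMillis(10));
        }

        // Far fewer than 1,000 timers end up registered; recordings against
        // denied ids are dropped instead of accumulating percentile state.
        System.out.println("registered meters: " + registry.getMeters().size());
    }
}
```

The filter only bounds the damage. Samples for denied tag values are silently discarded, so it is a safety net rather than a substitute for choosing bounded tags.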

Thanks @jkschneider.
So if I disable client-side histograms by adding publishPercentileHistogram(enabled), will that prevent this?

I need the tags. We use the timers in scheduled jobs, and the tags indicate which run of the job had which time stats, so limiting the tags is not correct in my case. A tag here denotes a run of the job, so if the job runs 100 times a day, it will add 100 tags.

That sort of detail is better suited for logs. Metrics are intended to be aggregated.

So if I disable client-side histograms by adding publishPercentileHistogram(enabled), will that prevent this?

Tagging with an unbounded tag like job ID is simply going to create memory issues. You can disable percentiles and will get further, but it's still going to catch up with you later. The point of metrics is not to understand how long a particular job took to run but to understand the distribution of times for many jobs.

Well, we were not actually timing the job runs, but the external endpoints hit during the execution of each job run. But your reasoning applies there as well ("understand the distribution of times for many jobs"), so I refactored the code to use a single metric and tag that aggregates times to a single target across all job runs. That should do it, based on what you have explained above.
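
To make the effect of that refactoring concrete, here is a toy model in plain Java. It is not Micrometer's implementation: each distinct tag combination simply owns its own sample buffer, standing in for the per-combination percentile state described earlier in the thread. The target names and the `simulate` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only, NOT Micrometer's implementation: one buffer per distinct
// tag combination, kept reachable for as long as the model itself lives.
class TimerStateModel {
    private final Map<String, List<Long>> stateByTags = new HashMap<>();

    // Record one duration, creating the per-combination state on first use.
    void record(String tags, long millis) {
        stateByTags.computeIfAbsent(tags, k -> new ArrayList<>()).add(millis);
    }

    // Number of separate structures that are retained.
    int retainedStructures() {
        return stateByTags.size();
    }

    // Simulate `runs` job runs that each call every target once, tagging
    // either by target and run (before) or by target only (after).
    static int simulate(int runs, String[] targets, boolean tagByRun) {
        TimerStateModel model = new TimerStateModel();
        for (int run = 0; run < runs; run++) {
            for (String target : targets) {
                String tags = tagByRun
                        ? "target=" + target + ",run=" + run
                        : "target=" + target;
                model.record(tags, 10L + run);
            }
        }
        return model.retainedStructures();
    }
}
```

With two hypothetical targets and 100 runs, `simulate(100, targets, true)` retains 200 structures and keeps growing with every run, while `simulate(100, targets, false)` retains 2 no matter how many runs occur.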

Sorry to post on a closed thread, but I have a small related question. Does the same reasoning apply to counters and gauges as well? They are plain values, so once they are posted to the aggregator, the old values should be eligible for GC. Or are they also retained per tag over an extended period of time?

Yes, more tags will take more memory. Counters and gauges are simpler and won't grow at the same rate, but once a counter or gauge has been created with a given tag, it will hang around.
