Spring-boot: Actuator is using CPU while app is doing nothing

Created on 12 Dec 2017  路  20Comments  路  Source: spring-projects/spring-boot

Hi there.

If you add the spring-boot-starter-actuator as dependency, after the first http call, the app will consume 6%-7% CPU forever (on a few threads). Even when there is no active http connection nor application background thread working.

If I remove this dependency, the app is idle at 0% as expected.

Edit
Spring Boot version 2.0.0.0M7
Java 8
Archive.zip

The demo app have been generated with https://start.spring.io/

Steps to reproduce the issue
1) Build the app
2) Launch it using java -jar
3) Monitor CPU usage using htop => Nothing special is seen
4) Open a web browser and browse to localhost:8080 => error page is shown
5) Check CPU usage on htop => 6-7% used endlessly by the app

enhancement

Most helpful comment

I found that setting the system property LatencyUtils.useActualTime to false disables the SimplePauseDetector but keeps the rest of micrometer running.

It switches LatencyUtils into a test mode such that TimeServices#sleepNanos will call wait() on a monitor that never fires.

All 20 comments

@dgsqrs What version of Spring Boot are you using? Can you share a small sample that shows the issue?

Hi @philwebb .
I updated my initial post with the requested information.

I can reproduce this. After the first request, the CPU seems to be busy:

after-first-request

VisualVM shows this:
cpu

This behaviour seems to be related with the new metrics support, since excluding the autoconfiguration with @SpringBootApplication(exclude = MetricsAutoConfiguration.class) "fixes" this.

@jkschneider is this related to the GC pause detection or some specific instrumentation? Is the cost constant even when the application is busy or is this just something to keep the VM busy on purpose?

Is this related to the GC pause detection

Yes, I believe so (and other system-level pauses unrelated to GC).

Is the cost constant even when the application is busy

Yes.


I'm not sure if it's possible to improve this mechanism to use no CPU. We could offer a config option to tradeoff monitoring accuracy for CPU consumption, though it's unclear to me what the default should be.

More research is in order. Perhaps there is a way to accomplish accurate pause detection and indiscernible CPU consumption.

My initial feeling is that we should set the default to not consume CPU. I expect that GC pause detection is probably quite an advanced problem for most users and by the time that they need it they'll be quite a long way into their journey with metrics. Constant CPU usage, however, is probably going to get noticed.

tldr: I think this should be configurable and by default off. In the docs we should call it out.

I don't necessarily disagree with the default if the mechanism can't be improved, but this kind of reasoning really troubles me:

by the time that they need it they'll be quite a long way into their journey with metrics

Unless we compensate for pauses, they are basically looking at garbage. This lures folks into a false sense of security, the "blindfold with a lollipop" state of being.

I saw a study once (conveniently can't remember where) that suggested something like 20% of Java apps exhibited periodic pauses of 5 sec or greater. Even if both of these figures were far lower, it's still frightening. If nothing else, I want the default set of metrics people get to guide them to responsible operationalization, e.g. you should probably run more than one instance in prod and have some circuit breaker mechanism to control SLA boundaries.

I understand the temptation to view this as a day 2 problem. It's just not clear to me that many organizations ever even realize there is a problem in the first place.

At a cursory glance, I'm pretty confident we can implement a pause detector that trades off pause-length granularity for apparent CPU consumption. Then we can select a default that still captures well the eye-popping GC pauses.

I found that setting the system property LatencyUtils.useActualTime to false disables the SimplePauseDetector but keeps the rest of micrometer running.

It switches LatencyUtils into a test mode such that TimeServices#sleepNanos will call wait() on a monitor that never fires.

I found that setting the system property LatencyUtils.useActualTime to false

Wow, the hacking prize goes to you today, @osialr :). Thanks for the workaround option.

@jkschneider do you have a micrometer issue that we can track?

Testing the fix this morning. Here is the issue: https://github.com/micrometer-metrics/micrometer/issues/328

Summarizing the changes coming in Micrometer 1.0.0-rc.6:

The pause detector is configurable to an extent via MeterRegistry#config(). The default pause detector will still be a clock-drift detector which needs to wake up periodically to check for drift. By configuring the sleep interval (time between checks), you can control CPU consumption. We've selected a default in rc6 that reduces the CPU consumption to about 0.1% while still providing decent compensation for long pauses.

You can change the intervals on the clock-drift detector like so:

@Bean
MeterRegistryConfigurer registryConfig() {
   return r -> r.config().pauseDetector(new ClockDriftPauseDetector(sleepInt, pauseInt));
}

We also provide an alternate, noop implementation. Unless you are concerned about even the slightest amount of CPU use, this is not recommended:

@Bean
MeterRegistryConfigurer registryConfig() {
   return r -> r.config().pauseDetector(new NoPauseDetector());
}

It's possible that we'll find alternate pause detector implementations in the future. Perhaps the JDK will someday provide an event-based interface for GC events. Or we could probably provide a detector limited to GC events by scraping GC logs.

I've tried another round of testing using the latest Spring Boot snapshots; I'm not really seeing an improvement here and even seeing conflicting results.

Note that I'm looking at CPU usage on the PID line, which is the amount of CPU used by the process last time it was scheduled on a CPU. So "1%" means "1% on the CPU it was scheduled on", not "1% of the host capacity". This might be biased, but I'm trying to ignore other processes running on the host.

Results

Spring Boot and Actuator:
boot-actuator

Spring Boot and Actuator, after first request:
boot-actuator-first-request

Spring Boot and Actuator, NoPauseDetector:
boot-actuator-noop-pause-detector

Spring Boot and Actuator, NoPauseDetector after first request:
boot-actuator-no-pause-detector-first-request

Spring Boot without Actuator:
boot-without-actuator

Spring Boot with Actuator, but Metrics auto-configuration ignored:
boot-actuator-metrics-excluded

Summary

  • "Spring Boot without Actuator" and "Spring Boot with Metrics auto-configuration ignored" are still equivalent.
  • We now see the CPU usage even before the first request, which means some initialization code might have changed. CPU usage seems to decrease after the first request.
  • NoPauseDetector doesn't seem to make a big difference in my tests

There might be something else at play here and we need to look deeper.
Any idea @jkschneider @osialr ?

Fixed a bug in micrometer-core that prevented the configured detector from being used
(https://github.com/micrometer-metrics/micrometer/issues/350).

@jkschneider Is there a SNAPSHOT available that we can try? I'd like to test this before you cut RC8

@philwebb Yes. Snapshots are published on every successful build to repo.spring.io

Thanks @jkschneider. Did you have a PR lined up for this as well? I can't remember what we discussed and if the upgrade is easy or not.

I've hacked a draft PR here.

I'm not sure about a few of those changes, especially in our test suite where we were previously counting measures (if I'm not mistaken). But the test suite is green.

@bclozel Oops, I guess you got a little ahead of me.

I'll be piling more commits into: https://github.com/spring-projects/spring-boot/pull/11776

OK so the update to the snapshot has fixed the issue for me. @dgsqrs if you can give 2.0.0.BUILD-SNAPSHOT a try that would be great.

Was this page helpful?
0 / 5 - 0 ratings