Spring-boot: Composite HealthIndicator that runs them in parallel

Created on 14 Mar 2015 · 7 Comments · Source: spring-projects/spring-boot

Right now, invoking the health endpoint leads to a synchronous invocation of all the underlying HealthIndicator instances that have been configured. It would be nice to offer an option to run these in parallel instead.
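To make the request concrete, here is a minimal sketch of parallel aggregation in plain Java. It does not use Spring Boot's own types: the `Status` enum, `runAll` and `aggregate` are hypothetical stand-ins invented for illustration, and each check is modelled as a simple `Supplier`. A check that throws is treated as DOWN, which is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public class ParallelHealthSketch {

    // Hypothetical stand-in for a health status; only UP and DOWN are modelled.
    enum Status { UP, DOWN }

    // Submits every check to the pool first, then waits for all of them.
    // Total wall time is roughly the slowest check, not the sum of all checks.
    static Map<String, Status> runAll(Map<String, Supplier<Status>> checks,
                                      ExecutorService pool) {
        Map<String, CompletableFuture<Status>> futures = new LinkedHashMap<>();
        checks.forEach((name, check) -> futures.put(name,
                CompletableFuture.supplyAsync(check, pool)
                        // Assumption: a check that throws counts as DOWN.
                        .exceptionally(ex -> Status.DOWN)));

        Map<String, Status> results = new LinkedHashMap<>();
        futures.forEach((name, future) -> results.put(name, future.join()));
        return results;
    }

    // The composite is DOWN as soon as any single check is DOWN.
    static Status aggregate(Map<String, Status> results) {
        return results.containsValue(Status.DOWN) ? Status.DOWN : Status.UP;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Map<String, Supplier<Status>> checks = new LinkedHashMap<>();
            checks.put("db", () -> Status.UP);
            checks.put("queue", () -> Status.UP);
            Map<String, Status> results = runAll(checks, pool);
            System.out.println(results + " -> " + aggregate(results));
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size bounds how many checks actually run at once, so with more checks than threads the remaining ones queue up behind the first batch.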

enhancement

All 7 comments

Definitely in the "nice to have" bucket. I imagine that the health endpoint isn't hit all that often, so performance probably isn't a big deal.

@philwebb I had the idea when I saw #2630

I would also like to have support for this. My project has a bunch of health checks, and some of them can take up to a few seconds. So the overall performance of the /health endpoint is terrible with the sequential approach.

:+1:
Some of my projects have 20+ health checks.
For now we use our homebrew solution to do them in parallel.

Would love to see support for parallel health checks in Spring Boot out of the box.
Or at least a way to inject our own CompositeHealthIndicator.

It would also be nice to include, as a standard detail, how long each health indicator took to execute.
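One way to picture that timing detail is a wrapper that measures each check and attaches the duration to its result. This is only a sketch in plain Java: the `timed` helper and the `status` / `durationMs` keys are hypothetical names, not anything Spring Boot provides, and the status is modelled as a plain string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class TimedHealthSketch {

    // Runs one check and returns its status together with the elapsed time.
    // The "status" and "durationMs" keys are made up for this illustration.
    static Map<String, Object> timed(Supplier<String> check) {
        long start = System.nanoTime();
        String status;
        try {
            status = check.get();
        } catch (RuntimeException ex) {
            // Assumption: a check that throws is reported as DOWN.
            status = "DOWN";
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        Map<String, Object> details = new LinkedHashMap<>();
        details.put("status", status);
        details.put("durationMs", elapsedMs);
        return details;
    }

    public static void main(String[] args) {
        System.out.println(timed(() -> "UP"));
    }
}
```

Because the measurement wraps the check rather than living inside it, the same helper would work whether checks run sequentially or in parallel.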

Real-world example here:

Some services have requirements to respond within a given time limit. This time limit may be set by an external organization, not always under the control of the devs. This kind of timeout gets interpreted as a DOWN state, wakes devs up in the middle of the night, reboots service instances, and all the while provides no context as to which component was experiencing high latency (because there was no HTTP response).

I have successfully added timeouts per-actuator, but with 20 checks that require timeouts of around 5 seconds (any less would cause alerts for minor network blips), we are asking our caller to wait up to 1 minute 40 seconds (20 × 5 s) for a response. At my org, this is absurd, and nobody wants to permit this behavior.

I second everything @erikgreif-acc said.

Sometimes I run into a situation in which a timeout occurs, but it is unclear which of the components caused it. So I'd like to propose a timeout for the parallel execution, after which every unresponsive component is considered/reported as down. To keep the existing behavior, the default for this timeout should be infinite.
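The proposal could be sketched with the standard `ExecutorService.invokeAll(tasks, timeout, unit)` call, which cancels any task that has not finished when the deadline passes. Everything else here is an assumption for illustration: the `runWithTimeout` helper is hypothetical, statuses are plain strings, and both cancelled and failing checks are mapped to DOWN.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TimeoutHealthSketch {

    // Runs all checks in parallel under one shared deadline and reports,
    // per component, whether it answered in time.
    static Map<String, String> runWithTimeout(Map<String, Callable<String>> checks,
                                              ExecutorService pool,
                                              long timeoutMs) throws InterruptedException {
        List<String> names = new ArrayList<>(checks.keySet());
        List<Callable<String>> tasks = new ArrayList<>(checks.values());

        // invokeAll returns futures in task order; tasks still running at
        // the deadline are cancelled.
        List<Future<String>> futures =
                pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);

        Map<String, String> results = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            String status;
            try {
                status = futures.get(i).get();
            } catch (CancellationException | ExecutionException ex) {
                // Timed out (cancelled) or threw: reported as DOWN in this sketch.
                status = "DOWN";
            }
            results.put(names.get(i), status);
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Map<String, Callable<String>> checks = new LinkedHashMap<>();
            checks.put("fast", () -> "UP");
            checks.put("slow", () -> { Thread.sleep(5_000); return "UP"; });
            System.out.println(runWithTimeout(checks, pool, 200));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The response then names the slow component instead of the whole endpoint timing out. Note that the pool needs at least as many threads as checks, otherwise a check that is still queued at the deadline is also cancelled and reported as DOWN.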

And while working on it, exporting metrics for each component (number of total, successful and failed executions, and the sum of execution time for successful executions) would make alerting (and inhibiting) on health-related issues easier and would help with dissecting problems faster.
