Consul: Healthcheck should be checked immediately after service registered

Created on 30 Aug 2016 · 8 Comments · Source: hashicorp/consul

Health checks currently run after some delay, so a service appears unhealthy right after registration even when it is actually healthy. To solve this problem, the health check should run immediately after the service is registered.

theme/health-checks type/enhancement

janisz

All 8 comments

Would it be possible to perform the healthcheck _before_ the service appears in the registry, so its initial status reflects the true status of the service?

kshep on 30 Aug 2016

👍2

Hi @janisz and @kshep, some health checks can be expensive, and multiple services might get registered all in one go, such as with a consul restart, so we always stagger them to try to randomize their phase. You can set the initial health state, so you can default healthy or unhealthy until the first check. Having it not show up until it has run the first check is an interesting idea. It's practically the same as starting it unhealthy, but I could see how you might care about that if you have monitoring set up.

slackpad on 21 Sep 2016

> You can set the initial health state, so you can default healthy or unhealthy until the first check.

How do I set the initial health state?

janisz on 21 Sep 2016

There's a status field described in the "Initial Health Check Status" section of the checks guide:

By default, when checks are registered against a Consul agent, the state is set immediately to "critical". This is useful to prevent services from being registered as "passing" and entering the service pool before they are confirmed to be healthy. In certain cases, it may be desirable to specify the initial state of a health check. This can be done by specifying the status field in a health check definition, like so:

{
  "check": {
    "id": "mem",
    "script": "/bin/check_mem",
    "interval": "10s",
    "status": "passing"
  }
}

The above service definition would cause the new "mem" check to be registered with its initial state set to "passing".

slackpad on 21 Sep 2016

👍1

> You can set the initial health state, so you can default healthy or unhealthy until the first check. Having it not show up until it has run the first check is an interesting idea.

It'd be slick if there were a fourth(?) initial health status like 'check' that indicated the corresponding service shouldn't be registered until a check is performed.

> It's practically the same as starting it unhealthy, but I could see how you might care about that if you have monitoring set up.

That's exactly our use case.

kshep on 21 Sep 2016

This is related to #2450. My take is that when a new health check is added, Consul should keep the service's existing state until the health check has run. Perhaps this could be implemented with an initial state of "unknown" that makes Consul ignore the check when determining the service state? Of course, this applies only when the check is added while Consul is running; otherwise the old behavior is fine.

takeda on 30 Nov 2016
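The "unknown" proposal above can be sketched as follows. This is a hypothetical illustration, not Consul behavior: the "unknown" status does not exist in Consul, and the `aggregate` function and its "worst status wins" rule are assumptions used only to show how such a state could be skipped when computing a service's overall status.

```go
package main

import "fmt"

// severity ranks the statuses that count toward the service state;
// a higher number is worse. "unknown" is deliberately absent.
var severity = map[string]int{"passing": 0, "warning": 1, "critical": 2}

// aggregate returns the worst status among checks, skipping any check
// whose status is not ranked (such as the hypothetical "unknown").
// If no check has reported yet, the previous service state is kept.
func aggregate(previous string, checks []string) string {
	worst, seen := "", false
	for _, status := range checks {
		rank, ok := severity[status]
		if !ok {
			continue // "unknown": ignore until the check has run
		}
		if !seen || rank > severity[worst] {
			worst, seen = status, true
		}
	}
	if !seen {
		return previous
	}
	return worst
}

func main() {
	// A newly added check that has not run yet does not change the state.
	fmt.Println(aggregate("passing", []string{"passing", "unknown"})) // passing
	// Checks that have actually reported still count.
	fmt.Println(aggregate("passing", []string{"critical", "unknown"})) // critical
}
```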

By the way, I've found that the initial status is not always respected. I set the initial status to "passing":

service := &api.AgentServiceRegistration{
    Name: c.groupName,
    Port: 50511,
    Checks: api.AgentServiceChecks{
        &api.AgentServiceCheck{
            HTTP:     "http://127.0.0.1:50513/healthcheck",
            Interval: "10s",
            Status:   "passing",
        },
    },
}
err := agent.ServiceRegister(service)

The service randomly starts in the "critical" state until the first health check has run (in fact, that happens more often than not). Tried on 0.7.3.

kanekv on 2 Feb 2017

I have a script check whose frequency can be very low (either because the check is expensive or because we don't need immediate feedback), for example a check on a disk array. If I set the interval to 1 hour, then for 1 hour after startup my check will be critical (or passing if I use the status field, but in my use case the user will check the health right after boot).
What I need is a way to specify an interval plus a way to express that the first check should run immediately rather than at the end of the first interval.