Nomad: Register Consul Service and Checks atomically

Created on 5 Mar 2018  ·  4 Comments  ·  Source: hashicorp/nomad

Context

We use Nomad with Consul and Docker to deploy gRPC services to a cluster. We use Linkerd as the load balancer; it uses Consul to get availability information for the services.

Scenario

We observe transient errors at the consumer when a new instance of the service is starting during a service update with Nomad.
(A TCP health check is used to avoid routing requests to an instance whose Docker container has started but whose service is not yet ready to receive requests. Linkerd is configured to only consider services with a passing health check status.)

Expected behavior

Services can be updated seamlessly with the above setup, without transient errors at consumers during the update.

Versions

Nomad v0.6.3
Consul v1.0.3
Linkerd v1.3.5

Analysis

Linkerd uses blocking queries to get availability information from Consul.
Example: /v1/health/service/testservice?index=1366&dc=dc1&passing=true
I captured the network traffic: this type of query first returns the new service instance without the TCP health check defined, then shortly afterwards returns it with the TCP health check.
So it seems Consul considers a service "passing" when it has no health checks defined,
and Nomad first registers the new service instance and then registers its health checks in a separate call.
The captured network packets (Nomad -> Consul) confirm that this happens in two separate requests:

PUT /v1/agent/service/register HTTP/1.1
Host: 127.0.0.1:8500
User-Agent: Go-http-client/1.1
Content-Length: 165
Accept-Encoding: gzip
{"ID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
"Name":"testservice","Port":27123,"Address":"10.0.75.1","Check":null,"Checks":null}

HTTP/1.1 200 OK
Date: Fri, 02 Mar 2018 14:33:31 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8
PUT /v1/agent/check/register HTTP/1.1
Host: 127.0.0.1:8500
User-Agent: Go-http-client/1.1
Content-Length: 233
Accept-Encoding: gzip
{"ID":"942d6c5b20bea9520f70ced61336a2987bf9c530","Name":"service: \"testservice\" check",
"ServiceID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
"Interval":"10s","Timeout":"2s","TCP":"10.0.75.1:27123"}

HTTP/1.1 200 OK
Date: Fri, 02 Mar 2018 14:33:31 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8
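
For context, in the window between those two requests the blocking health query shown above returns roughly the following (an abbreviated sketch, not a verbatim capture; the node name is a placeholder). The instance is listed with only the node-level serfHealth check, so it satisfies passing=true even though its TCP check does not exist yet:

[
  {
    "Node": { "Node": "nomad-client-1", "Address": "10.0.75.1" },
    "Service": {
      "ID": "_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
      "Service": "testservice",
      "Address": "10.0.75.1",
      "Port": 27123
    },
    "Checks": [
      { "CheckID": "serfHealth", "Name": "Serf Health Status", "Status": "passing" }
    ]
  }
]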

I believe Nomad should register the service and its health checks in a single Consul call; otherwise the new service instance is considered healthy during the window before Nomad registers its health check.
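
For illustration, Consul's /v1/agent/service/register endpoint accepts an embedded Check (or Checks) object, so a combined registration could look roughly like this (a sketch reusing the values from the captured requests above, not necessarily the exact payload Nomad would send):

PUT /v1/agent/service/register HTTP/1.1
Host: 127.0.0.1:8500

{"ID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
 "Name":"testservice","Port":27123,"Address":"10.0.75.1",
 "Check":{"Name":"service: \"testservice\" check",
          "TCP":"10.0.75.1:27123","Interval":"10s","Timeout":"2s","Status":"critical"}}

With a single request there is no window in which the service exists in Consul without its check, so the blocking health query never reports it as passing prematurely.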

Labels: theme/consul, type/enhancement

All 4 comments

we use linkerd with nomad too, it works fine (a job spec sketch of these settings follows the list):

1) Set initial_status to critical (https://www.nomadproject.io/docs/job-specification/service.html#initial_status) so the check isn't considered healthy until your app says so
2) Put something like 5-10s in your task shutdown_delay (https://www.nomadproject.io/docs/job-specification/task.html#shutdown_delay) so Linkerd forgets about the alloc before the kill signal is even sent to the process
3) Make sure your task kill_timeout (https://www.nomadproject.io/docs/job-specification/task.html#kill_timeout) is long enough for your task to drain in-flight requests
4) Make sure your client config has a high enough max_kill_timeout (https://www.nomadproject.io/docs/agent/configuration/client.html#max_kill_timeout)
5) success
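
A rough sketch of a task stanza with those settings (the image name, port label, and durations are placeholders, not taken from any real job):

task "testservice" {
  driver = "docker"

  config {
    image = "testservice:latest"   # placeholder image
  }

  # give Linkerd time to drop the allocation before the kill signal is sent
  shutdown_delay = "10s"
  # allow in-flight requests to drain after the kill signal
  kill_timeout = "30s"

  resources {
    network {
      mbits = 10
      port "grpc" {}
    }
  }

  service {
    name = "testservice"
    port = "grpc"

    check {
      type           = "tcp"
      interval       = "10s"
      timeout        = "2s"
      initial_status = "critical"   # stay critical until the TCP check passes
    }
  }
}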

@jippi Thanks for your help. In our case the shutdown part is not the problem; we confirmed it works fine (we use shutdown_delay, ...). The problem is that the new instance is considered healthy before its health check gets registered. We can reproduce this almost every time under heavy load.
I believe that setting initial_status to critical will not change this, for two reasons:

  1. initial_status is a property of the health check itself, and our problem is that Nomad registers the health check separately, a bit later.
  2. If you look at the second PUT request (in the first comment) that Nomad sends to Consul, you can see that even though we do not specify initial_status in the job specification, the check defaults to critical on the Consul side anyway.

Thanks for reporting this @gahebe. We are aware of this and are tracking fixing this in a future release.

Did you ever find a work-around for this issue @gahebe? We have the same problem, and can't afford the downtime on every deployment.
