Nomad: Service maintenance mode check removed by nomad agent from consul

Created on 27 Jul 2018 · 2 Comments · Source: hashicorp/nomad

Nomad & Consul version

Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)
Consul v1.2.0
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Operating system and Environment details

Server - Windows 2016 Datacenter, run on VM
Agent - Windows 2016 Datacenter, inside container microsoft/windowsservercore:latest.

Issue

The service maintenance mode check is removed from Consul by the Nomad agent. This is probably a bug related to #4170.

We have the following setup:
The Nomad agent and the Consul agent are installed on the same VM.
When a Nomad job runs, it successfully registers a service and its health checks, and both are visible from the Consul agent and server.
We then switch the service registered by Nomad into maintenance mode by calling the agent API directly (or from the command line):

consul maint -enable -service=_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw -reason "Testing"
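
For anyone reproducing this through the agent API rather than the CLI, here is a minimal Python sketch of the same call. The endpoint path and query parameters are taken from the Consul agent log in this report; the function names and the default base URL (a local agent on port 8500) are my own assumptions, not part of any Consul client library.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_maintenance_request(service_id, enable=True, reason="",
                              base_url="http://127.0.0.1:8500"):
    """Build a PUT request for the agent's service maintenance endpoint.

    base_url is an assumption: a Consul agent listening on localhost.
    """
    params = {"enable": "true" if enable else "false"}
    if reason:
        params["reason"] = reason
    path = "/v1/agent/service/maintenance/" + quote(service_id, safe="")
    return Request(base_url + path + "?" + urlencode(params), method="PUT")


def set_maintenance(service_id, enable=True, reason=""):
    """Send the request; needs a running local Consul agent."""
    request = build_maintenance_request(service_id, enable, reason)
    with urlopen(request, timeout=5) as response:
        return response.status
```

Calling set_maintenance("_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw", reason="Testing") should produce the same PUT request that appears in the Consul agent log.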

For some time the service is shown in Consul as it should be, in maintenance mode; then it switches back to the normal state.
The logs of both the Nomad and Consul agents show explicitly that the Consul agent receives a request from localhost to deregister the service maintenance health check:

2018/07/27 10:44:12 [INFO] agent: Service "_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw" entered maintenance mode
2018/07/27 10:44:12 [DEBUG] agent: Service "_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw" in sync
2018/07/27 10:44:12 [DEBUG] agent: Check "_service_maintenance:_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw" in sync
2018/07/27 10:44:12 [DEBUG] http: Request PUT /v1/agent/service/maintenance/_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw?enable=true&reason=Testing (42.0073ms) from=127.0.0.1:62209
...
...
2018/07/27 10:44:40 [DEBUG] http: Request PUT /v1/agent/check/deregister/_service_maintenance:_nomad-task-7jnnnudilvhc7up4z6yjvm2vjwx576jw (21.0081ms) from=127.0.0.1:52544

This appears to be triggered by the Nomad agent.
The corresponding log from the Nomad agent:

2018/07/27 10:44:39.764948 [DEBUG] http: Request GET /v1/agent/health?type=client (997.6µs)
2018/07/27 10:44:40.152101 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 1 checks

I believe Nomad should not remove maintenance mode checks from services in Consul in this case, though everything else should be synced as it is now, according to #4170.

Our tests also showed that if the node as a whole is put into maintenance mode, it remains in that state until explicitly taken out of maintenance mode.

P.S. There are no traces of this at all in the Consul server or Nomad server logs.

Reproduction steps

  1. Run a Consul agent and a Nomad agent on the same VM. Point the Nomad agent to Consul on localhost: consul-address=127.0.0.1:8500. It does not matter whether Consul runs in dev mode or the Consul agent connects to a server.
  2. Run a Nomad job with service registration.
  3. Put the service registered by Nomad into maintenance mode.
  4. After some time, the maintenance mode health check is automatically removed.
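
To observe step 4 without watching the UI, one option is to poll the agent's check list until the maintenance check disappears. The sketch below assumes a local agent on port 8500 and relies on the check ID format seen in the Consul log ("_service_maintenance:" followed by the service ID). The helper names are hypothetical.

```python
import json
import time
from urllib.request import urlopen

# Check ID prefix as it appears in the Consul agent log in this report.
MAINTENANCE_PREFIX = "_service_maintenance:"


def maintenance_check_present(checks, service_id):
    """Return True if the maintenance check for service_id is in the map."""
    return MAINTENANCE_PREFIX + service_id in checks


def fetch_checks(base_url="http://127.0.0.1:8500"):
    """Fetch the agent's checks, a JSON object keyed by check ID."""
    with urlopen(base_url + "/v1/agent/checks", timeout=5) as response:
        return json.load(response)


def wait_until_removed(service_id, fetch=fetch_checks, interval=1.0,
                       timeout=120.0, sleep=time.sleep):
    """Poll until the maintenance check is gone.

    Returns the elapsed seconds, or None if the check is still
    present when the timeout expires.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if not maintenance_check_present(fetch(), service_id):
            return time.monotonic() - start
        sleep(interval)
    return None
```

Run wait_until_removed with the Nomad-registered service ID right after enabling maintenance mode; in the logs from this report the check was deregistered roughly 28 seconds after it was created.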

Nomad Client logs (if appropriate)

2018/07/27 10:44:39.764948 [DEBUG] http: Request GET /v1/agent/health?type=client (997.6µs)
2018/07/27 10:44:40.152101 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 1 checks

All 2 comments

@i-prudnikov Thanks for the details and reproduction steps, I confirmed this behavior as well.

As part of #4170 we made an assumption that any checks registered on behalf of Nomad tasks are only created and managed by Nomad, so we remove extraneous checks that Nomad is not aware of. This plays badly with maintenance mode, which caused the behavior you saw.

We'll fix this in an upcoming release so that maintenance mode works. In general, any checks registered out of band for services that Nomad manages should still get removed; that is, if you want to register any checks, use the service stanza in Nomad to do so. Maintenance mode is a special case, so we will fix that.
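
To make the described behavior concrete, here is a rough Python sketch of the reconciliation rule with the maintenance exception added. It is an illustration only and not Nomad's actual code (Nomad is written in Go). The "_nomad-" service ID prefix is inferred from the service ID in this report, and the function name and data shapes are assumptions.

```python
# Prefix of the maintenance check ID as seen in the Consul agent log.
MAINTENANCE_PREFIX = "_service_maintenance:"
# Assumed marker for Nomad-managed services, inferred from the
# "_nomad-task-..." service ID in this report.
NOMAD_SERVICE_PREFIX = "_nomad-"


def checks_to_deregister(consul_checks, known_check_ids):
    """Pick the check IDs a sync pass would remove from Consul.

    consul_checks: map of check ID to check info with a "ServiceID" key.
    known_check_ids: IDs of the checks Nomad registered itself.
    """
    stale = []
    for check_id, check in consul_checks.items():
        if check_id in known_check_ids:
            continue  # registered by Nomad, keep
        if not check.get("ServiceID", "").startswith(NOMAD_SERVICE_PREFIX):
            continue  # not a Nomad-managed service, leave alone
        if check_id.startswith(MAINTENANCE_PREFIX):
            continue  # special case: keep maintenance mode checks
        stale.append(check_id)  # out-of-band check on a Nomad service
    return sorted(stale)
```

Without the third condition, the maintenance check is treated like any other out-of-band check on a Nomad-managed service and is deregistered on the next sync, which matches the "deregistered 0 services, 1 checks" line in the Nomad log.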

@preetapan thank you for the fast reply! We will wait for the next Nomad release.
