Nomad: Nomad does not move jobs out of client node with failed docker driver (service stopped)

Created on 19 Jan 2018 · 3 Comments · Source: hashicorp/nomad

Nomad version

0.7.1

Operating system and Environment details

CentOS 7, Docker version 17.09.0-ce, build afdb6d4
Running on a 4 node cluster (8 CPU, 64GB RAM each)

Issue

When the Docker daemon is stopped on a client node, Nomad does not try to move tasks to another node.
I do have an attrs.driver.docker = 1 constraint, and Nomad properly recognizes the driver failure (at least it logs this), but it tries to restart tasks on the same node over and over again.

Nomad should either move the tasks and try to start them elsewhere, or maybe add a client health check that sets attrs.driver.docker = 0 so that constraints can kick in. (And shouldn't the driver constraint be automatic, since Nomad knows which driver is to be used?)
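To illustrate the second suggestion, here is a minimal sketch of how a driver attribute could gate placement once a health check flips it. It is not Nomad's scheduler code: the `Node` type, the `feasibleNodes` helper, the `driver.docker` map key, and the node names are all hypothetical and exist only for this example.

```go
package main

import "fmt"

// Node is a hypothetical, minimal stand-in for a client node and the
// attributes it advertises. It is not a Nomad type.
type Node struct {
	Name  string
	Attrs map[string]string
}

// feasibleNodes returns the nodes whose attribute `key` equals `want`.
// Nodes that advertise a different value, or no value at all, are skipped.
func feasibleNodes(nodes []Node, key, want string) []Node {
	var out []Node
	for _, n := range nodes {
		if n.Attrs[key] == want {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Assumed scenario: four clients, Docker stopped on node-2.
	nodes := []Node{
		{Name: "node-1", Attrs: map[string]string{"driver.docker": "1"}},
		{Name: "node-2", Attrs: map[string]string{"driver.docker": "0"}},
		{Name: "node-3", Attrs: map[string]string{"driver.docker": "1"}},
		{Name: "node-4", Attrs: map[string]string{"driver.docker": "1"}},
	}
	for _, n := range feasibleNodes(nodes, "driver.docker", "1") {
		fmt.Println("eligible:", n.Name)
	}
}
```

With the attribute set to 0 (or removed) on the failed client, a constraint on the value 1 would leave only the healthy nodes as placement candidates.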

42 seconds | 0 seconds | Restarting | Restart within policy |   | 0
-- | -- | -- | -- | -- | --
25 seconds | 16 seconds | Driver | Downloading image sorintlab/stolon:master-pg9.6 |   | 0
25 seconds | 0 seconds | Driver Failure | failed to initialize task "sentinel-service" for alloc "cd68c665-049e-70e2-e75b-65aa25372ed0": Failed to pull `sorintlab/stolon:master-pg9.6`: dial unix /var/run/docker.sock: connect: no such file or directory |   | 0
25 seconds | 0 seconds | Restarting | Exceeded allowed attempts, applying a delay |   | 0
16 seconds | 9 seconds | Driver | Downloading image sorintlab/stolon:master-pg9.6 |   | 0
16 seconds | 0 seconds | Driver Failure | failed to initialize task "sentinel-service" for alloc "cd68c665-049e-70e2-e75b-65aa25372ed0": Failed to pull `sorintlab/stolon:master-pg9.6`: dial unix /var/run/docker.sock: connect: no such file or directory |   | 0
16 seconds | 0 seconds | Restarting | Restart within policy |   | 0
0 seconds | 15 seconds | Driver | Downloading image sorintlab/stolon:master-pg9.6 |   | 0
0 seconds | 0 seconds | Driver Failure | failed to initialize task "sentinel-service" for alloc "cd68c665-049e-70e2-e75b-65aa25372ed0": Failed to pull `sorintlab/stolon:master-pg9.6`: dial unix /var/run/docker.sock: connect: no such file or directory |   | 0
0 seconds | 0 seconds | Restarting | Restart within policy

When the Docker daemon is started again, these tasks are started. But with a Docker failure on one client node, Nomad does not meet the task count (although it could).

theme/driver/docker theme/scheduling type/enhancement

Garagoth

👍1

Most helpful comment

Hi, thanks for opening this issue. To add to what @preetapan mentioned, we will also add in Nomad 0.8 the concept of ongoing driver health checks, so that if a driver fails, the client will stop advertising this driver until it becomes healthy again.

chelseakomlo on 19 Jan 2018

👍4

All 3 comments

@Garagoth Thanks for reporting this. We are addressing rescheduling of failed allocations in the upcoming Nomad 0.8 release. Reschedule attempts and time intervals will be made configurable as well.