Nomad v0.8.6 (ab54ebcfcde062e9482558b7c052702d4cb8aa1b+CHANGES)
In our environment we have hosts where docker not installed, an we observed that nomad may place tack with docker driver on nodes where docker driver not present
Here is example for node with ID 8be15688 (there is only exec,raw_exec drivers presnt on it)
ruslan@ip-172-30-0-53:~$ nomad status 8be15688
ID = 8be15688
Name = ip-172-30-2-131
Class = openresty
DC = aws
Drain = false
Eligibility = eligible
Status = ready
Uptime = 2178h36m4s
Driver Status = exec,raw_exec
Node Events
Time Subsystem Message
2019-01-30T17:22:22+03:00 Cluster Node reregistered by heartbeat
2019-01-30T17:22:22+03:00 Cluster Node heartbeat missed
2018-12-27T22:03:12+03:00 Cluster Node registered
Allocated Resources
CPU Memory Disk IOPS
800/5900 MHz 1.9 GiB/3.4 GiB 300 MiB/14 GiB 0/0
Allocation Resource Utilization
CPU Memory
1109/5900 MHz 743 MiB/3.4 GiB
Host Resource Utilization
CPU Memory Disk
1507/6000 MHz 1.3 GiB/3.6 GiB 1.6 GiB/15 GiB
Allocations
ID Node ID Task Group Version Desired Status Created Modified
307d08bd 8be15688 dynamotoelastic 22 run failed 6m46s ago 6m8s ago
0f374e3c 8be15688 nginx 2 run running 3mo18h ago 2mo27d ago
As you can see allocation 307d08bd have failed state, with follow events for it
Recent Events:
Time Type Description
2019-03-28T16:32:01+03:00 Not Restarting Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2019-03-28T16:32:01+03:00 Driver Failure failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-41d7-164c00e4c99e": Failed to pull `027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17`: dial unix /var/run/docker.sock: connect: no such file or directory
2019-03-28T16:32:01+03:00 Driver Downloading image 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17
2019-03-28T16:31:43+03:00 Restarting Task restarting in 18.228554529s
2019-03-28T16:31:43+03:00 Driver Failure failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-41d7-164c00e4c99e": Failed to pull `027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17`: dial unix /var/run/docker.sock: connect: no such file or directory
2019-03-28T16:31:43+03:00 Driver Downloading image 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17
2019-03-28T16:31:28+03:00 Restarting Task restarting in 15.223446572s
2019-03-28T16:31:28+03:00 Driver Failure failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-41d7-164c00e4c99e": Failed to pull `027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17`: dial unix /var/run/docker.sock: connect: no such file or directory
2019-03-28T16:31:28+03:00 Driver Downloading image 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17
2019-03-28T16:31:28+03:00 Task Setup Building Task Directory
this allocation require docker driver and it is logical that it could not run on the node without Docker, finally nomad reschedule policy can place task on right node, but i think that is fully wrong that it even not must trying to place task with docker dependency on node without docker driver
You're absolutely right that this should not happen! Thanks for the report!
Classifying this as a client bug for now as my suspicion is that it's a bug when reregistering the node after it missed a heartbeat.
Does this only occur on nodes with "heartbeat missed" node events? Do you have client logs from one of the time periods when the heartbeat was missed?
@schmichael I doesn't think that this is dues missed heatbeats. On node this happens at 30 January 2019
2019-01-30T17:22:22+03:00 Cluster Node heartbeat missed
But when we observe this issue was 28 March 2019, so between this event about 2 month of difference
And here logs from client 8be15688(where allocation was placed and where no docker driver). We took time from 2019-03-28 15:00 which i think enough, to capture events, that was thrown when issue happened
ruslan@ip-172-30-2-131:~$ sudo journalctl -u nomad -S"2019-03-28 15:00"
-- Logs begin at Fri 2019-03-08 07:37:08 MSK, end at Fri 2019-03-29 00:11:04 MSK. --
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]: 2019/03/28 16:31:28.338687 [ERR] driver.docker: failed pulling container 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoe
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]: 2019/03/28 16:31:28.339683 [WARN] client: error from prestart: failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:31:28.464+0300 [DEBUG] plugin: starting plugin: path=/opt/nomad/nomad_0.8.4/nomad args="[/opt/nomad/nomad_0.8.4/nomad executor
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:31:28.466+0300 [DEBUG] plugin: waiting for RPC address: path=/opt/nomad/nomad_0.8.4/nomad
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:31:28.565+0300 [DEBUG] plugin.nomad: plugin address: timestamp=2019-03-28T16:31:28.565+0300 address=/tmp/plugin240231881 networ
Mar 28 16:31:43 ip-172-30-2-131 nomad.sh[1290]: 2019/03/28 16:31:43.563837 [ERR] driver.docker: failed pulling container 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoe
Mar 28 16:31:43 ip-172-30-2-131 nomad.sh[1290]: 2019/03/28 16:31:43.563867 [WARN] client: error from prestart: failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-
Mar 28 16:32:01 ip-172-30-2-131 nomad.sh[1290]: 2019/03/28 16:32:01.793287 [ERR] driver.docker: failed pulling container 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoe
Mar 28 16:32:01 ip-172-30-2-131 nomad.sh[1290]: 2019/03/28 16:32:01.793323 [WARN] client: error from prestart: failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-
Mar 28 16:32:01 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:32:01.811+0300 [DEBUG] plugin.nomad: 2019/03/28 16:32:01 [ERR] plugin: plugin server: accept unix /tmp/plugin240231881: use of
Mar 28 16:32:01 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:32:01.812+0300 [DEBUG] plugin: plugin process exited: path=/opt/nomad/nomad_0.8.4/nomad
@tantra35 Interesting, thanks for posting that. Can you grep for other log lines containing "docker" -- especially directly before the Docker task was scheduled on the node?
Have these hosts ever had Docker installed?
@schmichael we hev actual logs from 08-03-2019 (8 March 2019)
ruslan@ip-172-30-2-131:~$ sudo journalctl -u nomad --no-pager
-- Logs begin at Fri 2019-03-08 07:37:08 MSK, end at Fri 2019-03-29 14:25:12 MSK. --
too old logs already rotated, and there only messages contains "docker" that I mention before, no other.
And on those host docker never was been installed
@tantra35 It fear like #6227 might be different. Would you be able to confirm dynamotoelastic taskgroup allocation had multiple tasks requiring exec or raw_exec driver?
Most helpful comment
@schmichael I doesn't think that this is dues missed heatbeats. On node this happens at 30 January 2019
But when we observe this issue was 28 March 2019, so between this event about 2 month of difference
And here logs from client
8be15688(where allocation was placed and where no docker driver). We took time from2019-03-28 15:00which i think enough, to capture events, that was thrown when issue happened