Nomad v0.5.0
Ubuntu 16.04 x86_64
Docker version 1.12.1, build 23cf638 (apt install docker.io)
When tasks on a client system are interrupted (for example by rebooting the machine, or by stopping the Nomad and Docker services), they fail to restart: Nomad cannot start the tasks because it finds the previous container still present.
```
12/12/16 15:21:54 UTC  Driver Failure  failed to start task 'mytask' for alloc 'cb2604b0-fafe-a05b-66f0-484caedba5ce': Failed to create container: container already exists
```
Running `docker rm $(docker ps -aq)` allows them to start again. For obvious reasons, adding a script that removes all Docker containers on boot would not be a good solution.
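For reference, a narrower variant of that workaround would remove only stopped containers rather than everything. This is a sketch, not a recommended fix (it would still delete stopped containers that Nomad expects to reattach to); the command is printed rather than executed here since it is illustrative:

```shell
#!/bin/sh
# Sketch of a narrower cleanup than `docker rm $(docker ps -aq)`:
# it targets only containers in the "exited" state, so running
# containers (and all images) are left untouched. The command is
# stored and echoed instead of executed, since this is illustrative.
cmd='docker ps -aq --filter status=exited | xargs -r docker rm'
echo "$cmd"
```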
The Nomad server and client logs do not contain anything relevant to this error.
Related to parts of the discussion on #2016
Can you try RC2? We made quite a few improvements to the Docker driver attempting to remedy this issue: https://releases.hashicorp.com/nomad/0.5.1-rc2/
Thanks! I just tried it and could not reproduce the issue with RC2.
Having the same core issue, although the scenario is a bit different:
```
12/22/16 16:27:43 CET  Restarting      Task restarting in 25.881394077s
12/22/16 16:27:43 CET  Driver Failure  failed to start task 'traefik' for alloc '06fecd6d-81d9-7a16-5c82-c770743d68d8': Failed to create container: container already exists
```
Didn't help:
```
# nomad --version
Nomad v0.5.1-rc2 ('6f2ccf22be738a31cb2153c7e43422c4ba9a0e3f+CHANGES')

# nomad status
ID               Type     Priority  Status
jenkins-master   service  50        dead
nexus            service  50        dead
registry         service  50        dead
selenium-chrome  service  50        dead
selenium-hub     service  50        dead
traefik          system   60        running
```
```
# nomad status -verbose traefik
ID          = traefik
Name        = traefik
Type        = system
Priority    = 60
Datacenters = amersfoort
Status      = running
Periodic    = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
loadbalancing  0       3         0        8       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
34a26bf4-4147-4c25-0b8e-09881eabd0e0  60        job-register  complete  false

Allocations
ID                                    Eval ID                               Node ID                               Task Group     Desired  Status   Created At
d3b6a7d3-a436-f6ca-d470-f672fb164099  34a26bf4-4147-4c25-0b8e-09881eabd0e0  cdaaa02d-40ee-d341-197b-7eee724babfb  loadbalancing  run      pending  12/22/16 15:02:21 CET
f73affb6-6173-07b6-4e1a-1c80bcc6cd3c  34a26bf4-4147-4c25-0b8e-09881eabd0e0  ece479e0-1740-aea8-88d7-5f93c57696fc  loadbalancing  run      pending  12/22/16 15:02:21 CET
06fecd6d-81d9-7a16-5c82-c770743d68d8  34a26bf4-4147-4c25-0b8e-09881eabd0e0  ee191d9f-509f-afc7-c096-54fc7c10c8bb  loadbalancing  run      pending  12/21/16 17:02:59 CET
```
```
# nomad alloc-status -verbose d3b6a7d3-a436-f6ca-d470-f672fb164099
ID                 = d3b6a7d3-a436-f6ca-d470-f672fb164099
Eval ID            = 34a26bf4-4147-4c25-0b8e-09881eabd0e0
Name               = traefik.loadbalancing[0]
Node ID            = cdaaa02d-40ee-d341-197b-7eee724babfb
Job ID             = traefik
Client Status      = pending
Client Description = <none>
Created At         = 12/22/16 15:02:21 CET
Evaluated Nodes    = 1
Filtered Nodes     = 0
Exhausted Nodes    = 0
Allocation Time    = 16.788µs
Failures           = 0

Task "traefik" is "pending"

Task Resources
CPU      Memory   Disk  IOPS  Addresses
500 MHz  128 MiB  0 B   0     http: <IP>:9999
                              ui: <IP>:9998

Recent Events:
Time                   Type            Description
12/22/16 16:55:16 CET  Restarting      Task restarting in 30.382471444s
12/22/16 16:55:16 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:54:51 CET  Restarting      Task restarting in 25.16167252s
12/22/16 16:54:51 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:54:25 CET  Restarting      Task restarting in 25.722048638s
12/22/16 16:54:25 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:53:58 CET  Restarting      Task restarting in 27.098521822s
12/22/16 16:53:58 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:53:29 CET  Restarting      Task restarting in 29.045220368s
12/22/16 16:53:29 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists

Placement Metrics
* Score "cdaaa02d-40ee-d341-197b-7eee724babfb.binpack" = 2.587686
```
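As far as I can tell, Nomad's Docker driver names containers `<task>-<alloc-id>` (this naming convention is an assumption about the 0.5.x driver, not something stated in this thread), so the leftover container blocking the allocation above could be identified and removed individually, without touching anything else:

```shell
#!/bin/sh
# Build the container name the Docker driver would have used for the
# stuck allocation shown above, assuming the "<task>-<alloc-id>"
# naming convention. Only the name is printed here; the actual
# removal command is left as a comment.
task=traefik
alloc=d3b6a7d3-a436-f6ca-d470-f672fb164099
echo "${task}-${alloc}"
# then, to remove just that one container:
#   docker rm "${task}-${alloc}"
```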
I installed the new version and removed the old containers, and Nomad rescheduled everything. Then I redeployed the whole cluster and Nomad can't schedule them again.
@mlushpenko What do you mean by "redeploy the whole cluster"?
@dadgar our pipeline invokes Ansible playbooks that deploy Consul first, then Nomad, dnsmasq, and Docker to several VMs.
When we deploy services like GitLab, we invoke another pipeline via Jenkins that deploys containers to the Nomad cluster. But sometimes we need to update the cluster itself (Docker/Consul/Nomad); then we invoke the "base" pipeline and encounter the issues mentioned above. "Redeploy the whole cluster" = invoke the "base" pipeline.
@mlushpenko When you run that base pipeline are you doing in-place upgrades of Nomad/Docker or starting a new VM?
Stopping the docker engine is not really advisable.
@dadgar in-place upgrades; within the client's legacy infrastructure, spinning up VMs on demand is not an option.
Docker is restarted and reloaded whenever there are changes to the Docker config. Also, I'm not sure our playbooks are 100% idempotent right now, but I hope that's not a problem; in other issues I saw that Nomad is expected to handle VM failures and container failures (and I would consider restarting Docker a temporary container failure).
Hello,
is there anyone who could help solve this?
I got the same error reported on this issue! I am using Nomad 0.7.0.
Here is the log I got from the UI.

@weslleycamilo This is a regression due to Docker changing the returned error code, which broke our error handling. It will be resolved in https://github.com/hashicorp/nomad/pull/3513, which will be part of 0.7.1.
@dadgar hmm, great, but do you know which version still works? I tried version 0.6.3 and got the same error.
@weslleycamilo It depends on the Nomad and Docker Engine pairing. Docker changed their error message recently (not exactly sure on the version) and thus the error handling we have wasn't being triggered. The new error handling should be robust against both versions of the error message Docker returns.
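The kind of check described above can be sketched as follows. The first error string comes from the logs in this thread; the second is an assumption about the newer Docker wording, not a quote from Nomad's actual fix in #3513:

```shell
#!/bin/sh
# Sketch of error handling robust to both phrasings of Docker's
# name-conflict error. The "is already in use by container" string is
# an assumption about the post-change wording.
is_name_conflict() {
  case "$1" in
    *"container already exists"*) return 0 ;;            # older Docker
    *"is already in use by container"*) return 0 ;;      # newer Docker (assumed)
    *) return 1 ;;
  esac
}

is_name_conflict "Failed to create container: container already exists" && echo "matches old wording"
is_name_conflict 'Conflict. The container name "/traefik" is already in use by container "abc123"' && echo "matches new wording"
```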
@dadgar Do you know of any Nomad documentation that says which Nomad and Docker versions are compatible?
It seems Nomad 0.7.0 is not production ready; it would be critical if I can't get the Docker container back after restarting Docker or the host.
Can I keep using Nomad 0.6.3? Which Docker version works with it? At the moment I am evaluating Nomad for production, but I can't stay stuck on this issue.
@weslleycamilo There is a bug in Docker 17.09 that broke Nomad's name-conflict code path. This meant that on Docker daemon restarts, Nomad would be unable to restart the container. I've attached a test binary to the PR if you're able to give it a shot! #3551
Hi @schmichael ,
I've been testing it for a while and it is working properly now.
Great job.
Thank you.
Thank you @dadgar