Nomad: Interrupted tasks in Docker fail restarting due to "container already exists"

Created on 12 Dec 2016 · 15 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.5.0

Operating system and Environment details

Ubuntu 16.04 x86_64
Docker version 1.12.1, build 23cf638 (apt install docker.io)

Issue

When tasks on a client are interrupted, for example by rebooting the host or stopping the Nomad and Docker services, they fail to restart: Nomad cannot start the tasks again because it finds the previous container still present.

Reproduction steps

  • Setup two hosts, {1} with Nomad client+server and {2} with Nomad client
  • Start a job with count ≥2 so tasks are started in containers on both hosts
  • Restart host {2}
  • When Nomad tries to reschedule the tasks on {2}, their allocations fail with the following error:
12/12/16 15:21:54 UTC  Driver Failure  failed to start task 'mytask' for alloc 'cb2604b0-fafe-a05b-66f0-484caedba5ce': Failed to create container: container already exists
  • These tasks stay "Starting" and never switch to "Running" due to the above error. Removing the containers with docker rm $(docker ps -aq) allows them to start again (see the commands below).
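For reference, a quick check and the manual workaround on host {2}; note that docker rm $(docker ps -aq) removes every container on the host, not only the stale Nomad-managed ones:

# On host {2} after the reboot: the previous task containers are still present
# in the "Exited" state, and Nomad's new create call conflicts with their names.
docker ps -a --filter status=exited

# Manual workaround from the report: remove all containers so Nomad can
# recreate them.
docker rm $(docker ps -aq)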

For obvious reasons, adding a script that removes all Docker containers on boot would not be a good solution.

Nomad Server or Client logs do not contain anything relevant to this error.

Related to parts of the discussion on #2016



All 15 comments

Can you try RC2? We made quite a few improvements to the Docker driver attempting to remedy this issue: https://releases.hashicorp.com/nomad/0.5.1-rc2/

Thanks! Just tried it and I could not reproduce the issue with RC2.

Having the same core issue, although scenario is a bit different:

  • run services on nomad
  • redeploy the whole cluster which restarts docker daemon
  • nomad job stays in state pending and alloc-status shows
12/22/16 16:27:43 CET  Restarting      Task restarting in 25.881394077s
12/22/16 16:27:43 CET  Driver Failure  failed to start task 'traefik' for alloc '06fecd6d-81d9-7a16-5c82-c770743d68d8': Failed to create container: container already exists
  • After removing the container manually (e.g. as sketched below), nomad is able to reschedule
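A more targeted version of that manual step, assuming the Docker driver embeds the alloc ID in the container name (the alloc ID below is only illustrative, taken from the events above):

# Remove only the container blocking one failing allocation, using the alloc ID
# from the Driver Failure event. Assumes container names look like <task>-<alloc-id>.
ALLOC_ID=06fecd6d-81d9-7a16-5c82-c770743d68d8   # from `nomad alloc-status`
docker ps -a --format '{{.Names}}' | grep "$ALLOC_ID" | xargs -r docker rm -f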

Didn't help:

# nomad --version
Nomad v0.5.1-rc2 ('6f2ccf22be738a31cb2153c7e43422c4ba9a0e3f+CHANGES')

# nomad status
ID               Type     Priority  Status
jenkins-master   service  50        dead
nexus            service  50        dead
registry         service  50        dead
selenium-chrome  service  50        dead
selenium-hub     service  50        dead
traefik          system   60        running
# nomad status -verbose traefik
ID          = traefik
Name        = traefik
Type        = system
Priority    = 60
Datacenters = amersfoort
Status      = running
Periodic    = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
loadbalancing  0       3         0        8       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
34a26bf4-4147-4c25-0b8e-09881eabd0e0  60        job-register  complete  false

Allocations
ID                                    Eval ID                               Node ID                               Task Group     Desired  Status   Created At
d3b6a7d3-a436-f6ca-d470-f672fb164099  34a26bf4-4147-4c25-0b8e-09881eabd0e0  cdaaa02d-40ee-d341-197b-7eee724babfb  loadbalancing  run      pending  12/22/16 15:02:21 CET
f73affb6-6173-07b6-4e1a-1c80bcc6cd3c  34a26bf4-4147-4c25-0b8e-09881eabd0e0  ece479e0-1740-aea8-88d7-5f93c57696fc  loadbalancing  run      pending  12/22/16 15:02:21 CET
06fecd6d-81d9-7a16-5c82-c770743d68d8  34a26bf4-4147-4c25-0b8e-09881eabd0e0  ee191d9f-509f-afc7-c096-54fc7c10c8bb  loadbalancing  run      pending  12/21/16 17:02:59 CET
# nomad alloc-status -verbose d3b6a7d3-a436-f6ca-d470-f672fb164099
ID                 = d3b6a7d3-a436-f6ca-d470-f672fb164099
Eval ID            = 34a26bf4-4147-4c25-0b8e-09881eabd0e0
Name               = traefik.loadbalancing[0]
Node ID            = cdaaa02d-40ee-d341-197b-7eee724babfb
Job ID             = traefik
Client Status      = pending
Client Description = <none>
Created At         = 12/22/16 15:02:21 CET
Evaluated Nodes    = 1
Filtered Nodes     = 0
Exhausted Nodes    = 0
Allocation Time    = 16.788µs
Failures           = 0

Task "traefik" is "pending"
Task Resources
CPU      Memory   Disk  IOPS  Addresses
500 MHz  128 MiB  0 B   0     http: <IP>:9999
                              ui: <IP>:9998

Recent Events:
Time                   Type            Description
12/22/16 16:55:16 CET  Restarting      Task restarting in 30.382471444s
12/22/16 16:55:16 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:54:51 CET  Restarting      Task restarting in 25.16167252s
12/22/16 16:54:51 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:54:25 CET  Restarting      Task restarting in 25.722048638s
12/22/16 16:54:25 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:53:58 CET  Restarting      Task restarting in 27.098521822s
12/22/16 16:53:58 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:53:29 CET  Restarting      Task restarting in 29.045220368s
12/22/16 16:53:29 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists

Placement Metrics
  * Score "cdaaa02d-40ee-d341-197b-7eee724babfb.binpack" = 2.587686

I installed the new version, removed the old containers, and nomad rescheduled everything; then I did a redeployment of the whole cluster and nomad can't schedule them again.

@mlushpenko What do you mean redeploy the whole cluster?

@dadgar our pipeline invokes Ansible playbooks that deploy consul in the beginning, then nomad, dnsmasq and docker to several VMs.

When we deploy services like Gitlab, we invoke another pipeline via Jenkins that deploys containers to nomad cluster. But sometimes we need to update the cluster itself: docker/consul/nomad, then we invoke the "base" pipeline, but encounter issues mentioned above. Redeploy the whole cluster = invoke "base" pipeline.

@mlushpenko When you run that base pipeline are you doing in-place upgrades of Nomad/Docker or starting a new VM?

Stopping the docker engine is not really advisable.

@dadgar in-place upgrades - within client's legacy infrastructure spinning up VMs on-demand is not an option.

Docker is restarted and reloaded if there are changes to the Docker config. Also, I'm not sure our playbooks are 100% idempotent right now, but I hope that's not a problem - in some other issues I saw that Nomad should handle VM failures and container failures (I would consider restarting Docker a temporary container failure).
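As an aside, and not something discussed in this thread: Docker 1.12+ has a live-restore option that keeps containers running while the daemon itself is restarted, which may soften the impact of this kind of in-place upgrade; a minimal sketch:

# /etc/docker/daemon.json - keep containers alive across docker daemon restarts.
cat >/etc/docker/daemon.json <<'EOF'
{
  "live-restore": true
}
EOF
# The daemon must be restarted (or sent SIGHUP) once to pick this up;
# only subsequent daemon restarts will then leave containers running.
systemctl restart docker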

Hello,

is there anyone who could help solve this?

I got the same error reported on this issue! I am using Nomad 0.7.0.

Here is the log I got from the UI.

[screenshot from the Nomad UI showing the "Failed to create container: container already exists" driver failure]

@weslleycamilo This is a regression due to Docker changing the returned error code and thus breaking our error handling. Will be resolved in https://github.com/hashicorp/nomad/pull/3513 which will be part of 0.7.1

@dadgar hmm, great, but which version is it still working with? Do you know? I tried version 0.6.3 and got the same error.

@weslleycamilo It depends on the Nomad and Docker Engine pairing. Docker changed their error message recently (not exactly sure on the version) and thus the error handling we have wasn't being triggered. The new error handling should be robust against both versions of the error message Docker returns.
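The name conflict itself is easy to reproduce outside Nomad, and the exact wording of the error differs between Docker releases, which is what the error handling tripped over (the image and container name here are just examples):

# Creating a second container with an already-used name fails.
docker create --name conflict-demo redis >/dev/null
docker create --name conflict-demo redis
# Recent engines answer with something like:
#   The container name "/conflict-demo" is already in use by container "<id>" ...
docker rm conflict-demo   # clean up the demo container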

@dadgar Do you know of any Nomad documentation that says which Nomad and Docker versions are compatible?

It seems Nomad 0.7.0 is not production ready... it would be critical if I can't get the container back after a Docker restart or a host restart.

Can I keep using Nomad 0.6.3? Which Docker version used to work with it? At the moment I am testing Nomad to go to production with it, but I believe I shouldn't be stuck on this issue.

@weslleycamilo There is a bug in Docker 17.09 that broke Nomad's name conflict code path. This meant that on Docker daemon restarts Nomad would be unable to restart the container. I've attached a test binary to the PR if you're able to give it a shot! #3551

Hi @schmichael ,
I've been testing it for a while and it is working properly now.

Great job.
Thank you.

Thank you @dadgar
