Nomad v0.4.1
CentOS 7.2.1511, kernel 3.10.0-327.22.2.el7.x86_64
The Docker socket was unavailable for a short period of time (<30 s). Nomad then marked the jobs as dead. They were not rescheduled on another node, but simply abandoned. Additionally, when the socket became available again, Nomad did not stop the containers it had left running.
This is problematic: when we noticed that nomad status reported the services as dead, we ran nomad run some-service.job-file and they started, but we then had twice as many running containers as we thought we did.
We did maintenance on three nodes (which caused the socket to be unavailable) and all three exhibited this behavior.
Remove access to the socket (not sure if this is relevant, but in our case the socket is not the actual Docker socket but a Weave proxy socket, which was the component under maintenance)
https://gist.github.com/carlpett/f2c25dfc456d8c54663aa836cd4fbc7e
@carlpett Nomad stops its driver the moment it thinks the container is dead. And currently there is no reconcile loop in the client to ensure that older tasks the client thought were dead are actually killed if they reappear somehow.
@diptanu Yes, I see what was happening. However, this is a serious impediment to adopting Nomad for us (suddenly ending up with duplicate services without being able to detect it), and probably a pretty common concern?
I'm thinking some sort of mechanism for this would be required for every driver?
@diptanu Are there any plans for implementing this soon? We had an incident a few weeks ago in which Nomad moved jobs because a client node was unreachable (network partition), but did not reap/stop them when the node became reachable again. The jobs seem to have been marked as lost, so shouldn't it be sufficient for the client to check the server for the desired job state when it re-establishes contact (or even with periodic polling)?
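For reference, the server's view of what should be running on a node is already queryable by hand through the HTTP API, so the comparison itself seems cheap; e.g. (assuming a local agent on the default port 4646 and jq installed):

```bash
# NODE_ID is the client's node ID, e.g. from `nomad node-status -self -verbose`.
# DesiredStatus is what the scheduler wants; ClientStatus is what the client last reported.
curl -s "http://localhost:4646/v1/node/${NODE_ID}/allocations" \
  | jq '.[] | {ID, JobID, DesiredStatus, ClientStatus}'
```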
@carlpett Yeah, we will do this in the next few releases. We will add some reconciliation on the drivers to handle this case.
We are still seeing this in production. It seems like a few releases have come and gone. Is there any workaround for this?
@memelet Until there is an internal reconciler, you would have to run a script that detects which allocations should be running and manually stops those that shouldn't be (a rough sketch is below). Work in 0.8 should mitigate this by detecting unhealthy drivers and not placing more allocations on them, but future releases will properly address it.
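A minimal sketch of such a script, assuming the Docker driver's default container naming of <task-name>-<allocation-id>, a local agent on the default port 4646, and jq installed (all of which may differ in your setup):

```bash
#!/usr/bin/env bash
# Sketch: flag Docker containers whose Nomad allocation is no longer supposed
# to be running. Only meaningful for containers that Nomad itself started.
set -euo pipefail

NODE_ID="$1"   # this client's node ID, e.g. from `nomad node-status -self -verbose`

# Allocation IDs the server still wants running on this node.
should_run=$(curl -s "http://localhost:4646/v1/node/${NODE_ID}/allocations" \
  | jq -r '.[] | select(.DesiredStatus == "run") | .ID')

# The docker driver names containers <task-name>-<allocation-id>, so a container
# whose name contains none of the "should run" allocation IDs is a candidate orphan.
docker ps --format '{{.Names}}' | while read -r name; do
  orphan=true
  for alloc in $should_run; do
    case "$name" in *"$alloc"*) orphan=false ;; esac
  done
  if $orphan; then
    echo "possibly orphaned container: $name"
    # docker stop "$name"   # uncomment once the output looks right
  fi
done
```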
Doing some old issue cleanup. This is fixed in 0.10.2 with the addition of the Docker reconciler loop.