Nomad v0.4.1
CentOS 7.2.1511, kernel 3.10.0-327.22.2.el7.x86_64
The Docker socket was unavailable for a short period of time (<30 s). Nomad then marked the jobs as dead. They were not rescheduled on another node, but simply abandoned. Additionally, when the socket became available again, Nomad did not stop the containers it had left running.
This is problematic: when we noticed that nomad status reported the services as dead, we ran nomad run some-service.job-file and they started, but we then had twice as many running containers as we thought we did.
We did maintenance on three nodes (which caused the socket to be unavailable) and all three exhibited this behavior.
Remove access to the socket (not sure if this is relevant, but in our case the socket is not the actual Docker socket but a Weave proxy socket, which was the component under maintenance)
https://gist.github.com/carlpett/f2c25dfc456d8c54663aa836cd4fbc7e
@carlpett Nomad stops its driver the moment it thinks the container is dead. And currently there is no reconcile loop in the client to ensure that older tasks the client thought were dead are actually killed if they reappear somehow.
@diptanu Yes, I see what was happening. However, this is a serious impediment to adopting Nomad for us (suddenly ending up with duplicate services without being able to detect it), and probably a pretty common concern?
I'm thinking some sort of mechanism for this would be required for every driver?
@diptanu Are there any plans for implementing this soon? We had an incident a few weeks ago in which Nomad moved jobs because a client node was unreachable (network partition), but did not reap/stop them when the node became reachable again. The jobs seem to have been marked as lost, so shouldn't it be sufficient for the client to check the server for the desired job state when it re-establishes contact (or even with periodic polling)?
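For reference, the server's view of what should be running on a node is already queryable by hand through the HTTP API, so the comparison itself seems cheap; e.g. (assuming a local agent on the default port 4646 and jq installed):

```bash
# NODE_ID is the client's node ID, e.g. from `nomad node-status -self -verbose`.
# DesiredStatus is what the scheduler wants; ClientStatus is what the client last reported.
curl -s "http://localhost:4646/v1/node/${NODE_ID}/allocations" \
  | jq '.[] | {ID, JobID, DesiredStatus, ClientStatus}'
```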
@carlpett Yeah, we will do this in the next few releases. We will add some reconciliation on the drivers to handle this case.
We are still seeing this in production. It seems like a few releases have come and gone. Is there any workaround for this?
@memelet Until there is an internal reconciler, you would have to run a script that detects which allocations should be running and manually stops those that shouldn't be (a rough sketch is below). Work in 0.8 should mitigate this by detecting unhealthy drivers and not placing more allocations on them, but future releases will properly address it.
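A minimal sketch of such a script, assuming the Docker driver's default container naming of <task-name>-<allocation-id>, a local agent on the default port 4646, and jq installed (all of which may differ in your setup):

```bash
#!/usr/bin/env bash
# Sketch: flag Docker containers whose Nomad allocation is no longer supposed
# to be running. Only meaningful for containers that Nomad itself started.
set -euo pipefail

NODE_ID="$1"   # this client's node ID, e.g. from `nomad node-status -self -verbose`

# Allocation IDs the server still wants running on this node.
should_run=$(curl -s "http://localhost:4646/v1/node/${NODE_ID}/allocations" \
  | jq -r '.[] | select(.DesiredStatus == "run") | .ID')

# The docker driver names containers <task-name>-<allocation-id>, so a container
# whose name contains none of the "should run" allocation IDs is a candidate orphan.
docker ps --format '{{.Names}}' | while read -r name; do
  orphan=true
  for alloc in $should_run; do
    case "$name" in *"$alloc"*) orphan=false ;; esac
  done
  if $orphan; then
    echo "possibly orphaned container: $name"
    # docker stop "$name"   # uncomment once the output looks right
  fi
done
```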
Doing some old issue cleanup. This is fixed in 0.10.2 with the addition of the Docker reconciler loop.