Nomad: Stuck allocation on dead job

Created on 16 Feb 2016 · 6 Comments · Source: hashicorp/nomad

I'm new to all this so maybe I've just missed something but I appear to have an orphan allocation from a dead job that failed to completely start.

Context: Running v0.3.0-rc1 in the dev environment created by the included Vagrantfile, with the agent in `-dev` mode (a single agent acting as both server and client).

I started with the example.nomad file created by `nomad init`: I changed the existing task to run a mysql container and added a second task to run an apache container. I started the job with `nomad run`, but it failed to complete because I'd typo'd the apache container image name.

At this point I had a mysql container running but no apache container. So I edited the job to correct my typo and called nomad run again. My understanding was that it would evaluate the difference and just start the apache container (because the mysql container was already running).

However, it actually re-evaluated the entire job and started both the apache container _and_ a second mysql container, while leaving the original container running. Note that I have not changed the name of the job or the task group (I left them as example and cache, as per the original job config).

So I called nomad stop, thinking it would clean everything up, but it only stopped the new containers and left the original mysql container running. I thought maybe nomad had 'forgotten' about it, so I killed it with Docker directly - but nomad put it back.
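
For reference, the cleanup attempt boiled down to this (the container ID is a placeholder; see the docker ps output below for the real one):

> nomad stop example          # stops the job, but only the newer allocations were torn down
> docker kill <container-id>  # kill the original mysql container directly
> docker ps                   # moments later nomad has started a replacement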

So now I have a mysql container that nomad is keeping alive but no job to control it with.

> nomad status example
No job(s) with prefix or id "example" found
> docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                  NAMES
266885ac1ee4        mysql:latest        "/entrypoint.sh mysql"   33 minutes ago      Up 33 minutes       127.0.0.1:23968->3306/tcp, 127.0.0.1:23968->3306/udp   mysql-bc506dfd-6351-ab4e-ad23-95c3fd971baa
> nomad alloc-status bc506dfd
ID              = bc506dfd
Eval ID         = 3226e9b9
Name            = example.cache[0]
Node ID         = f8e6eacc
Job ID          = example
Client Status   = failed
Evaluated Nodes = 1
Filtered Nodes  = 0
Exhausted Nodes = 0
Allocation Time = 2.21072ms
Failures        = 0

==> Task "apache" is "dead"
Recent Events:
Time                   Type            Description
16/02/16 21:33:52 UTC  Driver Failure  failed to start: Failed to pull `apache:latest`: Error: image library/apache not found

==> Task "mysql" is "running"
Recent Events:
Time                   Type        Description
16/02/16 21:49:12 UTC  Started     <none>
16/02/16 21:48:39 UTC  Terminated  Exit Code: 0
16/02/16 21:34:23 UTC  Started     <none>

==> Status
Allocation "bc506dfd" status "failed" (0/1 nodes filtered)
  * Score "f8e6eacc-46f7-18b0-df52-350346732e60.binpack" = 7.683003

So I'm not quite sure what to do next and I'm pretty certain this is not expected behaviour.

Any thoughts anyone?

Labels: theme/scheduling, type/bug

All 6 comments

@far-blue I could reproduce this, and thanks for reporting.

I ran into this exact issue when trying out 0.3.0-rc2. As far as I can tell the only way to clear out the orphaned allocation is to clobber the nomad servers and remove all existing state :/
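
Roughly, that reset amounts to the following; the unit name and data_dir path are assumptions for a typical setup, not anything specific to this environment:

> sudo systemctl stop nomad   # stop the server agent (unit name assumed)
> sudo rm -rf /var/nomad/*    # wipe all persisted state; /var/nomad is an assumed data_dir
> sudo systemctl start nomad
> nomad status                # no jobs or allocations survive the wipe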

@diptanu, is there someone actively working on this? If not I would be willing to take a crack at it.

@dgshep Yes! We might be able to tackle this in the next release.

Very cool. BTW Congrats on the C1M project! Stellar stuff...

I am seeing this on Nomad v0.5.4.

I had a job that no longer exists with an allocation stuck on a node, trying to pull a container that no longer exists and receiving a 400 from the registry.
It's been doing this for a couple of weeks without getting cleaned up, so tonight I decided to restart the nomad agent which allowed the task to be killed.
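
For reference, the restart was just the standard agent restart on the node holding the stuck allocation (unit name assumed), after which the task could finally be killed:

> sudo systemctl restart nomad
> nomad node-status           # confirm the node re-registers as ready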

Is it a regression, or have I triggered something completely new for some reason?
