Nomad: [0.6.0dev c96995b] allocation does not exist, stuck in pending

Created on 28 Jun 2017 · 8 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.6.0-dev (c96995b3f56b787be7356ae3c639b451519b3b09+CHANGES)
This was the latest stable build for Linux on the CI server at the time (https://travis-ci.org/hashicorp/nomad/builds/245519967).

Operating system and Environment details

Ubuntu 16.04.2 LTS

Issue

The allocation folder is not created.
After stopping and starting correctly for a few hours, tasks stay in "pending" and the allocation does not exist. Once this happens, it is reproducible every time.

Reproduction steps

Stop and start a system-type job (zookeeper) using the raw_exec driver, continuously, for a few hours.
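
For reference, a minimal sketch of a job in this shape. Only the system type, the raw_exec driver, and the job/group names are taken from the report; the datacenter and script path are placeholders.

```hcl
# Hypothetical reconstruction of the job being stopped and started.
job "zookeeper" {
  datacenters = ["dc1"] # placeholder
  type        = "system"

  group "zk" {
    task "zookeeper" {
      driver = "raw_exec"

      config {
        # hypothetical start script; raw_exec runs it directly on the host
        command = "/opt/zookeeper/start_zk.sh"
      }
    }
  }
}
```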

Starting the job

==> Monitoring evaluation "c25a1f6f"
Evaluation triggered by job "zookeeper"
Allocation "85cd2cdb" created: node "7dcfd48d", group "zk"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "c25a1f6f" finished with status "complete"

checking the job

Allocations
ID Node ID Task Group Desired Status Created At
85cd2cdb-e46d-63d4-28c7-75c32ebb019b 7dcfd48d-b306-1771-2b91-b334f650e669 zk run pending 06/28/17 14:28:48 CEST

ID = 85cd2cdb-e46d-63d4-28c7-75c32ebb019b
Eval ID = c25a1f6f-c0a7-3d41-2806-1ffc610060ef
Name = zookeeper.zk[0]
Node ID = 7dcfd48d-b306-1771-2b91-b334f650e669
Job ID = zookeeper
Client Status = pending
Client Description =
Desired Status = run
Desired Description =
Created At = 06/28/17 14:28:48 CEST
Evaluated Nodes = 1
Filtered Nodes = 0
Exhausted Nodes = 0
Allocation Time = 32.143µs
Failures = 0

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Unexpected response code: 500 (unknown allocation ID "85cd2cdb-e46d-63d4-28c7-75c32ebb019b")
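
The hint in that last line refers to the client's advertised HTTP address. A minimal sketch of the corresponding agent configuration block, with placeholder addresses not taken from the report:

```hcl
# Agent config sketch: advertise addresses that other nodes and the CLI
# can actually reach (relevant when server and client run together).
advertise {
  http = "10.0.0.5:4646" # placeholder address reachable from the CLI
  rpc  = "10.0.0.5:4647"
  serf = "10.0.0.5:4648"
}
```

Here, though, the 500 reports an unknown allocation ID, which is consistent with the missing allocation folder rather than a misconfigured advertise address.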

Nomad Server logs (if appropriate)

2017/06/28 14:28:48.191792 [DEBUG] http: Request /v1/node/7dcfd48d-b306-1771-2b91-b334f650e669/allocations (5.285766ms)
2017/06/28 14:28:48.218902 [DEBUG] http: Request /v1/jobs (184.575µs)
2017/06/28 14:28:48.269848 [DEBUG] http: Request /v1/evaluation/63dd241d-2292-0d91-7bc0-8c144da0b2d3 (97.452µs)
2017/06/28 14:28:48.270547 [DEBUG] http: Request /v1/evaluation/63dd241d-2292-0d91-7bc0-8c144da0b2d3/allocations (74.02µs)
2017/06/28 14:28:48.289463 [DEBUG] http: Request /v1/jobs (4.823423ms)
2017/06/28 14:28:48.289480 [DEBUG] worker: dequeued evaluation c25a1f6f-c0a7-3d41-2806-1ffc610060ef
2017/06/28 14:28:48.289558 [DEBUG] sched: <Eval 'c25a1f6f-c0a7-3d41-2806-1ffc610060ef' JobID: 'zookeeper'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
2017/06/28 14:28:48.291087 [DEBUG] worker: submitted plan for evaluation c25a1f6f-c0a7-3d41-2806-1ffc610060ef
2017/06/28 14:28:48.291106 [DEBUG] sched: <Eval 'c25a1f6f-c0a7-3d41-2806-1ffc610060ef' JobID: 'zookeeper'>: setting status to complete
2017/06/28 14:28:48.292026 [DEBUG] http: Request /v1/evaluation/c25a1f6f-c0a7-3d41-2806-1ffc610060ef (147.784µs)
2017/06/28 14:28:48.292967 [DEBUG] worker: updated evaluation <Eval 'c25a1f6f-c0a7-3d41-2806-1ffc610060ef' JobID: 'zookeeper'>
2017/06/28 14:28:48.292994 [DEBUG] worker: ack for evaluation c25a1f6f-c0a7-3d41-2806-1ffc610060ef
2017/06/28 14:28:48.293086 [DEBUG] http: Request /v1/evaluation/c25a1f6f-c0a7-3d41-2806-1ffc610060ef/allocations (59.616µs)
2017/06/28 14:28:49.295300 [DEBUG] http: Request /v1/evaluation/c25a1f6f-c0a7-3d41-2806-1ffc610060ef (93.644µs)
2017/06/28 14:28:49.296089 [DEBUG] http: Request /v1/evaluation/c25a1f6f-c0a7-3d41-2806-1ffc610060ef/allocations (76.657µs)
2017/06/28 14:28:51.120297 [DEBUG] http: Request /v1/agent/members (161.74µs)
2017/06/28 14:28:51.120904 [DEBUG] http: Request /v1/status/leader?region=global (26.519µs)
2017/06/28 14:28:51.134218 [DEBUG] http: Request /v1/nodes (88.538µs)
2017/06/28 14:28:51.140131 [DEBUG] http: Request /v1/node/7dcfd48d-b306-1771-2b91-b334f650e669/allocations (5.196479ms)
2017/06/28 14:28:51.167762 [DEBUG] http: Request /v1/jobs (288.523µs)

Nomad Client logs (if appropriate)

Job file (if appropriate)

Labels: stage/waiting-reply, type/bug

All 8 comments

We've tested with build 5811, but among the pull requests there's this merge about persistent allocations and new evaluations, which at least to us _sounds related:_ https://github.com/hashicorp/nomad/commit/5d9ca1b6840a259e2629ca5f5167a0767a91e8ec

Currently re-running the tests with build 5839 at revision e9a55d9 to see whether it still occurs.

The problem is not reproducible by stopping and starting jobs with version 0.5.6, at least so far.

@DapperTayra Can you share the client logs?

Can you also share the output of `nomad alloc-status -json <id>` for one of the pending allocs?

Here is the JSON data:
json.txt

We have the server and client running together on the same node.

We stopped the previous task. Its allocation is still available; its executor log shows:
2017/06/29 17:40:14.847440 [DEBUG] executor: launching command /opt/jacktime/scripts/start_zk.sh

The newly started task, however, does not have an allocation folder.

We are seeing the same issue.

  • using 0.6.0-rc1 on Ubuntu 16.04
  • with a service-type job using raw_exec, with 20 task groups of 1 task each
  • hourly restarts/stops/starts work fine on 0.5.6
  • about 4 out of 20 task groups hit this pending-with-no-allocation state every time

Should be fixed by #2852, which will be released in 0.6.0-rc2 soon.
