Nomad: [0.6.0dev c96995b] allocation does not exist, stuck in pending

Created on 28 Jun 2017 · 8 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.6.0-dev (c96995b3f56b787be7356ae3c639b451519b3b09+CHANGES)
This was the latest stable build for Linux on the CI server at the time (https://travis-ci.org/hashicorp/nomad/builds/245519967).

Operating system and Environment details

Ubuntu 16.04.2 LTS

Issue

The allocation folder is not created.
After stopping and starting correctly for a few hours, tasks stay in "pending" and the allocation does not exist. Once this happens, it is reproducible every time.

Reproduction steps

Stop and start a system-type job (zookeeper) using the raw_exec driver, continuously, for a few hours.
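
For reference, a minimal sketch of a job in this shape. Only the system type, the raw_exec driver, and the job/group names are taken from the report; the datacenter and script path are placeholders.

```hcl
# Hypothetical reconstruction of the job being stopped and started.
job "zookeeper" {
  datacenters = ["dc1"] # placeholder
  type        = "system"

  group "zk" {
    task "zookeeper" {
      driver = "raw_exec"

      config {
        # hypothetical start script; raw_exec runs it directly on the host
        command = "/opt/zookeeper/start_zk.sh"
      }
    }
  }
}
```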

Starting the job

==> Monitoring evaluation "c25a1f6f"
Evaluation triggered by job "zookeeper"
Allocation "85cd2cdb" created: node "7dcfd48d", group "zk"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "c25a1f6f" finished with status "complete"

checking the job

Allocations
ID Node ID Task Group Desired Status Created At
85cd2cdb-e46d-63d4-28c7-75c32ebb019b 7dcfd48d-b306-1771-2b91-b334f650e669 zk run pending 06/28/17 14:28:48 CEST

ID = 85cd2cdb-e46d-63d4-28c7-75c32ebb019b
Eval ID = c25a1f6f-c0a7-3d41-2806-1ffc610060ef
Name = zookeeper.zk[0]
Node ID = 7dcfd48d-b306-1771-2b91-b334f650e669
Job ID = zookeeper
Client Status = pending
Client Description =
Desired Status = run
Desired Description =
Created At = 06/28/17 14:28:48 CEST
Evaluated Nodes = 1
Filtered Nodes = 0
Exhausted Nodes = 0
Allocation Time = 32.143µs
Failures = 0

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Unexpected response code: 500 (unknown allocation ID "85cd2cdb-e46d-63d4-28c7-75c32ebb019b")
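
The hint in that last line refers to the client's advertised HTTP address. A minimal sketch of the corresponding agent configuration block, with placeholder addresses not taken from the report:

```hcl
# Agent config sketch: advertise addresses that other nodes and the CLI
# can actually reach (relevant when server and client run together).
advertise {
  http = "10.0.0.5:4646" # placeholder address reachable from the CLI
  rpc  = "10.0.0.5:4647"
  serf = "10.0.0.5:4648"
}
```

Here, though, the 500 reports an unknown allocation ID, which is consistent with the missing allocation folder rather than a misconfigured advertise address.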

Nomad Server logs (if appropriate)

2017/06/28 14:28:48.191792 [DEBUG] http: Request /v1/node/7dcfd48d-b306-1771-2b91-b334f650e669/allocations (5.285766ms)
2017/06/28 14:28:48.218902 [DEBUG] http: Request /v1/jobs (184.575µs)
2017/06/28 14:28:48.269848 [DEBUG] http: Request /v1/evaluation/63dd241d-2292-0d91-7bc0-8c144da0b2d3 (97.452µs)
2017/06/28 14:28:48.270547 [DEBUG] http: Request /v1/evaluation/63dd241d-2292-0d91-7bc0-8c144da0b2d3/allocations (74.02µs)
2017/06/28 14:28:48.289463 [DEBUG] http: Request /v1/jobs (4.823423ms)
2017/06/28 14:28:48.289480 [DEBUG] worker: dequeued evaluation c25a1f6f-c0a7-3d41-2806-1ffc610060ef
2017/06/28 14:28:48.289558 [DEBUG] sched: <Eval 'c25a1f6f-c0a7-3d41-2806-1ffc610060ef' JobID: 'zookeeper'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
2017/06/28 14:28:48.291087 [DEBUG] worker: submitted plan for evaluation c25a1f6f-c0a7-3d41-2806-1ffc610060ef
2017/06/28 14:28:48.291106 [DEBUG] sched: <Eval 'c25a1f6f-c0a7-3d41-2806-1ffc610060ef' JobID: 'zookeeper'>: setting status to complete
2017/06/28 14:28:48.292026 [DEBUG] http: Request /v1/evaluation/c25a1f6f-c0a7-3d41-2806-1ffc610060ef (147.784µs)
2017/06/28 14:28:48.292967 [DEBUG] worker: updated evaluation <Eval 'c25a1f6f-c0a7-3d41-2806-1ffc610060ef' JobID: 'zookeeper'>
2017/06/28 14:28:48.292994 [DEBUG] worker: ack for evaluation c25a1f6f-c0a7-3d41-2806-1ffc610060ef
2017/06/28 14:28:48.293086 [DEBUG] http: Request /v1/evaluation/c25a1f6f-c0a7-3d41-2806-1ffc610060ef/allocations (59.616µs)
2017/06/28 14:28:49.295300 [DEBUG] http: Request /v1/evaluation/c25a1f6f-c0a7-3d41-2806-1ffc610060ef (93.644µs)
2017/06/28 14:28:49.296089 [DEBUG] http: Request /v1/evaluation/c25a1f6f-c0a7-3d41-2806-1ffc610060ef/allocations (76.657µs)
2017/06/28 14:28:51.120297 [DEBUG] http: Request /v1/agent/members (161.74µs)
2017/06/28 14:28:51.120904 [DEBUG] http: Request /v1/status/leader?region=global (26.519µs)
2017/06/28 14:28:51.134218 [DEBUG] http: Request /v1/nodes (88.538µs)
2017/06/28 14:28:51.140131 [DEBUG] http: Request /v1/node/7dcfd48d-b306-1771-2b91-b334f650e669/allocations (5.196479ms)
2017/06/28 14:28:51.167762 [DEBUG] http: Request /v1/jobs (288.523µs)

Nomad Client logs (if appropriate)

Job file (if appropriate)

Labels: stage/waiting-reply, type/bug

All 8 comments

We've tested with build 5811, but among the pull requests there's this merge about persistent allocations and new evaluations, which at least to us _sounds related:_ https://github.com/hashicorp/nomad/commit/5d9ca1b6840a259e2629ca5f5167a0767a91e8ec

Currently re-running the tests with build 5839 at revision e9a55d9 to see whether it still occurs.

The problem is not reproducible by stopping and starting jobs with version 0.5.6, at least so far.

@DapperTayra Can you share the client logs?

Can you also share the output of `nomad alloc-status -json <id>` for one of the pending allocs?

Here is the JSON data:
json.txt

We have the server and client running together on the same node.

We stopped the previous task. Its allocation is still available; its executor log shows:
2017/06/29 17:40:14.847440 [DEBUG] executor: launching command /opt/jacktime/scripts/start_zk.sh

The newly started task, however, does not have an allocation folder.

We are seeing the same issue.

  • using 0.6.0-rc1 on Ubuntu 16.04
  • with a service-type job using raw_exec, with 20 task groups of 1 task each
  • hourly restarts/stops/starts work fine on 0.5.6
  • about 4 out of 20 task groups hit this pending-with-no-allocation state every time

Should be fixed by #2852, which will be released in 0.6.0-rc2 soon.
