Nomad: client.vault: failed to derive vault tokens

Created on 1 Dec 2016  Â·  14Comments  Â·  Source: hashicorp/nomad

Nomad version

Nomad v0.5.0

Operating system and Environment details

Fully operational Vault cluster running 0.6.2.

Issue

A task allocation that uses the Vault integration occasionally fails to start on a client, with the errors client.vault: failed to derive vault tokens: Allocation "…" does not exist and client: failed to derive Vault token for task agent on alloc "…": Allocation "…" does not exist. The task does not get reschedule elsewhere in the cluster.

Reproduction steps

Unsure.

Nomad Server logs (if appropriate)

2016/12/01 15:55:13.688149 [DEBUG] worker: dequeued evaluation 78564c7a-7b75-698b-8625-73c3b6659fa1
2016/12/01 15:55:13.688246 [DEBUG] sched: <Eval '78564c7a-7b75-698b-8625-73c3b6659fa1' JobID: 'gocd-agents'>: allocs: (place 0) (update 0) (migrate 2) (stop 0) (ignore 1) (lost 0)
2016/12/01 15:55:13.707710 [DEBUG] sched: <Eval '78564c7a-7b75-698b-8625-73c3b6659fa1' JobID: 'gocd-agents'>: setting status to complete
2016/12/01 15:55:13.718315 [DEBUG] worker: updated evaluation <Eval '78564c7a-7b75-698b-8625-73c3b6659fa1' JobID: 'gocd-agents'>
2016/12/01 15:55:13.718358 [DEBUG] worker: ack for evaluation 78564c7a-7b75-698b-8625-73c3b6659fa1

Nomad Client logs (if appropriate)

# nomad alloc-status b11047c8
ID = b11047c8
Eval ID = 5f420faf
Name = gocd-agents.rkt-agents[1]
Node ID = bae50522
Job ID = gocd-agents
Client Status = failed
Client Description = <none>
Created At = 12/01/16 15:55:13 UTC
Task "agent" is "dead"
Task Resources
CPU Memory Disk IOPS Addresses
2400 MHz 1.0 GiB 0 B 0
Recent Events:
Time Type Description
12/01/16 15:55:13 UTC Killing Killing task: vault: failed to derive token: Allocation "b11047c8-ad1c-240a-5958-3d046fe69cdb" does not exist
12/01/16 15:55:13 UTC Received Task received by client
2016/12/01 15:55:13.661514 [DEBUG] client: updated allocations at index 1434 (pulled 1) (filtered 0)
2016/12/01 15:55:13.662682 [DEBUG] client: allocs: (added 1) (removed 0) (updated 0) (ignore 0)
2016/12/01 15:55:13.663480 [DEBUG] client: starting task runners for alloc 'b11047c8-ad1c-240a-5958-3d046fe69cdb'
2016/12/01 15:55:13.665753 [DEBUG] client: starting task context for 'agent' (alloc 'b11047c8-ad1c-240a-5958-3d046fe69cdb')
2016/12/01 15:55:13.665928 [DEBUG] client: waiting for Vault token for task agent in alloc "b11047c8-ad1c-240a-5958-3d046fe69cdb"
2016/12/01 15:55:13.666572 [ERR] client.vault: failed to derive vault tokens: Allocation "b11047c8-ad1c-240a-5958-3d046fe69cdb" does not exist
2016/12/01 15:55:13.666586 [ERR] client: failed to derive Vault token for task agent on alloc "b11047c8-ad1c-240a-5958-3d046fe69cdb": Allocation "b11047c8-ad1c-240a-5958-3d046fe69cdb" does not exist
2016/12/01 15:55:13.666596 [DEBUG] client: killing task agent for alloc "b11047c8-ad1c-240a-5958-3d046fe69cdb": vault: failed to derive token: Allocation "b11047c8-ad1c-240a-5958-3d046fe69cdb" does not exist

@jippi @diptanu

themvault typbug

All 14 comments

I got a a gist with same issue here

https://gist.github.com/jippi/8b8a10ae2943f58063926611d920651e#file-gistfile1-txt-L26-L30

Alloc ID b739058d-f5fa-a668-a3a1-965c22f7d45c is the failed
Alloc ID 5e82920d-c8f1-b58c-e658-e8c3374c33a4 is the resubmitted alloc for the exact same job

Amazing what one boolean can do :) Sorry you all ran into this!

@dadgar i saw this in my server logs

nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:25.256117 [DEBUG] http: Request /v1/allocation/85a65983-5365-979f-a753-24954ad6c6d9?index=190&wait=10000ms (10.00699977s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:26 [DEBUG] memberlist: TCP connection from=127.0.0.1:29826
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:27.553697 [DEBUG] http: Request /v1/agent/members?wait=60000ms (380.564µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:29.582595 [DEBUG] http: Request /v1/status/peers (855.497µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:30.576559 [DEBUG] http: Request /v1/status/leader?wait=60000ms (835.412µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:31.957122 [DEBUG] http: Request /v1/jobs?index=177&wait=60000ms (10.485035003s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:31.959884 [DEBUG] http: Request /v1/evaluations?index=178&wait=60000ms (13.585154802s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.070665 [DEBUG] http: Request /v1/allocations?index=191&wait=60000ms (215.785663ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.084001 [DEBUG] http: Request /v1/evaluations?index=193&wait=60000ms (122.137931ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.132994 [DEBUG] http: Request /v1/allocations?index=194&wait=60000ms (59.04641ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.211701 [DEBUG] http: Request /v1/allocation/85a65983-5365-979f-a753-24954ad6c6d9?index=1&wait=60000ms (1.294446ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.308436 [DEBUG] worker: dequeued evaluation 16d38c9f-afcf-65ef-fc48-5deb1673f18e
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.308691 [DEBUG] http: Request /v1/evaluations?index=195&wait=60000ms (222.080217ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.408943 [DEBUG] sched: <Eval '16d38c9f-afcf-65ef-fc48-5deb1673f18e' JobID: 'insights-worker'>: allocs: (place 9) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.420846 [DEBUG] http: Request /v1/jobs?index=198&wait=60000ms (111.403128ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.420921 [DEBUG] worker: submitted plan for evaluation 16d38c9f-afcf-65ef-fc48-5deb1673f18e
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.421443 [DEBUG] sched: <Eval '16d38c9f-afcf-65ef-fc48-5deb1673f18e' JobID: 'insights-worker'>: setting status to complete
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.422136 [DEBUG] http: Request /v1/allocations?index=197&wait=60000ms (124.062728ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.425240 [DEBUG] worker: updated evaluation <Eval '16d38c9f-afcf-65ef-fc48-5deb1673f18e' JobID: 'insights-worker'>
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.426181 [DEBUG] worker: ack for evaluation 16d38c9f-afcf-65ef-fc48-5deb1673f18e
nomad-stdout---supervisor-wbwFty.log:    2016/12/01 15:32:32.426779 [ERR] nomad.client: DeriveVaultToken failed (recoverable false): Allocation "b739058d-f5fa-a668-a3a1-965c22f7d45c" does not exist
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.480758 [ERR] nomad.node: Vault token creation failed: failed to create token for task "server": Vault client not active
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.480787 [ERR] nomad.node: Vault token revocation failed: Vault client not active
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.533341 [DEBUG] http: Request /v1/allocations?index=200&wait=60000ms (107.325842ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.733223 [DEBUG] http: Request /v1/allocations?index=208&wait=60000ms (44.997123ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:32.842071 [DEBUG] http: Request /v1/allocations?index=209&wait=60000ms (106.692464ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:36 [DEBUG] memberlist: TCP connection from=127.0.0.1:29860
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:37.481785 [ERR] nomad.node: Vault token creation failed: failed to create token for task "server": Vault client not active
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:37.482272 [ERR] nomad.node: Vault token revocation failed: Vault client not active
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:37.567981 [DEBUG] http: Request /v1/status/leader?wait=60000ms (720.629µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:39.587235 [DEBUG] http: Request /v1/status/peers (3.062137ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:40.248259 [DEBUG] http: Request /v1/nodes?index=97&wait=60000ms (1m1.399233777s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:40.579541 [DEBUG] http: Request /v1/agent/members?wait=60000ms (182.329µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:45.369737 [DEBUG] http: Request /v1/nodes?index=97&wait=60000ms (1m1.729887962s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:46 [DEBUG] memberlist: TCP connection from=127.0.0.1:29948
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:49.590262 [DEBUG] http: Request /v1/status/peers (767.4µs)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:51.617006 [DEBUG] http: Request /v1/evaluations?index=201&wait=60000ms (19.185764303s)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:51.760506 [DEBUG] http: Request /v1/evaluations?index=212&wait=60000ms (140.72864ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:51.760506 [DEBUG] http: Request /v1/evaluations?index=212&wait=60000ms (141.04109ms)
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:51 [DEBUG] memberlist: Initiating push/pull sync with: 195.154.35.151:4648
nomad-stdout---supervisor-wbwFty.log-    2016/12/01 15:32:55.477508 [DEBUG] http: Request /v1/allocations?index=219&wait=60000ms (181.799344ms)

@dadgar emphasis on Vault client not active

Yeah the PR I just made will address that

Subtle bug!

@dadgar good job!

To boolean, or ! to boolean, that is the question.

@dadgar just received this on 0.7.0
logs from master or client can help?

@burdandrei I would grab both

Jan 23 17:54:53 ip-10-1-1-197 nomad[6241]: 2018-01-23T17:54:53.052Z [DEBUG] plugin.nomad: plugin address: timestamp=2018-01-23T17:54:53.052Z address=/tmp/plugin053837145 network=unix
Jan 23 17:54:53 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:54:53.243216 [INFO] driver.docker: created container 765b1b7fefa45b801f62f749a383cffba1a814152f39b7f2271343939e3d9af2
Jan 23 17:54:53 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:54:53.349977 [INFO] driver.docker: started container 765b1b7fefa45b801f62f749a383cffba1a814152f39b7f2271343939e3d9af2
Jan 23 17:54:53 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:54:53.361809 [WARN] client: error fetching stats of task telegraf: stats collection hasn't started yet
Jan 23 17:55:51 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:55:51.571444 [ERR] client.vault: failed to derive vault tokens: failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 23 17:55:51 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:55:51.571469 [ERR] client.vault: failed to derive token for allocation "df8f5fdc-e431-3e85-5e5f-fce74491ac7a" and tasks [unicorn]: failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 23 17:55:51 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:55:51.571477 [ERR] client: failed to derive Vault token for task unicorn on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers); retrying in 5s
Jan 23 17:56:56 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:56:56.573117 [ERR] client.vault: failed to derive vault tokens: failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 23 17:56:56 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:56:56.573145 [ERR] client.vault: failed to derive token for allocation "df8f5fdc-e431-3e85-5e5f-fce74491ac7a" and tasks [unicorn]: failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jan 23 17:56:56 ip-10-1-1-197 nomad[6241]:     2018/01/23 17:56:56.573154 [ERR] client: failed to derive Vault token for task unicorn on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": failed to create token for task "unicorn" on alloc "df8f5fdc-e431-3e85-5e5f-fce74491ac7a": Post https://vault.yotpo.com:8200/v1/auth/token/create/nomad-cluster: net/http: request canceled (Client.Timeout exceeded while awaiting headers); retrying in 20s

this is from client, same on server.

minute before it changed to

Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34 [INFO] (runner) initiating run
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34.938211 [ERR] client.vault: failed to derive vault tokens: failed to create token for task "unicorn" on alloc "403b5afa-3304-8673-1044-5a37d1723f5a": Connection to Vault failed: 2 error(s) occurred:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: POST https://vault.yotpo.com:8200/v1/sys/capabilities-self
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * failed to lookup role "nomad-cluster": Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: GET https://vault.yotpo.com:8200/v1/auth/token/roles/nomad-cluster
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34.938231 [ERR] client.vault: failed to derive token for allocation "403b5afa-3304-8673-1044-5a37d1723f5a" and tasks [unicorn]: failed to create token for task "unicorn" on alloc "403b5afa-3304-8673-1044-5a37d1723f5a": Connection to Vault failed: 2 error(s) occurred:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: POST https://vault.yotpo.com:8200/v1/sys/capabilities-self
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * failed to lookup role "nomad-cluster": Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: GET https://vault.yotpo.com:8200/v1/auth/token/roles/nomad-cluster
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34.938238 [ERR] client: failed to derive Vault token for task unicorn on alloc "403b5afa-3304-8673-1044-5a37d1723f5a": failed to create token for task "unicorn" on alloc "403b5afa-3304-8673-1044-5a37d1723f5a": Connection to Vault failed: 2 error(s) occurred:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: POST https://vault.yotpo.com:8200/v1/sys/capabilities-self
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * failed to lookup role "nomad-cluster": Error making API request.
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: URL: GET https://vault.yotpo.com:8200/v1/auth/token/roles/nomad-cluster
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: Code: 503. Errors:
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]: * Vault is sealed
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34 [INFO] (runner) stopping
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34.945649 [INFO] client: marking allocation 403b5afa-3304-8673-1044-5a37d1723f5a for GC
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/403b5afa-3304-8673-1044-5a37d1723f5a/telegraf/local/telegraf.conf"
Jan 23 18:11:34 ip-10-1-1-197 nomad[17071]:     2018/01/23 18:11:34 [INFO] (runner) received finish

in consul all the cluster is grean and is showing that it's unsealed!
Vault 0.8.3

@dadgar you can catch me on Vault gitter now, before i started to restart vaults =)

@burdandrei Either send your servers a SIGHUP or restart them. The Vault server was sealed and the Nomad server has given up contacting the Vault. I filled https://github.com/hashicorp/nomad/issues/3786 to improve this behavior.

@dadgar it's working now.
I'll check the logs, but during all the accident all 5 vaults were unsealed.
i restarted the leader just in case, new leader was re-elected and all worked great.
Nomad on the other side i think was trying to contact vault with not renewed token.
during the accident I talked to @jippi and he says that they are still seeing the same problem from time to time. i'll try to gather vault audit logs tomorrow and see what we got there.
thanks for super fast responses!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ashald picture ashald  Â·  3Comments

jippi picture jippi  Â·  3Comments

clinta picture clinta  Â·  3Comments

joliver picture joliver  Â·  3Comments

bdclark picture bdclark  Â·  3Comments