v0.12.5
Linux
The Consul token is not auto-renewed by Nomad when dynamic secrets are used in a Nomad job template with the Consul secrets engine.
Followed the steps specified in https://learn.hashicorp.com/tutorials/nomad/vault-postgres:
1) Enable consul secrets engine
2) Add role and policy
3) Configure nomad job template stanza to retrieve consul token
The default TTL for this token is 30 days, after which the lease expires and Nomad doesn't auto-renew the token.
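For reference, steps 1 and 2 look roughly like the following (a sketch based on the linked tutorial; the role and policy names are placeholders, and the role parameter names have shifted slightly across Vault versions):

```shell
# enable the consul secrets engine and point it at the Consul cluster
vault secrets enable consul
vault write consul/config/access address=127.0.0.1:8500 token=<consul management token>

# create a Vault role that maps to an existing Consul ACL policy
vault write consul/roles/service-fabio policies=fabio-policy
```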
template {
  data = <<EOH
registry.consul.token = {{ .Data.token }}
{{ end }}
registry.consul.kvpath = /path-to-config
registry.consul.noroutehtmlpath = /fabio/noroute.html
proxy.addr =
ui.addr =
registry.consul.register.addr =
metrics.target =
metrics.statsd.addr =
metrics.prefix =
metrics.names =
EOH
  destination = "secrets/fabio.properties"
}
agent.http: Request error: method=GET url=/v1/kv/fabio/noroute.html?consistent=&index=1143000&recurse= from=127.0.0.1:60596 error="rpc error making call
@gulavanir - I was able to get this to work by setting an aggressive renewal time.
Basically I configured Vault's Consul secrets engine to NOT use the default of 32 days by running:
vault secrets tune -default-lease-ttl=30s -max-lease-ttl=1m consul/
I was able to see Nomad/Consul-template refreshing the token every few seconds (given the aggressive expiry I set).
With the above settings I could see Nomad refresh the token, and when it reached the max-lease-ttl value it revoked the old token and issued a new one (which then caused Nomad to re-render the template and start the container).
I am wondering if what you see has to do with a watcher failing (since the renew time is so long by default) and it never recovers?
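If it helps while reproducing, a quick way to confirm the shortened lease is actually in effect (assuming the role is named `service-fabio`, as in the logs further down):

```shell
# mint a token from the consul secrets engine and check its lease
vault read consul/creds/service-fabio
# lease_duration in the output should now report 30s instead of the 768h default
```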
@fredwangwang was able to reproduce this with some pretty catastrophic results. I've asked him to update this thread, but here's his comment on the Hashicorp Discuss forums https://discuss.hashicorp.com/t/nomad-job-doesnt-renew-consul-token-generated-using-nomad-job-templates-and-consul-secrets-engine/18316/9
@tgross - this is a nasty one
tldr: state loss during a Nomad restart causes a token lease leak in the template.
link to discussion for context: https://discuss.hashicorp.com/t/nomad-job-doesnt-renew-consul-token-generated-using-nomad-job-templates-and-consul-secrets-engine/18316/9
The issue is that during a Nomad restart, Nomad loses its state. The new Nomad client will:
However, at this point the configuration (and hence the token) held by the process running inside the container is still the old one. Once the old token lease expires, the process starts to fail.
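One way to watch the leak happen (assuming the same `service-fabio` role): list the outstanding leases before and after restarting the Nomad client, and a new lease shows up even though the container is still holding the token from the old, soon-to-expire lease.

```shell
# lease IDs currently outstanding for the role
vault list sys/leases/lookup/consul/creds/service-fabio
```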
Logs (after nomad restart):
2020/12/03 05:30:24.733069 [INFO] (runner) creating watcher ### nomad creates a new watcher, unaware of the previous state
2020/12/03 05:30:24.733194 [INFO] (runner) starting
2020/12/03 05:30:24.733247 [DEBUG] (runner) running initial templates
2020/12/03 05:30:24.733311 [DEBUG] (runner) initiating run ### nomad renders the new template as if this were a fresh job start, but the container was already running
2020/12/03 05:30:24.733362 [DEBUG] (runner) checking template 8bac18e0a2e861cfa119af4401831db6
2020/12/03 05:30:24.733883 [DEBUG] (runner) missing data for 1 dependencies
2020/12/03 05:30:24.733984 [DEBUG] (runner) missing dependency: vault.read(consul/creds/service-fabio)
2020/12/03 05:30:24.734036 [DEBUG] (runner) add used dependency vault.read(consul/creds/service-fabio) to missing since isLeader but do not have a watcher
2020/12/03 05:30:24.734103 [DEBUG] (runner) was not watching 1 dependencies
2020/12/03 05:30:24.734174 [DEBUG] (watcher) adding vault.read(consul/creds/service-fabio)
2020/12/03 05:30:24.734257 [TRACE] (watcher) vault.read(consul/creds/service-fabio) starting
I _think_ this is hitting a case that a long-open PR https://github.com/hashicorp/nomad/issues/4226 is attempting to fix, but for which we didn't have enough context to understand what was going on. I'm not sure why that hasn't had further review yet, but we should make sure it gets onto our radar.
Also folks, please let's avoid editorializing in issue titles.
Ahh - thanks @tgross - good point on the title, and noted.
Seems like the long-open PR is related. We can reliably replicate it at this point, so our team can provide any details you need.
@tgross was thinking something similar to this https://github.com/hashicorp/nomad/issues/4226#issuecomment-422600477 last night. It _works_; however, it will trigger a coordinated restart of all the jobs on that particular node whose templates get re-rendered every time Nomad restarts.
That's fine for use cases where only a small number of jobs have dynamic credentials, but it would be a significant impact on the system if almost all jobs running on that node restart at the same time.
So it _is_ a quick patch for this urgent problem, but it feels to me like Nomad should persist state across restarts to properly correct this issue.
Some "thinking out loud" after some initial investigation:
- Nomad persists the task's Vault token in vault_hook.go#L129-L143, which gets used during the restoration of the task runner on client restart.
- RenewToken should be getting a renewal for that token. I haven't yet identified if that's failing or what.

Wanted to leave a status update on the investigation so far. There are a few layers to this problem as presented that make it a little harder to understand:
When Nomad is persisting a token for the task runner in the vault_hook.go#L129-L143, that's _Nomad's_ token. The secrets for anything we render to a template have no saved state and that's the TTL that's expiring between restarts. This is effectively the same thing that would happen if we ran consul-template in watch mode from the command line, stopped it, and then ran it again. Even if we had a root Vault token here, if that were the problem it wouldn't help.
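To make that concrete, here's a minimal standalone consul-template config along those lines (sketch only; the role path and file names are just examples):

```hcl
# repro.hcl -- run with: consul-template -config=repro.hcl
# (VAULT_TOKEN is taken from the environment)
vault {
  address = "http://127.0.0.1:8200"
}

template {
  contents    = "{{ with secret \"consul/creds/service-fabio\" }}{{ .Data.token }}{{ end }}"
  destination = "out/consul-token.txt"
}
```

Stopping consul-template and starting it again after the lease's TTL has elapsed leaves whatever consumed the first token holding an expired credential, which is the same situation the Nomad client restart creates.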
That also means we should be able to reproduce this with a simpler setup for any dynamic secret, like an x509 cert. Fortunately we have an E2E test we can borrow for that.
setup and configuration files
Single-node Vault config:
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = 1
}
storage "raft" {
path = "/var/vault/raft"
node_id = "raft_node_1"
}
api_addr = "http://127.0.0.1:8200"
cluster_addr = "https://127.0.0.1:8201"
ui = true
Bootstrap Vault with Nomad policy:
export VAULT_ADDR="http://127.0.0.1:8200"
vault operator init -key-shares=1 -key-threshold=1
vault operator unseal
export VAULT_TOKEN=<redacted>
vault policy write nomad-server ./e2e/terraform/scripts/vault-nomad-server-policy.hcl
vault write /auth/token/roles/nomad-cluster @./e2e/terraform/scripts/vault-nomad-cluster-role.json
vault token create -policy nomad-server -period 72h -orphan
# get the resulting token for our Nomad config
Single-node Nomad config:
name = "standalone"
log_level = "DEBUG"
enable_debug = true
data_dir = "/var/nomad"
server {
  enabled          = true
  bootstrap_expect = 1
}
client {
  enabled = true
}
plugin "docker" {
  config {
    volumes {
      enabled = true
    }
  }
}
vault {
  enabled          = true
  address          = "http://localhost:8200"
  create_from_role = "nomad-cluster"
  token            = "<redacted>"
}
Configure PKI engine
pki-policy.hcl file:
path "pki-test9491/issue/nomad" {
capabilities = ["create", "update", "read"]
}
vault secrets enable -path=pki-test9491 pki
vault write pki-test9491/root/generate/internal common_name=service.consul ttl=168h
vault write pki-test9491/roles/nomad \
    allowed_domains=service.consul \
    allow_subdomains=true \
    generate_lease=true \
    max_ttl=1m
vault secrets tune -max-lease-ttl=1m pki-test9491
vault policy write pki-test9491 ./pki-policy.hcl
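As a quick sanity check (not part of the original steps), we can issue a certificate by hand and confirm the short lease before wiring it into the jobspec:

```shell
vault write pki-test9491/issue/nomad \
    common_name=nomad.service.consul ip_sans=127.0.0.1
# the response includes the certificate and private key, with a lease_duration of ~60s
```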
jobspec
job "example" {
datacenters = ["dc1"]
group "group" {
task "task" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "8001", "-h", "/www"]
volumes = ["/tmp/www:/www:ro"]
}
vault {
policies = ["pki-test9491"]
}
template {
data = <<EOT
{{ with secret "pki-test9491/issue/nomad" "common_name=nomad.service.consul" "ip_sans=127.0.0.1" }}
{{- .Data.certificate -}}
{{ end }}
EOT
destination = "${NOMAD_SECRETS_DIR}/certificate.crt"
change_mode = "noop"
}
resources {
cpu = 128
memory = 64
}
}
}
}
Note in our jobspec that the change_mode is "noop". Let's verify the normal secret renewal behavior works as we'd expect:
$ nomad job run ./example.nomad
==> Monitoring evaluation "8ff50f77"
Evaluation triggered by job "example"
==> Monitoring evaluation "8ff50f77"
Evaluation within deployment: "0aee964a"
Allocation "d9b1a5c8" created: node "53d42465", group "group"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "8ff50f77" finished with status "complete"
$ nomad alloc status d9b
...
Recent Events:
Time Type Description
2020-12-09T19:19:58Z Started Task started by client
2020-12-09T19:19:57Z Task Setup Building Task Directory
2020-12-09T19:19:57Z Received Task received by client
# fingerprint the cert
$ nomad alloc exec d9b cat /secrets/certificate.crt | md5sum
2e5c01b4bf659e298664a88bf3798663 -
# wait until the TTL expires (plus some room), and verify the cert has changed
$ sleep 90 && nomad alloc exec d9b cat /secrets/certificate.crt | md5sum
7cae1b2510b24b02a0194ca6a231239e -
Now let's test restarting Nomad and checking that the certificate is renewed when we expect:
test1.sh
#!/bin/bash
alloc_id=$1
now=$(date +%s)
echo "now: $now"
expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
openssl x509 -noout -enddate |
cut -d= -f2 |
xargs -I^ date -d '^' +%s)
echo "expires at: $expiry"
echo "restarting Nomad..."
sudo systemctl restart nomad
now=$(date +%s)
sleep_for=$(( $expiry - $now ))
echo "waiting ${sleep_for}s for cert to expire..."
sleep $sleep_for
new_expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
openssl x509 -noout -enddate |
cut -d= -f2 |
xargs -I^ date -d '^' +%s)
echo "new expiry: $new_expiry"
if [[ "$new_expiry" != "$expiry" ]]; then
echo "certificate was renewed!"
fi
Results look good:
$ ./test1.sh d9b
now: 1607544427
expires at: 1607544475
restarting Nomad...
waiting 48s for cert to expire...
new expiry: 1607544493
certificate was renewed!
Now we'll test stopping Nomad, waiting for expiry, and then restarting Nomad. In this test we wait to stop Nomad so that we're not risking the allocation being marked as lost and confounding the problem.
test2.sh
#!/bin/bash
alloc_id=$1
start=$(date +%s)
echo "now: $start"
expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
openssl x509 -noout -enddate |
cut -d= -f2 |
xargs -I^ date -d '^' +%s)
echo "expires at: $expiry"
sleep_for=$(( $expiry - $start ))
if [[ "$sleep_for" -gt "30" ]]; then
echo "waiting 30s to stop nomad..."
sleep 30
else
echo "only ${sleep_for}s left till expiry, stopping now..."
fi
echo "stopping Nomad..."
sudo systemctl stop nomad
start=$(date +%s)
sleep_for=$(( $expiry - $start ))
if [[ "$sleep_for" -gt "0" ]]; then
echo "waiting ${sleep_for}s for cert to expire, +5s..."
sleep $sleep_for
sleep 5
fi
echo "restarting nomad..."
sudo systemctl start nomad
# wait for nomad to be live again before querying it
while :
do
nomad status 2> /dev/null && break
sleep 1
done
new_expiry=$(nomad alloc exec $alloc_id cat /secrets/certificate.crt |
openssl x509 -noout -enddate |
cut -d= -f2 |
xargs -I^ date -d '^' +%s)
echo "new expiry: $new_expiry"
if [[ "$new_expiry" != "$expiry" ]]; then
echo "certificate was renewed!"
else
echo "certificate was not renewed!"
fi
The results look good there too!
$ ./test2.sh d9b
now: 1607546652
expires at: 1607546709
waiting 30s to stop nomad...
stopping Nomad...
waiting 26s for cert to expire, +5s...
restarting nomad...
ID Type Priority Status Submit Date
example service 50 running 2020-12-09T20:41:46Z
new expiry: 1607546776
certificate was renewed
Logs, for posterity.
logs
2020/12/09 20:45:16.365188 [INFO] (runner) creating watcher
2020/12/09 20:45:16.365289 [INFO] (runner) starting
2020/12/09 20:45:16.365296 [DEBUG] (runner) running initial templates
2020/12/09 20:45:16.365299 [DEBUG] (runner) initiating run
2020/12/09 20:45:16.365302 [DEBUG] (runner) checking template 9b5c5086546eb25832bf8367ee8a95da
2020/12/09 20:45:16.365560 [DEBUG] (runner) missing data for 1 dependencies
2020/12/09 20:45:16.365571 [DEBUG] (runner) missing dependency: vault.write(pki-test9491/issue/nomad -> 6d0de468)
2020/12/09 20:45:16.365634 [DEBUG] (runner) add used dependency vault.write(pki-test9491/issue/nomad -> 6d0de468) to missing since isLeader but do not have a wat
2020/12/09 20:45:16.365670 [DEBUG] (runner) was not watching 1 dependencies
2020/12/09 20:45:16.365676 [DEBUG] (watcher) adding vault.write(pki-test9491/issue/nomad -> 6d0de468)
2020/12/09 20:45:16.365687 [DEBUG] (runner) diffing and updating dependencies
2020/12/09 20:45:16.365691 [DEBUG] (runner) watching 1 dependencies
2020/12/09 20:45:16.473485 [DEBUG] (runner) receiving dependency vault.write(pki-test9491/issue/nomad -> 6d0de468)
2020/12/09 20:45:16.473554 [DEBUG] (runner) initiating run
2020/12/09 20:45:16.473562 [DEBUG] (runner) checking template 9b5c5086546eb25832bf8367ee8a95da
2020/12/09 20:45:16.474122 [DEBUG] (runner) rendering "(dynamic)" => "/var/nomad/alloc/d9b1a5c8-741d-228a-3b8f-c81118041b72/task/secrets/certificate.crt"
2020/12/09 20:45:16.474241 [INFO] (runner) rendered "(dynamic)" => "/var/nomad/alloc/d9b1a5c8-741d-228a-3b8f-c81118041b72/task/secrets/certificate.crt"
2020/12/09 20:45:16.474249 [DEBUG] (runner) diffing and updating dependencies
2020/12/09 20:45:16.474256 [DEBUG] (runner) vault.write(pki-test9491/issue/nomad -> 6d0de468) is still needed
2020/12/09 20:45:16.474262 [DEBUG] (runner) watching 1 dependencies
2020/12/09 20:45:16.474267 [DEBUG] (runner) all templates rendered
So we see the secret being renewed as expected. But what about the allocation itself? Let's change out the change_mode to restart:
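For clarity, that's the same template block from the jobspec above with only the change_mode line changed:

```hcl
template {
  data = <<EOT
{{ with secret "pki-test9491/issue/nomad" "common_name=nomad.service.consul" "ip_sans=127.0.0.1" }}
{{- .Data.certificate -}}
{{ end }}
EOT
  destination = "${NOMAD_SECRETS_DIR}/certificate.crt"
  change_mode = "restart"
}
```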
$ ./test2.sh eb0
now: 1607547812
expires at: 1607547841
only 29s left till expiry, stopping now...
stopping Nomad...
waiting 28s for cert to expire, +5s...
restarting nomad...
ID Type Priority Status Submit Date
example service 50 running 2020-12-09T21:02:56Z
new expiry: 1607547910
certificate was renewed!
But the task isn't being restarted:
$ nomad alloc status eb0
...
Recent Events:
Time Type Description
2020-12-09T21:03:02Z Started Task started by client
2020-12-09T21:03:01Z Task Setup Building Task Directory
2020-12-09T21:02:56Z Received Task received by client
If we wait a minute for that new cert to expire, we're back in business:
$ nomad alloc status eb0
...
Recent Events:
Time Type Description
2020-12-09T21:05:17Z Started Task started by client
2020-12-09T21:05:14Z Restarting Task restarting in 0s
2020-12-09T21:05:14Z Terminated Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2020-12-09T21:05:09Z Restart Signaled Template with change_mode restart re-rendered
2020-12-09T21:03:02Z Started Task started by client
2020-12-09T21:03:01Z Task Setup Building Task Directory
2020-12-09T21:02:56Z Received Task received by client
So it appears that the problem is not that Nomad is losing access to the secret engine, or that consul-template isn't trying to renew the secret when it's supposed to. It's that when we restart the Nomad client we don't treat a change to the secret as a trigger for a task restart (or signal). We have a new secret, but haven't told the workload that there is one. I feel reasonably certain about this conclusion but don't yet have an answer as to how to fix it. That'll be what I dig into tomorrow morning.
Some diagrams summarizing the problem... mostly for me and so I can share this with colleagues to figure out the best fix.
Here's how template rendering with dynamic secrets is supposed to work:

Here's how it's working when the client is restarted. Note that the template runner is getting the new secret, but that we never trigger the change_mode so at some point in time later when the TTL of the original secret expires, our task will be unhappy:

This is fantastic stuff @tgross - thanks for taking the time to write such a thorough analysis. I need to read it a few times to absorb it all, but on a first pass it makes sense to me, and it clearly lines up with the behavior we have seen.
After doing a bit of smoke testing and printf debugging, it looks like the RenderEvent we get from consul-template after rendering gives us the hooks we'd need here.
// WouldRender determines if the template would have been rendered. A template
// would have been rendered if all the dependencies are satisfied, but may
// not have actually rendered if the file was already present or if an error
// occurred when trying to write the file.
WouldRender bool
// DidRender determines if the Template was actually written to disk. In dry
// mode, this will always be false, since templates are not written to disk
// in dry mode. A template is only rendered to disk if all dependencies are
// satisfied and the template is not already in place with the same contents.
DidRender bool
The first time a template is rendered, these fields will both be true. After a client restart, if the secret hasn't changed DidRender will be false. But because a dynamic secret is always changed (because it's a write and not a read), DidRender will be true after a client restart if the task has a dynamic secret. So that can give the task runner the information we need to know that we need to trigger the change_mode of the task, even if we're in the prestart. That does mean that tasks with dynamic secrets would always be restarted if the client restarts. That's not ideal but it at least prevents a situation where the secret becomes suddenly stale and causes application-level issues.
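Roughly the shape of the check that implies, as a sketch (the function and package names here are hypothetical, not Nomad's actual code):

```go
package templatehook

import "github.com/hashicorp/consul-template/manager"

// needsChangeMode is a sketch, not Nomad's actual code: given the template
// runner's render events and whether the task was already running before this
// client restart, decide whether the template's change_mode should fire.
func needsChangeMode(runner *manager.Runner, taskAlreadyRunning bool) bool {
	for _, event := range runner.RenderEvents() {
		if !event.WouldRender {
			// dependencies not satisfied yet, nothing to act on
			continue
		}
		// On the "first" render after a client restart, DidRender stays false
		// when the file contents are unchanged, but is true for a dynamic
		// secret because the vault.write produced a fresh secret.
		if event.DidRender && taskAlreadyRunning {
			return true
		}
	}
	return false
}
```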
"That does mean that tasks with dynamic secrets would always be restarted if the client restarts."
FWIW that's exactly the hack we are putting in place. We're using a consul-template task in jobs with secrets to watch a KV entry for the node (which we change every time a Nomad agent restarts) and then using that to restart the container. It's effectively the same thing, but it impacts all our jobs since each one needs to implement this hack. I can say for sure that we would rather have Nomad restart our containers with secrets than having to handle that ourselves in our jobs :)
Hi @tgross, thanks for this great analysis! Regarding the potential solution in this comment ("the RenderEvent that we get from consul-template after the rendering seems to give us the hooks we'd need here"), it seems like that's pretty much what https://github.com/hashicorp/nomad/issues/4226#issuecomment-422600477 has implemented, is that correct?
Yes. But that patch is against an older version of Nomad that has since had a good bit of changes. #6324 from the same author is more recent (although it doesn't merge cleanly with the current HEAD). There are some problems in the details of that approach:
- It duplicates the logic in handleTemplateRerenders but with some small tweaks, which is likely to cause a maintenance burden as the implementations drift apart.
- The point where the restore state is saved is not correct (because prestart hooks can fail).
- In that implementation all tasks with templates will get restarted on client restart, which we don't need to do.
So rather than trying to inject state into the hooks that bridges restarts, I'm working up a patch that detects whether there's an existing driver handle present (which covers a wider range of edge cases). Then I'm factoring out the on-change handling so that both handleFirstRender and handleTemplateRerenders can use it.
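A rough illustration of what that driver-handle check could look like (hypothetical names only; the real change lives in the working branch linked below):

```go
package templatehook

// Sketch with hypothetical names, not the actual patch: on the first render
// after a task-runner restore, only fire the template's change_mode when the
// task already has a live driver handle (i.e. the workload survived the
// client restart).
type firstRenderSketch struct {
	// applyChangeMode is the factored-out on-change handling, shared with
	// the re-render path (restart or signal the task per its change_mode).
	applyChangeMode func()
}

func (h *firstRenderSketch) onFirstRender(driverHandleExists bool) {
	if !driverHandleExists {
		// Fresh task start: the task launches with the newly rendered secret,
		// so there is nothing to restart or signal.
		return
	}
	// The workload kept running across the client restart, but the runner has
	// just produced a new dynamic secret the task doesn't know about yet.
	h.applyChangeMode()
}
```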
💯 agree, thanks for the update!
My working branch for the patch is https://github.com/hashicorp/nomad/compare/b-template-first-render. I've only smoke-tested this locally and haven't written automated testing to explore the edge cases (so definitely don't run this build anywhere you care about! 😀 ). Hoping to wrap that up early next week.
That's great. Thanks @tgross for a detailed write up and quick turnaround.
That branch works fine in automated unit testing, and when I ran it through an end-to-end test the change signal was triggered as we wanted. This works perfectly with change_mode = "signal".
But with change_mode = "restart" the task restart is causing a problem with other task hooks. So the restart fails and then Nomad will reschedule the allocation. That's not the behavior we want here. I suspect that we didn't really design the prestart hooks to allow for restarts _within_ the prestart process, as opposed to simply exiting and getting unwound by the task runner. The logmon error we're hitting here happens _before_ the template hook, which implies some interesting ordering issues. That'll be the next thing I dig into.
Alloc status for the failed alloc looks like:
Recent Events:
Time Type Description
2020-12-14T19:35:33Z Alloc Unhealthy Unhealthy because of failed task
2020-12-14T19:35:33Z Not Restarting Error was unrecoverable
2020-12-14T19:35:33Z Task hook failed logmon: rpc error: code = Unknown desc = failed to create stdout logfile for "task.stdout": open /var/nomad/alloc/25c75368-4a19-8796-7767-5905dae4e76d/alloc/logs: no such file or directory
2020-12-14T19:35:16Z Restarting Task restarting in 16.491523779s
2020-12-14T19:35:16Z Terminated Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2020-12-14T19:35:12Z Started Task started by client
2020-12-14T19:35:12Z Task Setup Building Task Directory
2020-12-14T19:35:12Z Received Task received by client
2020-12-14T19:34:07Z Started Task started by client
2020-12-14T19:34:06Z Task Setup Building Task Directory
_Edit_: ☹️ I've only hit the above problem the one time and haven't been able to reproduce it. Which may mean it's racy or it may mean that there's a completely unrelated bug in the logmon code.
_Edit 2_: Doing some soak testing of this to make sure but it looks like the logmon thing is unrelated.
"Edit: ☹️ I've only hit the above problem the one time and haven't been able to reproduce it. Which may mean it's racy or it may mean that there's a completely unrelated bug in the logmon code.
Edit 2: Doing some soak testing of this to make sure but it looks like the logmon thing is unrelated."
Better now than later! Appreciate the updates. We are tracking this one really closely
After a long soak test it looks like the logmon error is a little misleading and the problem really is in the task restart, so I've got to rework the solution I've got in #9636. I haven't quite pinned down the problem yet.
(The fun bit of this is that turning up logging to trace changes the timing and therefore the symptoms!)
Ok, have that conclusively figured out now with a whole lot of log grubbing. It's mostly an artifact of the test rig; the PKI cert expires after 1 minute so my test rig has to line up the restart such that it lands before the cert expires. But that timing requirement can cause the client to be shutdown long enough that when it comes back up the server thinks the allocation is lost and tries to reschedule it. (I'd missed the single log line from the server asking for the alloc to stop among the ~30k entries. 🤦 )
There's a kind of gross but mostly unavoidable scenario here: because I'm testing with a single client, the scheduler will always place the allocation on the same host, and when that happens right on top of a client restart, we can be in the middle of trying to start the alloc while the shutdown notice from the server arrives. That results in a task runner hook state that isn't viable, and we're forced to reschedule again. It looks like there's a narrow possibility of running into this any time the client is shut down long enough to be marked lost while the task is actually still running. But the reschedule is everything working as intended from that state.
Fixed in https://github.com/hashicorp/nomad/pull/9636, which will ship in the upcoming Nomad 1.0.2.
This is great stuff @tgross, thanks for all the work on it - we will definitely be upgrading to 1.0.2 and will be using this :)