Nomad: 0.10.4
Consul: 1.7.0
Consul ACLs: Enabled
Rotating a Consul token causes the Nomad agent to be unable to use Consul Connect for any new jobs until you reboot the agent's OS.
agent_prefix "" {
  policy = "write"
}
node_prefix "" {
  policy = "write"
}
service_prefix "" {
  policy = "write"
}
acl = "write"
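For reference, a minimal sketch of how a policy and token like this might be created with the Consul CLI; the policy name nomad-agent and the rules file name are my own placeholders, not from the original report:

# Save the rules above to nomad-agent-policy.hcl, then create the policy
consul acl policy create -name nomad-agent -rules @nomad-agent-policy.hcl

# Mint a token attached to that policy; the SecretID is what goes into the Nomad config
consul acl token create -description "nomad agent token" -policy-name nomad-agent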
{
  "consul": {
    "token": "123456"
  }
  ...
4. Restart the Nomad Agents
service nomad restart
job "countdash" {
datacenters = ["dc1"]
group "api" {
network {
mode = "bridge"
}
service {
name = "count-api"
port = "9001"
connect {
sidecar_service {}
}
}
task "web" {
driver = "docker"
config {
image = "hashicorpnomad/counter-api:v1"
}
}
}
group "dashboard" {
network {
mode ="bridge"
port "http" {
static = 9002
to = 9002
}
}
service {
name = "count-dashboard"
port = "9002"
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "count-api"
local_bind_port = 8080
}
}
}
}
}
task "dashboard" {
driver = "docker"
env {
COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
}
config {
image = "hashicorpnomad/counter-dashboard:v1"
}
}
}
}
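Once the job file is saved (I'm assuming a file name of countdash.nomad), a quick smoke test of the running job might look like this; the curl check is my own addition:

# Register the job and wait for the allocations to become healthy
nomad job run countdash.nomad

# The dashboard should answer on its static port
curl -I http://localhost:9002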
/etc/nomad/config.json
{
  "consul": {
    "token": "9876543"
  }
  ...
service nomad restart
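For completeness, the replacement token would typically be minted with something like the following before editing the config; the policy name nomad-agent is carried over from my earlier sketch:

# Create a new token under the same policy; its SecretID replaces the old value in /etc/nomad/config.json
consul acl token create -description "nomad agent token (rotated)" -policy-name nomad-agent

# Optionally revoke the old token by its AccessorID
consul acl token delete -id <old-accessor-id>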
You will find that the job starts successfully; however, you will be unable to connect to the dashboard. Any future job that requires Consul Connect will also fail on that agent.
Rotating a token should not require draining or rebooting a Nomad agent.
When a token changes, the Nomad agent enters an unusable state that requires a reboot to fix.
After rotating a Consul token, do a rolling reboot of the entire Nomad agent cluster.
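A rough sketch of what that rolling reboot looks like on each client node in turn (drain first so allocations reschedule; this outline is my own, not from the original report):

# Drain the local client so workloads move to other nodes
nomad node drain -self -enable -yes

# Reboot the machine, then re-enable scheduling once it is back up
sudo reboot
nomad node drain -self -disable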
The following attempts to revive the now-broken Nomad agent are unsuccessful:
service nomad restart
service consul restart
service docker restart
iptables -F CNI-FORWARD
The only way I've been able to recover the Nomad agent is to physically reboot the machine.
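For anyone landing in the same state, a couple of checks that may be worth capturing before the reboot (these are my suggestions, not part of the original report):

# Look for leaked network namespaces left behind by Connect allocations
ip netns list

# Inspect the CNI iptables chain that the flush above targeted
sudo iptables -nvL CNI-FORWARD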
I don't have a quick method to reproduce locally, but I have recorded a video of me reproducing it. I have reproduced it twice now in my environment.
https://youtu.be/OrVhA-gh4nM (Recommend watching at 4k)
I will attempt to reproduce again and capture the logs. Here are a couple of interesting log entries that happen at about the same time:
Mar 28 21:54:06 sb-sand-nomadagent1 nomad[13102]: 2020-03-28T21:54:06.667Z [INFO] client.gc: marking allocation for GC: alloc_id=8e258d0e-b5af-8cdc-1272-8cad2e84dc36
Mar 28 21:54:06 sb-sand-nomadagent1 nomad[13102]: 2020-03-28T21:54:06.674Z [ERROR] client.alloc_runner.runner_hook: failed to cleanup network for allocation, resources may have leaked: alloc_id=8e258d0e-b5af-8cdc-1272-8cad2e84dc36 alloc=8e258d0e-b5af-8cdc-1272-8cad2e84dc36 error="cni plugin not initialized"
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]: 2020-03-28T21:56:22.978Z [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api error="exit status 1" stderr="==> Failed looking up sidecar proxy info for _nomad-task-fd23333b-4d04-163c-c223-ad8c7a9b1eb4-group-api-count-api-9001: Unexpected response code: 403 (ACL not found)
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]: "
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]: 2020-03-28T21:56:22.979Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]: 2020-03-28T21:56:22.979Z [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api reason="Restart within policy" delay=15.298581092s
Mar 28 21:56:24 sb-sand-nomadagent1 nomad[13481]: 2020-03-28T21:56:24.078Z [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=1f1cc796-98b6-8d01-ccfa-3052ede8df49 task=connect-proxy-count-dashboard error="exit status 1" stderr="==> Failed looking up sidecar proxy info for _nomad-task-1f1cc796-98b6-8d01-ccfa-3052ede8df49-group-dashboard-count-dashboard-9002: Unexpected response code: 403 (ACL not found)
Mar 28 21:56:24 sb-sand-nomadagent1 nomad[13481]: "
* Update 1 *
The biggest workaround for this issue is to avoid putting tokens in /etc/nomad/config.json. It is better to create a dedicated policy and attach the token to that policy.
Thank you for taking the time to report this, @spuder .
I think I have been able to reproduce the underlying bad behavior, which is that sometimes after restarting the Nomad server, something causes it to become unable to manage the network namespaces necessary for Connect. Quite possibly related to #7536 (again, thanks!).
I'm working on a minimal reproduction to help track down the problem.
We've narrowed this down to a problem with our use of the go-cni plugin. I've updated the title to better reflect what's happening. Reproduction is as simple as the steps below (a consolidated shell sketch follows the list):
1) run Nomad with a usable Connect configuration (no ACLs required) (e.g. sudo nomad agent -dev-connect)
2) run a connect job that makes use of static port forwarding (e.g. nomad job init -connect -short && nomad job run example.nomad)
3) restart nomad agent
4) stop the job (first cni plugin error messages appear)
5) start the job (static port mapping no longer works)
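My rough consolidation of those steps into a shell session; the two-terminal split and the job name "example" (the default produced by nomad job init) are assumptions:

# Terminal 1: dev agent with a working Connect setup
sudo nomad agent -dev-connect

# Terminal 2: run a Connect job that uses static port forwarding
nomad job init -short -connect
nomad job run example.nomad

# Restart the agent (Ctrl-C the dev agent and start it again), then:
nomad job stop example        # first "cni plugin not initialized" errors appear
nomad job run example.nomad   # static port mapping no longer works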
Hey @spuder thanks again for the detailed bug report. We believe we've fixed this in the release candidate for 0.11 that was just announced. https://releases.hashicorp.com/nomad/0.11.0-rc1/