Nomad: Restart of Nomad Client causes port forwarding issues upon restart of running Connect jobs

Created on 29 Mar 2020 · 3 comments · Source: hashicorp/nomad

Nomad version

Nomad: 0.10.4
Consul: 1.7.0
Consul ACLs: Enabled

Issue

Rotating a Consul token causes the Nomad agent to be unable to use Consul Connect for any new jobs until the agent's OS is rebooted.

Reproduction steps

  1. Set up Consul Connect on a cluster with ACLs enabled.
  2. Create the following policy and apply it to the Nomad servers (a CLI sketch for creating the policy and token follows the policy block):
agent_prefix "" {
    policy = "write"
}

node_prefix "" {
    policy = "write"
}

service_prefix "" {
    policy = "write"
}

acl = "write"

  3. Create a token with the policy. Save the token to the Nomad agents in /etc/nomad/config.json.
    (I've since learned that this is not best practice. I'm leaving it here for consistency.)
{
  "consul": {
    "token": "123456"
  }
...

  4. Restart the Nomad agents:

service nomad restart
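
One way to sanity-check the agent after the restart is to confirm that Nomad has re-registered its services in Consul (assuming the default service names):

consul catalog services | grep nomad
# expect to see "nomad" and "nomad-client" listed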
  5. Start the countdash job as shown in the Nomad documentation:
job "countdash" {
   datacenters = ["dc1"]
   group "api" {
     network {
       mode = "bridge"
     }

     service {
       name = "count-api"
       port = "9001"

       connect {
         sidecar_service {}
       }
     }

     task "web" {
       driver = "docker"
       config {
         image = "hashicorpnomad/counter-api:v1"
       }
     }
   }

   group "dashboard" {
     network {
       mode = "bridge"
       port "http" {
         static = 9002
         to     = 9002
       }
     }

     service {
       name = "count-dashboard"
       port = "9002"

       connect {
         sidecar_service {
           proxy {
             upstreams {
               destination_name = "count-api"
               local_bind_port = 8080
             }
           }
         }
       }
     }

     task "dashboard" {
       driver = "docker"
       env {
         COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
       }
       config {
         image = "hashicorpnomad/counter-dashboard:v1"
       }
     }
   }
 }
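
Assuming the job file above is saved as countdash.nomad, it can be started and checked with something like the following (the client address is a placeholder):

nomad job run countdash.nomad
nomad job status countdash
curl -I http://<client-ip>:9002   # the dashboard should answer on the static port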
  6. While the job is running, create a new token. Save that new token to the Nomad agent's /etc/nomad/config.json:
{
  "consul": {
    "token": "9876543"
  }
...
  7. Restart the Nomad agent:
service nomad restart
  8. Attempt to redeploy/restart the countdash job (or start a new job that uses Consul Connect).

You will find that the job starts successfully; however, you will be unable to connect to the dashboard. Any future job that requires Consul Connect will also fail on that agent.
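
A rough way to observe the failure (addresses and IDs are placeholders):

nomad job status countdash                                 # allocations report as running
curl -v http://<client-ip>:9002                            # connection is refused or hangs
nomad alloc logs <alloc-id> connect-proxy-count-dashboard  # inspect the Envoy sidecar output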

Expected Behavior

Rotating a token should not require draining or rebooting a Nomad agent.

Actual behavior

When a token changes, the Nomad agent enters an unusable state that requires a reboot to fix.

Work Around

After rotating a Consul token, do a rolling reboot of the entire Nomad agent cluster (see the sketch below).
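
A sketch of that rolling reboot, run on each client node in turn (assuming draining allocations is acceptable in your environment):

nomad node drain -enable -yes -self    # migrate allocations off this client
sudo reboot
# once the machine is back up:
nomad node drain -disable -yes -self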

Recovering

The following attempts to revive the now-broken Nomad agent are unsuccessful:

service nomad restart
service consul restart
service docker restart
iptables -F CNI-FORWARD

The only way I've been able to recover the nomad agent is to physically reboot the machine.

I don't have a quick method to reproduce locally, but I have recorded a video of me reproducing it. I have reproduced it twice now in my environment.

https://youtu.be/OrVhA-gh4nM (Recommend watching at 4k)

Additional information

I will attempt to reproduce again and capture the logs. A couple of interesting log entries appear at about the same time:

Mar 28 21:54:06 sb-sand-nomadagent1 nomad[13102]:     2020-03-28T21:54:06.667Z [INFO]  client.gc: marking allocation for GC: alloc_id=8e258d0e-b5af-8cdc-1272-8cad2e84dc36
Mar 28 21:54:06 sb-sand-nomadagent1 nomad[13102]:     2020-03-28T21:54:06.674Z [ERROR] client.alloc_runner.runner_hook: failed to cleanup network for allocation, resources may have leaked: alloc_id=8e258d0e-b5af-8cdc-1272-8cad2e84dc36 alloc=8e258d0e-b5af-8cdc-1272-8cad2e84dc36 error="cni plugin not initialized"
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]:     2020-03-28T21:56:22.978Z [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api error="exit status 1" stderr="==> Failed looking up sidecar proxy info for _nomad-task-fd23333b-4d04-163c-c223-ad8c7a9b1eb4-group-api-count-api-9001: Unexpected response code: 403 (ACL not found)
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]: "
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]:     2020-03-28T21:56:22.979Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]:     2020-03-28T21:56:22.979Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api reason="Restart within policy" delay=15.298581092s
Mar 28 21:56:24 sb-sand-nomadagent1 nomad[13481]:     2020-03-28T21:56:24.078Z [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=1f1cc796-98b6-8d01-ccfa-3052ede8df49 task=connect-proxy-count-dashboard error="exit status 1" stderr="==> Failed looking up sidecar proxy info for _nomad-task-1f1cc796-98b6-8d01-ccfa-3052ede8df49-group-dashboard-count-dashboard-9002: Unexpected response code: 403 (ACL not found)
Mar 28 21:56:24 sb-sand-nomadagent1 nomad[13481]: "

Update 1

The biggest workaround for this issue is to avoid putting tokens in /etc/nomad/config.json. It is better to create a dedicated policy and attach the token to that policy (one possible approach is sketched below).
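
One possible alternative to hard-coding the token in config.json, assuming the agent runs under systemd and that Nomad falls back to the standard CONSUL_HTTP_TOKEN environment variable, is a drop-in unit override (paths and the token value are illustrative):

# /etc/systemd/system/nomad.service.d/consul-token.conf
[Service]
Environment=CONSUL_HTTP_TOKEN=<new-consul-token>

sudo systemctl daemon-reload && sudo systemctl restart nomad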

Labels: stage/needs-investigation, theme/consul/connect, type/bug

All 3 comments

Thank you for taking the time to report this, @spuder.

I think I have been able to reproduce the underlying bad behavior, which is that sometimes after restarting the Nomad server, something causes it to become unable to manage the network namespaces necessary for Connect. Quite possibly related to #7536 (again, thanks!).

I'm working on a minimal reproduction to help track down the problem.

We've narrowed this down to a problem with our use of the go-cni plugin. I've updated the title to better reflect what's happening - reproduction is as simple as (sketched as shell commands after the list):
1) run Nomad with a usable Connect configuration (no ACLs required), e.g. sudo nomad agent -dev-connect
2) run a Connect job that makes use of static port forwarding (e.g. nomad job init -connect -short && nomad job run example.nomad)
3) restart the Nomad agent
4) stop the job (the first CNI plugin error messages appear)
5) start the job (static port mapping no longer works)
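
Spelled out as shell commands (example.nomad is the file produced by nomad job init; how the agent is restarted depends on how it was started, so the systemctl line is only a stand-in):

sudo nomad agent -dev-connect            # terminal 1: dev agent with Connect enabled
nomad job init -connect -short           # terminal 2: writes example.nomad
nomad job run example.nomad
sudo systemctl restart nomad             # restart the Nomad agent
nomad job stop <job>                     # first CNI plugin errors appear in client logs
nomad job run example.nomad              # static port mapping no longer works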

Hey @spuder thanks again for the detailed bug report. We believe we've fixed this in the release candidate for 0.11 that was just announced. https://releases.hashicorp.com/nomad/0.11.0-rc1/
