Nomad not cleaning Consul services/healthchecks after failed deploy [0.9.x]

Created on 3 Jun 2019 · 7 comments · Source: hashicorp/nomad

Nomad version

Noticed on 0.9.0. Problem present on 0.9.2-rc1 and 0.9.1 as well

nomad version
Nomad v0.9.2-rc1 (98e75251316173bd2e6f6392dceea5f2d2182e17)

Operating system and Environment details

Running on Ubuntu
Nomad single node, both client and server
Consul single node, both client and server (v1.4.4) - tested with v1.5.1 as well
Vault single node (v1.1.0)
Tried with and without ACLs enabled (both nomad and consul)

Issue

When provisioning a job that contains a template referencing a Vault secret, the following happens:

  • Deployment starts
  • Job gets scheduled and the task gets allocated
  • Service gets registered in Consul
  • Nomad deployment fails with the error:
Template failed: vault.read(aws/creds/my-creds): vault.read(aws/creds/my-creds): Error making API request. URL: GET http://127.0.0.1:8200/v1/aws/creds/my-creds Code: 403. Errors: * 1 error occurred: * permission denied
  • The healthcheck is still registered in Consul, failing

Reproduction steps

  • Provision a cluster with Nomad, Consul and Vault running
  • Run Nomad and ensure it is registered properly in Consul
  • Try nomad run my-job.nomad (see the job file given below).
  • You should see Nomad register a first healthcheck as failing in Consul (all good here)
  • The allocation fails
  • Check Consul's services: they still list the service with a failing healthcheck (see the example query after this list).
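
A minimal sketch of that last check, assuming the single-node setup above with the Consul agent on its default address and the service registered under the name my-task; only the standard Consul CLI and HTTP API are used:

# List the services known to the local Consul agent
consul catalog services

# Show the checks attached to the my-task service; the failing check
# is still listed even though the allocation has failed
curl -s http://127.0.0.1:8500/v1/health/checks/my-task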

Job file (if appropriate)

This has been tested and reproduced with the following job file:

job "my-task2" {
  datacenters = ["mickael"]
  type = "service"
  group "my-task2" {
    count = 1
    task "my-task" {
      driver = "docker"
      shutdown_delay = "10s"
      resources {
        memory = 1500
        cpu = 2000
        network {
          port "service_port" {
          }
        }
      }

      template {
        data = <<EOH
          {{ with secret "aws/creds/my-task" }}
AWS_ACCESS_KEY_ID={{ .Data.access_key }}
AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
          {{ end }}
        EOH
        destination = "secrets/creds.env"
        env         = true
        change_mode = "restart"
      }

      config {
        image        = "nginx"
        port_map {
          service_port = 80
        }
        labels {
          build_id = "1"
        }
        dns_servers = [
          "172.17.0.1"
        ]
      }

      service {
        name = "my-task"
        tags = [
                 "nomad",
                 "my-task",
          ]
        port = "service_port"
        check {
          name     = "serviceHealth_my-task"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      vault {
        policies = [
                    "nomad_my-task",
                   ]
        change_mode = "restart"
      }
    }
  }
}

Also reproduced with a custom update/reschedule policy, which led to two allocations in total and two failing healthchecks registered in Consul. (We could match the Consul healthchecks with Nomad allocations via the registered port; a sketch of that follows the stanzas below.)

    reschedule {
        attempts       = 15
        interval       = "30m"
        delay          = "5s"
        delay_function = "exponential"
        max_delay      = "180s"
        unlimited      = false
    }

    update {
      max_parallel = 1
      min_healthy_time = "10s"
      healthy_deadline = "570s"
      progress_deadline = "575s"
      auto_revert = true
    }
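
A minimal sketch of that port cross-reference, assuming jq is installed and both agents listen on their default local addresses; <alloc-id> is a placeholder for an allocation ID taken from the job status output:

# Ports allocated by Nomad (see the Allocation Addresses section)
nomad job status my-task2
nomad alloc status <alloc-id>

# Ports of the services still registered with the local Consul agent
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'map_values({Service, Port})'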

Logs

Not much information was found in the Nomad logs. The Consul logs are clean as well.
nomad.log

Consul configured:

  • With/without ACLs whitelist mode (tested both)
  • with encryption

Nomad configured:

  • With/Without ACLs (tested both)
  • TLS enabled
  • with encryption
  • vault and consul integration enabled
Labels: stage/needs-investigation, theme/client, theme/consul, type/bug

All 7 comments

Hi, this happened to me too. I was able to reproduce it consistently by doing the following:

# Start vault server (tested with v1.1.0)
$ vault server -dev -dev-root-token-id=root

# Start consul server (tested with v1.5.1)
$ consul agent -dev

# Start nomad server and client (tested with 0.9.2)
$ nomad agent -dev -vault-enabled -vault-token=root -vault-address=http://127.0.0.1:8200

Use a job similar to the one provided by @MagicMicky:

job "my-job" {
  datacenters = ["dc1"]
  type        = "service"

  group "my-group" {
    count = 1

    task "my-task" {
      driver         = "raw_exec"
      shutdown_delay = "10s"

      resources {
        memory = 150
        cpu    = 200

        network {
          port "service" {}
        }
      }

      template {
        data = <<EOH
          {{ with secret "aws/creds/my-task" }}
AWS_ACCESS_KEY_ID={{ .Data.access_key }}
AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
          {{ end }}
        EOH

        destination = "secrets/creds.env"
        env         = true
        change_mode = "restart"
      }

      config {
        command = "echo"
        args    = ["${NOMAD_PORT_service}"]
      }

      service {
        name = "my-task-service"
        tags = ["nomad", "my-task"]
        port = "service"

        check {
          name     = "serviceHealth_my-task-failing"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      vault {
        policies = ["nomad_my-task"]
        change_mode = "restart"
      }
    }
  }
}
# Run the job
nomad run job.hcl

On the first run the service doesn't get registered in Consul, so everything works fine, but if we change something in the job (the memory, for example) and run it again, the service gets registered and remains there.

[Screenshot: failing check registered in Consul]
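
A minimal sketch of that sequence against the dev agents started above, assuming jq is installed; the memory bump is just an arbitrary edit to force a new allocation:

# first run: the template fails and nothing is left in Consul
nomad run job.hcl

# edit job.hcl (e.g. bump memory from 150 to 151), then run again
nomad run job.hcl

# the service and its failing check are now registered and stay there
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys'
curl -s http://127.0.0.1:8500/v1/agent/checks | jq 'keys'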

If we keep modifying something that makes the task run again, new checks appear

[Screenshot: additional failing checks registered in Consul]

Hi again, I launched nomad with TRACE logging and saw the following:

nomad agent -dev -vault-enabled -vault-token=root -vault-address=http://127.0.0.1:8200 -log-level=trace | grep consul
# nomad run job.hcl

2019/06/06 12:14:45.771551 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":"          {{ with secret \"aws/creds/my-task\" }}\nAWS_ACCESS_KEY_ID={{ .Data.access_key }}\nAWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}\n          {{ end }}\n        ","CreateDestDirs":true,"Destination":"/tmp/NomadClient772661944/db9dba9b-8348-505d-0a9b-148576a25fb4/my-task/secrets/creds.env","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"{{","RightDelim":"}}"}],"Vault":{"Address":"http://127.0.0.1:8200","Enabled":true,"Grace":15000000000,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":false},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}}

# modify job and  nomad run

    2019-06-06T12:15:09.461+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=db9dba9b-8348-505d-0a9b-148576a25fb4 task=my-task name=consul_services start="2019-06-06 12:15:09.461388264 +0200 CEST m=+103.632854633"
    2019-06-06T12:15:09.461+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=db9dba9b-8348-505d-0a9b-148576a25fb4 task=my-task name=consul_services end="2019-06-06 12:15:09.461460589 +0200 CEST m=+103.632926941" duration=72.308µs
    2019-06-06T12:15:09.465+0200 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0
    2019/06/06 12:15:09.499951 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":"          {{ with secret \"aws/creds/my-task\" }}\nAWS_ACCESS_KEY_ID={{ .Data.access_key }}\nAWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}\n          {{ end }}\n        ","CreateDestDirs":true,"Destination":"/tmp/NomadClient772661944/f70e1b93-14a8-2091-5b46-ef1d1b962501/my-task/secrets/creds.env","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"{{","RightDelim":"}}"}],"Vault":{"Address":"http://127.0.0.1:8200","Enabled":true,"Grace":15000000000,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":false},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}}

# nomad stop -purge job

    2019-06-06T12:15:47.406+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=f70e1b93-14a8-2091-5b46-ef1d1b962501 task=my-task name=consul_services start="2019-06-06 12:15:47.406246758 +0200 CEST m=+141.577713188"
    2019-06-06T12:15:47.406+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=f70e1b93-14a8-2091-5b46-ef1d1b962501 task=my-task name=consul_services end="2019-06-06 12:15:47.40648447 +0200 CEST m=+141.577950904" duration=237.716µs
    2019-06-06T12:15:47.416+0200 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0

In the end the service is still registered with 2 checks.

[Screenshot: service still registered with two checks in Consul]

I guess this has to do with the service update hook being called.

We are seeing this as well

We were able to reproduce this behavior as well using this simple job:

job "test" {
    region = "a1"
    datacenters = ["a1"]
    type = "service"
    group "tg" {
        count = 1
        task "tt" {
            driver = "docker"
            config {
                image = "does/not/exist"
            }
            service {
                port = "http"
            }
            resources {
                network {
                    port "http" {}
                }
            }
            vault {
                policies = ["doesnotexist"]
            }
        }
    }
}

The docker image does not exist, so the job continues to restart. With each restart, we get another service entry in Consul. (Because there is no defined check, they all show up as healthy services in Consul.)
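
A minimal sketch of observing that accumulation, assuming a local Consul agent and jq; the generated service IDs are whatever Nomad created for each restart:

# one more entry appears after every restart, so the count keeps growing
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys | length'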

nomad job stop -purge test will stop the job and remove it from Nomad, but the service entries remain in Consul.
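
A minimal sketch of verifying that, assuming the service was advertised under Nomad's default derived name test-tg-tt (an assumption based on the job/group/task names above; the real name is whatever the catalog lists):

nomad job stop -purge test
nomad job status test        # the job is gone from Nomad

# ...but the service entries remain in the Consul catalog
consul catalog services
curl -s http://127.0.0.1:8500/v1/catalog/service/test-tg-tt | jq 'map(.ServiceID)'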

The problem does not show up if we remove the vault stanza -- even though the policy doesn't exist and we never try to access a Vault secret.

Service entries in Consul are cleaned up if the Nomad client is restarted on the node where the job was attempting to run.
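
A minimal sketch of that workaround on the Ubuntu node described above, assuming the Nomad client runs as a systemd unit named nomad (in -dev mode, simply restarting the foreground agent has the same effect):

# on the client node where the job was attempting to run
sudo systemctl restart nomad

# once the client has re-synced with Consul, the stale entries are gone
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys'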

[Screenshot, 2019-06-07 2:00:17 PM]

[Screenshot, 2019-06-07 2:00:26 PM]

@endocrimes -- thank you so much!

Is this merged into the 0.9.4 RC?

This was released in 0.9.3 (https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#093-june-12-2019)
