Nomad not cleaning Consul services/healthchecks after failed deploy [0.9.x]

Created on 3 Jun 2019 · 7 comments · Source: hashicorp/nomad

Nomad version

Noticed on 0.9.0. Problem present on 0.9.2-rc1 and 0.9.1 as well

nomad version
Nomad v0.9.2-rc1 (98e75251316173bd2e6f6392dceea5f2d2182e17)

Operating system and Environment details

Running on Ubuntu
Nomad single node, both client and server
Consul single node, both client and server (v1.4.4) - tested with v1.5.1 as well
Vault single node (v1.1.0)
Tried with and without ACLs enabled (both nomad and consul)

Issue

When provisioning a job that contains a template referencing a Vault secret, the following happens:

  • Deployment starts
  • Job gets scheduled and the task gets allocated
  • Service gets registered in Consul
  • Nomad deployment fails with the error:
Template failed: vault.read(aws/creds/my-creds): vault.read(aws/creds/my-creds): Error making API request. URL: GET http://127.0.0.1:8200/v1/aws/creds/my-creds Code: 403. Errors: * 1 error occurred: * permission denied
  • The healthcheck is still registered in Consul, failing

Reproduction steps

  • Provision a cluster with Nomad, Consul and Vault running
  • Run Nomad and ensure it is registered properly in Consul
  • Try nomad run my-job.nomad (see the job file given below).
  • You should see Nomad register a first healthcheck as failing in Consul (all good here)
  • The allocation fails
  • Check Consul's services: they still list the service with a failing healthcheck (see the example query after this list).
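
A minimal sketch of that last check, assuming the single-node setup above with the Consul agent on its default address and the service registered under the name my-task; only the standard Consul CLI and HTTP API are used:

# List the services known to the local Consul agent
consul catalog services

# Show the checks attached to the my-task service; the failing check
# is still listed even though the allocation has failed
curl -s http://127.0.0.1:8500/v1/health/checks/my-task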

Job file (if appropriate)

This has been tested and reproduced with the following job file:

job "my-task2" {
  datacenters = ["mickael"]
  type = "service"
  group "my-task2" {
    count = 1
    task "my-task" {
      driver = "docker"
      shutdown_delay = "10s"
      resources {
        memory = 1500
        cpu = 2000
        network {
          port "service_port" {
          }
        }
      }

      template {
        data = <<EOH
          {{ with secret "aws/creds/my-task" }}
AWS_ACCESS_KEY_ID={{ .Data.access_key }}
AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
          {{ end }}
        EOH
        destination = "secrets/creds.env"
        env         = true
        change_mode = "restart"
      }

      config {
        image        = "nginx"
        port_map {
          service_port = 80
        }
        labels {
          build_id = "1"
        }
        dns_servers = [
          "172.17.0.1"
        ]
      }

      service {
        name = "my-task"
        tags = [
                 "nomad",
                 "my-task",
          ]
        port = "service_port"
        check {
          name     = "serviceHealth_my-task"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      vault {
        policies = [
                    "nomad_my-task",
                   ]
        change_mode = "restart"
      }
    }
  }
}

Also reproduced with a custom update/reschedule policy, which led to two allocations in total and two failing healthchecks registered in Consul. (We could match the Consul healthchecks with Nomad allocations via the registered port; a sketch of that follows the stanzas below.)

    reschedule {
        attempts       = 15
        interval       = "30m"
        delay          = "5s"
        delay_function = "exponential"
        max_delay      = "180s"
        unlimited      = false
    }

    update {
      max_parallel = 1
      min_healthy_time = "10s"
      healthy_deadline = "570s"
      progress_deadline = "575s"
      auto_revert = true
    }
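
A minimal sketch of that port cross-reference, assuming jq is installed and both agents listen on their default local addresses; <alloc-id> is a placeholder for an allocation ID taken from the job status output:

# Ports allocated by Nomad (see the Allocation Addresses section)
nomad job status my-task2
nomad alloc status <alloc-id>

# Ports of the services still registered with the local Consul agent
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'map_values({Service, Port})'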

Logs

Not much information was found in the Nomad logs. The Consul logs are clean as well.
nomad.log

Consul configured:

  • With/without ACLs whitelist mode (tested both)
  • with encryption

Nomad configured:

  • With/Without ACLs (tested both)
  • TLS enabled
  • with encryption
  • vault and consul integration enabled
Labels: stage/needs-investigation, theme/client, theme/consul, type/bug

All 7 comments

Hi, this happened to me too. I was able to reproduce it consistently by doing the following:

# Start vault server (tested with v1.1.0)
$ vault server -dev -dev-root-token-id=root

# Start consul server (tested with v1.5.1)
$ consul agent -dev

# Start nomad server and client (tested with 0.9.2)
$ nomad agent -dev -vault-enabled -vault-token=root -vault-address=http://127.0.0.1:8200

Use a job similar to the one provided by @MagicMicky:

job "my-job" {
  datacenters = ["dc1"]
  type        = "service"

  group "my-group" {
    count = 1

    task "my-task" {
      driver         = "raw_exec"
      shutdown_delay = "10s"

      resources {
        memory = 150
        cpu    = 200

        network {
          port "service" {}
        }
      }

      template {
        data = <<EOH
          {{ with secret "aws/creds/my-task" }}
AWS_ACCESS_KEY_ID={{ .Data.access_key }}
AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
          {{ end }}
        EOH

        destination = "secrets/creds.env"
        env         = true
        change_mode = "restart"
      }

      config {
        command = "echo"
        args    = ["${NOMAD_PORT_service}"]
      }

      service {
        name = "my-task-service"
        tags = ["nomad", "my-task"]
        port = "service"

        check {
          name     = "serviceHealth_my-task-failing"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      vault {
        policies = ["nomad_my-task"]
        change_mode = "restart"
      }
    }
  }
}
# Run the job
nomad run job.hcl

On the first run the service doesn't get registered in Consul, so everything works fine, but if we change something in the job (the memory, for example) and run it again, the service gets registered and remains there.

[Screenshot: failing check registered in Consul]
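
A minimal sketch of that sequence against the dev agents started above, assuming jq is installed; the memory bump is just an arbitrary edit to force a new allocation:

# first run: the template fails and nothing is left in Consul
nomad run job.hcl

# edit job.hcl (e.g. bump memory from 150 to 151), then run again
nomad run job.hcl

# the service and its failing check are now registered and stay there
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys'
curl -s http://127.0.0.1:8500/v1/agent/checks | jq 'keys'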

If we keep modifying something that makes the task run again, new checks appear

[Screenshot: additional failing checks registered in Consul]

Hi again, I launched nomad with TRACE logging and saw the following:

nomad agent -dev -vault-enabled -vault-token=root -vault-address=http://127.0.0.1:8200 -log-level=trace | grep consul
# nomad run job.hcl

2019/06/06 12:14:45.771551 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":"          {{ with secret \"aws/creds/my-task\" }}\nAWS_ACCESS_KEY_ID={{ .Data.access_key }}\nAWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}\n          {{ end }}\n        ","CreateDestDirs":true,"Destination":"/tmp/NomadClient772661944/db9dba9b-8348-505d-0a9b-148576a25fb4/my-task/secrets/creds.env","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"{{","RightDelim":"}}"}],"Vault":{"Address":"http://127.0.0.1:8200","Enabled":true,"Grace":15000000000,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":false},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}}

# modify job and  nomad run

    2019-06-06T12:15:09.461+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=db9dba9b-8348-505d-0a9b-148576a25fb4 task=my-task name=consul_services start="2019-06-06 12:15:09.461388264 +0200 CEST m=+103.632854633"
    2019-06-06T12:15:09.461+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=db9dba9b-8348-505d-0a9b-148576a25fb4 task=my-task name=consul_services end="2019-06-06 12:15:09.461460589 +0200 CEST m=+103.632926941" duration=72.308µs
    2019-06-06T12:15:09.465+0200 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0
    2019/06/06 12:15:09.499951 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":"          {{ with secret \"aws/creds/my-task\" }}\nAWS_ACCESS_KEY_ID={{ .Data.access_key }}\nAWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}\n          {{ end }}\n        ","CreateDestDirs":true,"Destination":"/tmp/NomadClient772661944/f70e1b93-14a8-2091-5b46-ef1d1b962501/my-task/secrets/creds.env","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"{{","RightDelim":"}}"}],"Vault":{"Address":"http://127.0.0.1:8200","Enabled":true,"Grace":15000000000,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":false},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}}

# nomad stop -purge job

    2019-06-06T12:15:47.406+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=f70e1b93-14a8-2091-5b46-ef1d1b962501 task=my-task name=consul_services start="2019-06-06 12:15:47.406246758 +0200 CEST m=+141.577713188"
    2019-06-06T12:15:47.406+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=f70e1b93-14a8-2091-5b46-ef1d1b962501 task=my-task name=consul_services end="2019-06-06 12:15:47.40648447 +0200 CEST m=+141.577950904" duration=237.716µs
    2019-06-06T12:15:47.416+0200 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0

In the end the service is still registered with 2 checks.

[Screenshot: service still registered with two checks in Consul]

I guess this has to do with the service update hook being called.

We are seeing this as well

We were able to reproduce this behavior as well using this simple job:

job "test" {
    region = "a1"
    datacenters = ["a1"]
    type = "service"
    group "tg" {
        count = 1
        task "tt" {
            driver = "docker"
            config {
                image = "does/not/exist"
            }
            service {
                port = "http"
            }
            resources {
                network {
                    port "http" {}
                }
            }
            vault {
                policies = ["doesnotexist"]
            }
        }
    }
}

The docker image does not exist, so the job continues to restart. With each restart, we get another service entry in Consul. (Because there is no defined check, they all show up as healthy services in Consul.)
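
A minimal sketch of observing that accumulation, assuming a local Consul agent and jq; the generated service IDs are whatever Nomad created for each restart:

# one more entry appears after every restart, so the count keeps growing
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys | length'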

nomad job stop -purge test will stop the job and remove it from Nomad, but the service entries remain in Consul.
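
A minimal sketch of verifying that, assuming the service was advertised under Nomad's default derived name test-tg-tt (an assumption based on the job/group/task names above; the real name is whatever the catalog lists):

nomad job stop -purge test
nomad job status test        # the job is gone from Nomad

# ...but the service entries remain in the Consul catalog
consul catalog services
curl -s http://127.0.0.1:8500/v1/catalog/service/test-tg-tt | jq 'map(.ServiceID)'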

The problem does not show up if we remove the vault stanza -- even though the policy doesn't exist and we never try to access a Vault secret.

Service entries in Consul are cleaned up if the Nomad client is restarted on the node where the job was attempting to run.
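
A minimal sketch of that workaround on the Ubuntu node described above, assuming the Nomad client runs as a systemd unit named nomad (in -dev mode, simply restarting the foreground agent has the same effect):

# on the client node where the job was attempting to run
sudo systemctl restart nomad

# once the client has re-synced with Consul, the stale entries are gone
curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys'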

[Screenshot, 2019-06-07 2:00:17 PM]

[Screenshot, 2019-06-07 2:00:26 PM]

@endocrimes -- thank you so much!

Is this merged into the 0.9.4 RC?

This was released in 0.9.3 (https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#093-june-12-2019)
