Noticed on 0.9.0. Problem present on 0.9.2-rc1 and 0.9.1 as well
nomad version
Nomad v0.9.2-rc1 (98e75251316173bd2e6f6392dceea5f2d2182e17)
Running on Ubuntu
Nomad single node, both client and server
Consul single node, both client and server (v1.4.4) - tested with v1.5.1 as well
Vault single node (v1.1.0)
Tried with and without ACLs enabled (both Nomad and Consul)
When provisioning a job that contains a template referencing a Vault secret, the following happens:
Template failed: vault.read(aws/creds/my-creds): vault.read(aws/creds/my-creds): Error making API request.
URL: GET http://127.0.0.1:8200/v1/aws/creds/my-creds
Code: 403. Errors:
* 1 error occurred:
* permission denied
Reproduction: nomad run my-job.nomad (see file given below). This has been tested and reproduced with the following job file:
job "my-task2" {
datacenters = ["mickael"]
type = "service"
group "my-task2" {
count = 1
task "my-task" {
driver = "docker"
shutdown_delay = "10s"
resources {
memory = 1500
cpu = 2000
network {
port "service_port" {
}
}
}
template {
data = <<EOH
{{ with secret "aws/creds/my-task" }}
AWS_ACCESS_KEY_ID={{ .Data.access_key }}
AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
{{ end }}
EOH
destination = "secrets/creds.env"
env = true
change_mode = "restart"
}
config {
image = "nginx"
port_map {
service_port = 80
}
labels {
build_id = "1"
}
dns_servers = [
"172.17.0.1"
]
}
service {
name = "my-task"
tags = [
"nomad",
"my-task",
]
port = "service_port"
check {
name = "serviceHealth_my-task"
type = "http"
path = "/"
interval = "10s"
timeout = "2s"
}
}
vault {
policies = [
"nomad_my-task",
]
change_mode = "restart"
}
}
}
}
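For context on the 403: that is the kind of error you get when the Vault policy attached to the task (nomad_my-task here) does not grant read on the secret path the template uses. A minimal sketch of such a policy, assuming the AWS secrets engine is mounted at aws/ (illustrative only, not the exact policy in use on this cluster):

# Illustrative policy sketch; adjust the path to whatever the template reads.
$ vault policy write nomad_my-task - <<'EOF'
path "aws/creds/my-task" {
  capabilities = ["read"]
}
EOF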
Also reproduced with a custom update/reschedule policy, which led to two allocations in total, and to two failing health checks registered in Consul. (We could match the Consul health checks to Nomad allocations via the registered port; a rough sketch of that follows the stanzas below.)
reschedule {
  attempts       = 15
  interval       = "30m"
  delay          = "5s"
  delay_function = "exponential"
  max_delay      = "180s"
  unlimited      = false
}

update {
  max_parallel      = 1
  min_healthy_time  = "10s"
  healthy_deadline  = "570s"
  progress_deadline = "575s"
  auto_revert       = true
}
Not much information was found in the Nomad logs.
Same for Consul; its logs are clean.
nomad.log
Consul configured:
Nomad configured:
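(The attached Consul and Nomad configuration files are not reproduced inline here. For orientation only, the relevant stanzas of a single-node Nomad agent pointing at a local Consul and Vault typically look something like the following; this is an illustrative sketch, not the attached files.)

# Illustrative sketch only; Vault token and TLS settings omitted.
consul {
  address = "127.0.0.1:8500"
}

vault {
  enabled = true
  address = "http://127.0.0.1:8200"
}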
Hi, this happened to me too. I was able to reproduce it consistently by doing the following:
# Start vault server (tested with v1.1.0)
$ vault server -dev -dev-root-token-id=root
# Start consul server (tested with v1.5.1)
$ consul agent -dev
# Start nomad server and client (tested with 0.9.2)
$ nomad agent -dev -vault-enabled -vault-token=root -vault-address=http://127.0.0.1:8200
Use a job (similar to the one provided by @MagicMicky):
job "my-job" {
datacenters = ["dc1"]
type = "service"
group "my-group" {
count = 1
task "my-task" {
driver = "raw_exec"
shutdown_delay = "10s"
resources {
memory = 150
cpu = 200
network {
port "service" {}
}
}
template {
data = <<EOH
{{ with secret "aws/creds/my-task" }}
AWS_ACCESS_KEY_ID={{ .Data.access_key }}
AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
{{ end }}
EOH
destination = "secrets/creds.env"
env = true
change_mode = "restart"
}
config {
command = "echo"
args = ["${NOMAD_PORT_service}"]
}
service {
name = "my-task-service"
tags = ["nomad", "my-task"]
port = "service"
check {
name = "serviceHealth_my-task-failing"
type = "http"
path = "/"
interval = "10s"
timeout = "2s"
}
}
vault {
policies = ["nomad_my-task"]
change_mode = "restart"
}
}
}
}
# Run the job
$ nomad run job.hcl
On the first run the service doesn't get inserted into Consul, so it works fine, but if we change something in the job (the memory, for example) and run it again, it gets registered and remains there.

If we keep modifying something that makes the task run again, new checks appear.
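To watch the leftover entries pile up, you can query the local Consul agent directly, something along these lines (assuming Consul's HTTP API on the default port and jq installed):

# Count the services and checks registered on the local Consul agent.
$ curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys'
$ curl -s http://127.0.0.1:8500/v1/agent/checks | jq 'keys'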

Hi again, I launched nomad with TRACE logging and saw the following:
nomad agent -dev -vault-enabled -vault-token=root -vault-address=http://127.0.0.1:8200 -log-level=trace | grep consul
# nomad run job.hcl
2019/06/06 12:14:45.771551 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":" {{ with secret \"aws/creds/my-task\" }}\nAWS_ACCESS_KEY_ID={{ .Data.access_key }}\nAWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}\n {{ end }}\n ","CreateDestDirs":true,"Destination":"/tmp/NomadClient772661944/db9dba9b-8348-505d-0a9b-148576a25fb4/my-task/secrets/creds.env","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"{{","RightDelim":"}}"}],"Vault":{"Address":"http://127.0.0.1:8200","Enabled":true,"Grace":15000000000,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":false},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}}
# modify job and nomad run
2019-06-06T12:15:09.461+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=db9dba9b-8348-505d-0a9b-148576a25fb4 task=my-task name=consul_services start="2019-06-06 12:15:09.461388264 +0200 CEST m=+103.632854633"
2019-06-06T12:15:09.461+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=db9dba9b-8348-505d-0a9b-148576a25fb4 task=my-task name=consul_services end="2019-06-06 12:15:09.461460589 +0200 CEST m=+103.632926941" duration=72.308µs
2019-06-06T12:15:09.465+0200 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0
2019/06/06 12:15:09.499951 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":" {{ with secret \"aws/creds/my-task\" }}\nAWS_ACCESS_KEY_ID={{ .Data.access_key }}\nAWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}\n {{ end }}\n ","CreateDestDirs":true,"Destination":"/tmp/NomadClient772661944/f70e1b93-14a8-2091-5b46-ef1d1b962501/my-task/secrets/creds.env","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"{{","RightDelim":"}}"}],"Vault":{"Address":"http://127.0.0.1:8200","Enabled":true,"Grace":15000000000,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":false},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}}
# nomad stop -purge job
2019-06-06T12:15:47.406+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=f70e1b93-14a8-2091-5b46-ef1d1b962501 task=my-task name=consul_services start="2019-06-06 12:15:47.406246758 +0200 CEST m=+141.577713188"
2019-06-06T12:15:47.406+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=f70e1b93-14a8-2091-5b46-ef1d1b962501 task=my-task name=consul_services end="2019-06-06 12:15:47.40648447 +0200 CEST m=+141.577950904" duration=237.716µs
2019-06-06T12:15:47.416+0200 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0
In the end the service is still registered with 2 checks.

I guess this has to do with the service update hook (consul_services) being called.
We are seeing this as well
We were able to reproduce this behavior as well using this simple job:
job "test" {
region = "a1"
datacenters = ["a1"]
type = "service"
group "tg" {
count = 1
task "tt" {
driver = "docker"
config {
image = "does/not/exist"
}
service {
port = "http"
}
resources {
network {
port "http" {}
}
}
vault {
policies = ["doesnotexist"]
}
}
}
}
The docker image does not exist, so the job continues to restart. With each restart, we get another service entry in Consul. (Because there is no defined check, they all show up as healthy services in Consul.)
nomad job stop -purge test will stop the job and remove it from Nomad, but the service entries remain in Consul.
The problem does not show up if we remove the vault stanza -- even though the policy doesn't exist and we never try to access a Vault secret.
Service entries in Consul are cleaned up if the Nomad client is restarted on the node where the job was attempting to run.
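For what it's worth, the leftover entries can also be removed by hand against the local Consul agent instead of restarting the client; a sketch (the service ID is a placeholder taken from the agent's service list):

# Find the orphaned service ID, then deregister it (and its checks).
$ curl -s http://127.0.0.1:8500/v1/agent/services | jq 'keys'
$ consul services deregister -id=<leftover-service-id>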


@endocrimes -- thank you so much!
Is this merged into the 0.9.4 RC?
This was released in 0.9.3 (https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#093-june-12-2019)