Nomad v0.10.2 (0d2d6e3dc5a171c21f8f31fa117c8a765eb4fc02)
CentOS Linux 7
When making certain changes to a jobfile and re-running the job, allocations can get stuck in a running state. I believe all allocations should be stopped when you run nomad stop example but that does not happen after some jobfile changes.
Reproduction steps:

nomad run example.nomad
nomad run example_modified.nomad
nomad stop example
nomad job status example
sudo systemctl restart nomad
nomad job status example

example.nomad:

job "example" {
  datacenters = ["dal12"]

  group "redis1" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-redis1"
      port = 6379

      connect {
        sidecar_service {}
      }
    }

    task "redis1" {
      driver = "docker"

      config {
        image = "redis:5.0.5"
      }

      resources {
        cpu    = 500  # MHz
        memory = 1024 # MB
      }
    }
  }

  group "redis2" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-redis2"
      port = 6379

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "example-redis1"
              local_bind_port  = 6378
            }
          }
        }
      }
    }

    task "redis2" {
      driver = "docker"

      config {
        image = "redis:5.0.5"
      }

      resources {
        cpu    = 500  # MHz
        memory = 1024 # MB
      }
    }
  }
}
example_modified.nomad (added a port to the redis1 network, then referenced it in the service):
job "example" {
datacenters = ["dal12"]
group "redis1" {
network {
mode = "bridge"
port "tcp" {}
}
service {
name = "example-redis1"
port = "tcp"
connect {
sidecar_service {}
}
}
task "redis1" {
driver = "docker"
config {
image = "redis:5.0.5"
}
resources {
cpu = 500 # MHz
memory = 1024 # MB
}
}
}
group "redis2" {
network {
mode = "bridge"
}
service {
name = "example-redis2"
port = 6379
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "example-redis1"
local_bind_port = 6378
}
}
}
}
}
task "redis2" {
driver = "docker"
config {
image = "redis:5.0.5"
}
resources {
cpu = 500 # MHz
memory = 1024 # MB
}
}
}
}
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.664-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=connect-proxy-example-redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.connect-proxy-example-redis2.stdout.fifo @module=logmon timestamp=2019-12-18T15:40:01.664-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.664-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=connect-proxy-example-redis2 @module=logmon path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.connect-proxy-example-redis2.stderr.fifo timestamp=2019-12-18T15:40:01.664-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.665-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.redis2.stdout.fifo @module=logmon timestamp=2019-12-18T15:40:01.665-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.665-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.redis2.stderr.fifo @module=logmon timestamp=2019-12-18T15:40:01.665-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.717-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=26475e14c95b71f02d9cba75eb22ed9fb0b57e0fdf722516e50b4e136af897dc
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.789-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=1b2760e87124fc0cf90b338bc6c9b9603a78fff774a4ee8ea640a57e9a5d3bc9
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.933-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=26475e14c95b71f02d9cba75eb22ed9fb0b57e0fdf722516e50b4e136af897dc
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.982-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=1b2760e87124fc0cf90b338bc6c9b9603a78fff774a4ee8ea640a57e9a5d3bc9
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.redis1.stdout.fifo timestamp=2019-12-18T15:40:27.077-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.redis1.stderr.fifo timestamp=2019-12-18T15:40:27.078-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.connect-proxy-example-redis1.stdout.fifo timestamp=2019-12-18T15:40:27.078-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.079-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.connect-proxy-example-redis1.stderr.fifo timestamp=2019-12-18T15:40:27.079-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.124-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=ebdf5b40e60ee4a577c97e84806d27e96e0f8e13326d603ca2bf963caa88d469
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.198-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=625b58f8da98cd4e97b2d9a9803d20d801e3e2a69c6808d3daa23ad64392a820
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.326-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=ebdf5b40e60ee4a577c97e84806d27e96e0f8e13326d603ca2bf963caa88d469
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.393-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=625b58f8da98cd4e97b2d9a9803d20d801e3e2a69c6808d3daa23ad64392a820
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.732-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 error="container stopped"
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.737-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 error="container stopped"
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.749-0600 [INFO] client.gc: marking allocation for GC: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.connect-proxy-example-redis1.stdout.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.connect-proxy-example-redis1.stderr.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.redis1.stdout.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.redis1.stderr.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.539-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=85b298db651344f50999b199a28260b57376211aea2ea2bf2efe4b177488516c
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.603-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=304271a6074ee19e068579bc67dccab7ffa29d104c9748ea6e7b6733c62c3f83
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.697-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=85b298db651344f50999b199a28260b57376211aea2ea2bf2efe4b177488516c
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.772-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=304271a6074ee19e068579bc67dccab7ffa29d104c9748ea6e7b6733c62c3f83
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.861-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 error="container stopped"
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.920-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 error="container stopped"
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.928-0600 [INFO] client.gc: marking allocation for GC: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed
Nothing of interest was logged on the servers during this time.
Hi @cjonesy and thanks for reporting this! I was able to reproduce using these job files in a local development environment.
After running the job update:
▶ nomad job status example
ID = example
Name = example
Submit Date = 2019-12-19T09:09:47-05:00
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
redis1 0 0 1 0 1 0
redis2 0 0 1 0 0 0
Latest Deployment
ID = e8799bae
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
redis1 1 1 0 0 2019-12-19T14:19:47Z
redis2 1 1 0 0 2019-12-19T14:19:47Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
270c6c11 a804c158 redis1 1 run running 10s ago 8s ago
c95b4e06 a804c158 redis1 0 stop complete 2m13s ago 9s ago
eb69149c a804c158 redis2 1 run running 2m13s ago 10s ago
▶ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d53063fd75b1 envoyproxy/envoy:v1.11.2 "/docker-entrypoint.…" 35 seconds ago Up 34 seconds connect-proxy-example-redis1-270c6c11-e057-565e-fa40-a86a3e6bf670
9da1af1c2135 redis:5.0.5 "docker-entrypoint.s…" 35 seconds ago Up 34 seconds redis1-270c6c11-e057-565e-fa40-a86a3e6bf670
0903c5986424 gcr.io/google_containers/pause-amd64:3.0 "/pause" 36 seconds ago Up 34 seconds nomad_init_270c6c11-e057-565e-fa40-a86a3e6bf670
446990b5a95a redis:5.0.5 "docker-entrypoint.s…" 2 minutes ago Up 2 minutes redis2-eb69149c-0a59-f550-05b9-f021a8158de0
43ea68684b33 envoyproxy/envoy:v1.11.2 "/docker-entrypoint.…" 2 minutes ago Up 2 minutes connect-proxy-example-redis2-eb69149c-0a59-f550-05b9-f021a8158de0
fb589efa3afc gcr.io/google_containers/pause-amd64:3.0 "/pause" 2 minutes ago Up 2 minutes nomad_init_eb69149c-0a59-f550-05b9-f021a8158de0
After running the job stop:
▶ nomad job status example
ID = example
Name = example
Submit Date = 2019-12-19T09:09:47-05:00
Type = service
Priority = 50
Datacenters = dc1
Status = dead (stopped)
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
redis1 0 0 0 0 2 0
redis2 0 0 1 0 0 0
Latest Deployment
ID = e8799bae
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
redis1 1 1 1 0 2019-12-19T14:19:59Z
redis2 1 1 1 0 2019-12-19T14:19:57Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
270c6c11 a804c158 redis1 1 stop complete 1m58s ago 45s ago
c95b4e06 a804c158 redis1 0 stop complete 4m1s ago 1m57s ago
eb69149c a804c158 redis2 1 stop running 4m1s ago 45s ago
▶ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
446990b5a95a redis:5.0.5 "docker-entrypoint.s…" 3 minutes ago Up 3 minutes redis2-eb69149c-0a59-f550-05b9-f021a8158de0
43ea68684b33 envoyproxy/envoy:v1.11.2 "/docker-entrypoint.…" 3 minutes ago Up 3 minutes connect-proxy-example-redis2-eb69149c-0a59-f550-05b9-f021a8158de0
fb589efa3afc gcr.io/google_containers/pause-amd64:3.0 "/pause" 3 minutes ago Up 3 minutes nomad_init_eb69149c-0a59-f550-05b9-f021a8158de0
Relevant window of the logs:
2019-12-19T14:10:59.805Z [DEBUG] http: request complete: method=DELETE path=/v1/job/example?purge=false duration=467.067µs
2019-12-19T14:10:59.805Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a job_id=example namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 2)
Desired Changes for "redis1": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "redis2": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)"
2019-12-19T14:10:59.806Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a duration=85.613µs
2019-12-19T14:10:59.807Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a/allocations duration=70.672µs
2019-12-19T14:10:59.807Z [DEBUG] worker: submitted plan for evaluation: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a
2019-12-19T14:10:59.807Z [DEBUG] worker.service_sched: setting eval status: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a job_id=example namespace=default status=complete
2019-12-19T14:10:59.807Z [DEBUG] worker: updated evaluation: eval="<Eval "39c2244c-b8b4-5f9a-410f-b01a372d720a" JobID: "example" Namespace: "default">"
2019-12-19T14:10:59.807Z [DEBUG] worker: ack evaluation: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a
2019-12-19T14:10:59.808Z [DEBUG] client: updated allocations: index=43 total=3 pulled=2 filtered=1
2019-12-19T14:10:59.808Z [DEBUG] client: allocation updates: added=0 removed=0 updated=2 ignored=1
2019-12-19T14:10:59.808Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=2 ignored=1 errors=0
2019-12-19T14:10:59.916Z [DEBUG] client: updated allocations: index=45 total=3 pulled=1 filtered=2
2019-12-19T14:10:59.916Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
2019-12-19T14:10:59.916Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
2019-12-19T14:10:59.916Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
2019-12-19T14:10:59.979Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1 error="container stopped"
2019-12-19T14:10:59.982Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/gopath/bin/nomad pid=6262
2019-12-19T14:10:59.982Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
2019-12-19T14:10:59.985Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:72e91d8680d853b874d9aedda3a4b61048630d2043dd490ff36f5b0929f69874 references=3
2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1 path=/opt/gopath/bin/nomad pid=6069
2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1
2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1
2019-12-19T14:11:00.062Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1 error="container stopped"
2019-12-19T14:11:00.064Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/gopath/bin/nomad pid=6229
2019-12-19T14:11:00.064Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
2019-12-19T14:11:00.069Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:63130206b0fa808e4545a0cb4a1f14f6d40b8a7e2e6fda0a31fd326c2ac0971c references=2
2019-12-19T14:11:00.069Z [INFO] client.gc: marking allocation for GC: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670
2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1 path=/opt/gopath/bin/nomad pid=6068
2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1
2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1
2019-12-19T14:11:00.120Z [DEBUG] client: updated allocations: index=46 total=3 pulled=1 filtered=2
2019-12-19T14:11:00.120Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
2019-12-19T14:11:00.120Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
2019-12-19T14:11:00.120Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
2019-12-19T14:11:00.316Z [DEBUG] client: updated allocations: index=47 total=3 pulled=1 filtered=2
2019-12-19T14:11:00.316Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
2019-12-19T14:11:00.316Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
2019-12-19T14:11:00.316Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
2019-12-19T14:11:00.337Z [DEBUG] consul.sync: sync complete: registered_services=0 deregistered_services=2 registered_checks=0 deregistered_checks=0
2019-12-19T14:11:00.519Z [DEBUG] client: updated allocations: index=48 total=3 pulled=1 filtered=2
2019-12-19T14:11:00.519Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
2019-12-19T14:11:00.519Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
2019-12-19T14:11:00.519Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
2019-12-19T14:11:00.813Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a duration=189.924µs
2019-12-19T14:11:00.814Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a/allocations duration=88.981µs
This log entry looks promising for figuring out what happened here: client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
With a bit of debugging I've determined at this point that the task runner loop itself seems to be ok, and the scheduler is sending the update w/ a desired status of stop to the allocation as we'd expect.
The "discarding allocation update" log message is coming from alloc_runner.go#L697. When we get an update for an allocation, we check to make sure we don't have one queued up that's more recent. In this case, we find that there's one queued up with the same modify index as the one we have, which is definitely a clue.
What I've found is that when we submit the update for stop, it's never being dequeued, because we never returned from (ar *allocRunner) handleAllocUpdate for the first update. This explains why we only see the behavior after jobfile changes.
With that information in hand and some trace-level debugging I've narrowed it down to the call into (h *groupServiceHook) Update, which never returns. I suspect that this may be related to https://github.com/hashicorp/nomad/issues/6459 in some way. (cc @shoenig as a heads up)
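To make the failure mode concrete, here is a minimal hypothetical Go sketch (again, not Nomad's real code; names are illustrative) of a serial update loop that calls a hook's Update for each dequeued update. If that hook call never returns, the later stop update stays queued forever, which matches the behavior above:

// Hypothetical sketch of a serial update loop wedged by a blocking hook.
package main

import (
    "fmt"
    "time"
)

type update struct{ desired string }

type groupServiceHook struct{}

// Update simulates a hook call that never returns, as described above.
func (h *groupServiceHook) Update(u update) {
    select {} // blocks forever
}

func main() {
    updates := make(chan update, 1)
    hook := &groupServiceHook{}

    // The runner processes updates strictly one at a time.
    go func() {
        for u := range updates {
            fmt.Println("handling update:", u.desired)
            hook.Update(u) // wedges here on the first update
            fmt.Println("finished update:", u.desired)
        }
    }()

    updates <- update{desired: "run"} // first update (the jobfile change)
    time.Sleep(100 * time.Millisecond)
    updates <- update{desired: "stop"} // queued, but never dequeued
    time.Sleep(500 * time.Millisecond)
    fmt.Println("stop update still queued; the allocation keeps running")
}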
After a quick chat w/ @shoenig I've assigned it to him to dig into further while working on #6459
Just chiming in here and seconding this issue. This happens even with the Consul Connect example with the count-api and count-dashboard, and I've had this happen with other jobs as well.
Not only do allocations not always stop, but the alloc directory in Nomad's data directory remains, which seems to make Nomad re-register the services with Consul as well, making it impossible to get rid of them. I also tried stopping Nomad and forcefully rm -rf'ing that alloc folder on the host, but the /secrets folder in that directory is busy and the command exits with the error "Device or resource busy: .../secrets".
We just ran into this, and it's really bad. The declared/reported state should always match the runtime state... isn't that the point of Nomad?
Instead, we're seeing allocations leak, and even Nomad agent restarts do not remove them. These allocations are not reported in nomad status, you cannot query them with nomad alloc status, they persist through kills, and they are kept around until we forcefully remove the corresponding allocation directories(!!). Also, we're not using Connect, so this is a general Nomad bug.
I had a short look at the implementation, and it seems there are several sources of truth for whether an allocation should exist (bad!), and these leaked allocations are not cleaned up from the BoltDB state properly.
So far I've been a fan of Nomad, but leaking processes... we don't have many agents at the moment, so we can clean up the mess manually, but what will happen when we scale to hundreds of nodes?
Small update: clearing the BoltDB entries (while the agent is stopped, of course) also works, which is less pesky than trying to remove the allocation directories with their busy mounted volumes (those will be picked up by GC later).
The BoltDB file is at <agent_dir>/client/state.db
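If you want to see what is actually sitting in that file before deleting anything, a read-only dump using the go.etcd.io/bbolt library is one option. This is only an inspection sketch; the bucket layout inside state.db is a Nomad-internal detail and may differ between versions, and the agent should be stopped first:

// Read-only dump of the top-level buckets and keys in Nomad's client
// state.db. Purely for inspection; the bucket layout is an internal
// detail of Nomad and may change between versions.
package main

import (
    "fmt"
    "log"
    "os"

    bolt "go.etcd.io/bbolt"
)

func main() {
    if len(os.Args) != 2 {
        log.Fatalf("usage: %s <agent_dir>/client/state.db", os.Args[0])
    }
    // Open read-only so we cannot corrupt the agent's state.
    db, err := bolt.Open(os.Args[1], 0600, &bolt.Options{ReadOnly: true})
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = db.View(func(tx *bolt.Tx) error {
        return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
            fmt.Printf("bucket: %s\n", name)
            return b.ForEach(func(k, v []byte) error {
                fmt.Printf("  key: %s (%d bytes)\n", k, len(v))
                return nil
            })
        })
    })
    if err != nil {
        log.Fatal(err)
    }
}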
I have the same problem, but no access to the nodes themselves. Is there any way to solve this without restarting the client?
@cjonesy I'm able to reproduce this with a version as late as v0.10.5, but not with a version v0.11.0 or later (which coincides with the fix that went in for #6459). Are you able to upgrade to a recent version of Nomad and confirm whether the problem persists?
@shoenig We recently upgraded to 0.11.3 and the problem no longer persists. Sounds like #6459 has indeed fixed the issue.
Thank you!