Nomad v0.10.2 (0d2d6e3dc5a171c21f8f31fa117c8a765eb4fc02)
CentOS Linux 7
When making certain changes to a jobfile and re-running the job, allocations can get stuck in a running state. I believe all allocations should be stopped when you run nomad stop example but that does not happen after some jobfile changes.
Reproduction steps:

nomad run example.nomad
nomad run example_modified.nomad
nomad stop example
nomad job status example
sudo systemctl restart nomad
nomad job status example

example.nomad:

job "example" {
  datacenters = ["dal12"]

  group "redis1" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-redis1"
      port = 6379

      connect {
        sidecar_service {}
      }
    }

    task "redis1" {
      driver = "docker"

      config {
        image = "redis:5.0.5"
      }

      resources {
        cpu    = 500  # MHz
        memory = 1024 # MB
      }
    }
  }

  group "redis2" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-redis2"
      port = 6379

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "example-redis1"
              local_bind_port  = 6378
            }
          }
        }
      }
    }

    task "redis2" {
      driver = "docker"

      config {
        image = "redis:5.0.5"
      }

      resources {
        cpu    = 500  # MHz
        memory = 1024 # MB
      }
    }
  }
}
example_modified.nomad (added a port to the redis1 network, then referenced it in the service):
job "example" {
datacenters = ["dal12"]
group "redis1" {
network {
mode = "bridge"
port "tcp" {}
}
service {
name = "example-redis1"
port = "tcp"
connect {
sidecar_service {}
}
}
task "redis1" {
driver = "docker"
config {
image = "redis:5.0.5"
}
resources {
cpu = 500 # MHz
memory = 1024 # MB
}
}
}
group "redis2" {
network {
mode = "bridge"
}
service {
name = "example-redis2"
port = 6379
connect {
sidecar_service {
proxy {
upstreams {
destination_name = "example-redis1"
local_bind_port = 6378
}
}
}
}
}
task "redis2" {
driver = "docker"
config {
image = "redis:5.0.5"
}
resources {
cpu = 500 # MHz
memory = 1024 # MB
}
}
}
}
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.664-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=connect-proxy-example-redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.connect-proxy-example-redis2.stdout.fifo @module=logmon timestamp=2019-12-18T15:40:01.664-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.664-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=connect-proxy-example-redis2 @module=logmon path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.connect-proxy-example-redis2.stderr.fifo timestamp=2019-12-18T15:40:01.664-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.665-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.redis2.stdout.fifo @module=logmon timestamp=2019-12-18T15:40:01.665-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.665-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.redis2.stderr.fifo @module=logmon timestamp=2019-12-18T15:40:01.665-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.717-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=26475e14c95b71f02d9cba75eb22ed9fb0b57e0fdf722516e50b4e136af897dc
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.789-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=1b2760e87124fc0cf90b338bc6c9b9603a78fff774a4ee8ea640a57e9a5d3bc9
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.933-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=26475e14c95b71f02d9cba75eb22ed9fb0b57e0fdf722516e50b4e136af897dc
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.982-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=1b2760e87124fc0cf90b338bc6c9b9603a78fff774a4ee8ea640a57e9a5d3bc9
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.redis1.stdout.fifo timestamp=2019-12-18T15:40:27.077-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.redis1.stderr.fifo timestamp=2019-12-18T15:40:27.078-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.connect-proxy-example-redis1.stdout.fifo timestamp=2019-12-18T15:40:27.078-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.079-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.connect-proxy-example-redis1.stderr.fifo timestamp=2019-12-18T15:40:27.079-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.124-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=ebdf5b40e60ee4a577c97e84806d27e96e0f8e13326d603ca2bf963caa88d469
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.198-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=625b58f8da98cd4e97b2d9a9803d20d801e3e2a69c6808d3daa23ad64392a820
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.326-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=ebdf5b40e60ee4a577c97e84806d27e96e0f8e13326d603ca2bf963caa88d469
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.393-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=625b58f8da98cd4e97b2d9a9803d20d801e3e2a69c6808d3daa23ad64392a820
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.732-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 error="container stopped"
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.737-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 error="container stopped"
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.749-0600 [INFO] client.gc: marking allocation for GC: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.connect-proxy-example-redis1.stdout.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.connect-proxy-example-redis1.stderr.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.redis1.stdout.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.redis1.stderr.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.539-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=85b298db651344f50999b199a28260b57376211aea2ea2bf2efe4b177488516c
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.603-0600 [INFO] client.driver_mgr.docker: created container: driver=docker container_id=304271a6074ee19e068579bc67dccab7ffa29d104c9748ea6e7b6733c62c3f83
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.697-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=85b298db651344f50999b199a28260b57376211aea2ea2bf2efe4b177488516c
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.772-0600 [INFO] client.driver_mgr.docker: started container: driver=docker container_id=304271a6074ee19e068579bc67dccab7ffa29d104c9748ea6e7b6733c62c3f83
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.861-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 error="container stopped"
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.920-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 error="container stopped"
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.928-0600 [INFO] client.gc: marking allocation for GC: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed
Nothing of interest was logged on the servers during this time.
Hi @cjonesy and thanks for reporting this! I was able to reproduce using these job files in a local development environment.
After running the job update:
▶ nomad job status example
ID = example
Name = example
Submit Date = 2019-12-19T09:09:47-05:00
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
redis1 0 0 1 0 1 0
redis2 0 0 1 0 0 0
Latest Deployment
ID = e8799bae
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
redis1 1 1 0 0 2019-12-19T14:19:47Z
redis2 1 1 0 0 2019-12-19T14:19:47Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
270c6c11 a804c158 redis1 1 run running 10s ago 8s ago
c95b4e06 a804c158 redis1 0 stop complete 2m13s ago 9s ago
eb69149c a804c158 redis2 1 run running 2m13s ago 10s ago
▶ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d53063fd75b1 envoyproxy/envoy:v1.11.2 "/docker-entrypoint.…" 35 seconds ago Up 34 seconds connect-proxy-example-redis1-270c6c11-e057-565e-fa40-a86a3e6bf670
9da1af1c2135 redis:5.0.5 "docker-entrypoint.s…" 35 seconds ago Up 34 seconds redis1-270c6c11-e057-565e-fa40-a86a3e6bf670
0903c5986424 gcr.io/google_containers/pause-amd64:3.0 "/pause" 36 seconds ago Up 34 seconds nomad_init_270c6c11-e057-565e-fa40-a86a3e6bf670
446990b5a95a redis:5.0.5 "docker-entrypoint.s…" 2 minutes ago Up 2 minutes redis2-eb69149c-0a59-f550-05b9-f021a8158de0
43ea68684b33 envoyproxy/envoy:v1.11.2 "/docker-entrypoint.…" 2 minutes ago Up 2 minutes connect-proxy-example-redis2-eb69149c-0a59-f550-05b9-f021a8158de0
fb589efa3afc gcr.io/google_containers/pause-amd64:3.0 "/pause" 2 minutes ago Up 2 minutes nomad_init_eb69149c-0a59-f550-05b9-f021a8158de0
After running the job stop:
▶ nomad job status example
ID = example
Name = example
Submit Date = 2019-12-19T09:09:47-05:00
Type = service
Priority = 50
Datacenters = dc1
Status = dead (stopped)
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
redis1 0 0 0 0 2 0
redis2 0 0 1 0 0 0
Latest Deployment
ID = e8799bae
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
redis1 1 1 1 0 2019-12-19T14:19:59Z
redis2 1 1 1 0 2019-12-19T14:19:57Z
Allocations
ID Node ID Task Group Version Desired Status Created Modified
270c6c11 a804c158 redis1 1 stop complete 1m58s ago 45s ago
c95b4e06 a804c158 redis1 0 stop complete 4m1s ago 1m57s ago
eb69149c a804c158 redis2 1 stop running 4m1s ago 45s ago
▶ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
446990b5a95a redis:5.0.5 "docker-entrypoint.s…" 3 minutes ago Up 3 minutes redis2-eb69149c-0a59-f550-05b9-f021a8158de0
43ea68684b33 envoyproxy/envoy:v1.11.2 "/docker-entrypoint.…" 3 minutes ago Up 3 minutes connect-proxy-example-redis2-eb69149c-0a59-f550-05b9-f021a8158de0
fb589efa3afc gcr.io/google_containers/pause-amd64:3.0 "/pause" 3 minutes ago Up 3 minutes nomad_init_eb69149c-0a59-f550-05b9-f021a8158de0
Relevant window of the logs:
2019-12-19T14:10:59.805Z [DEBUG] http: request complete: method=DELETE path=/v1/job/example?purge=false duration=467.067µs
2019-12-19T14:10:59.805Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a job_id=example namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 2)
Desired Changes for "redis1": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "redis2": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)"
2019-12-19T14:10:59.806Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a duration=85.613µs
2019-12-19T14:10:59.807Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a/allocations duration=70.672µs
2019-12-19T14:10:59.807Z [DEBUG] worker: submitted plan for evaluation: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a
2019-12-19T14:10:59.807Z [DEBUG] worker.service_sched: setting eval status: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a job_id=example namespace=default status=complete
2019-12-19T14:10:59.807Z [DEBUG] worker: updated evaluation: eval="<Eval "39c2244c-b8b4-5f9a-410f-b01a372d720a" JobID: "example" Namespace: "default">"
2019-12-19T14:10:59.807Z [DEBUG] worker: ack evaluation: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a
2019-12-19T14:10:59.808Z [DEBUG] client: updated allocations: index=43 total=3 pulled=2 filtered=1
2019-12-19T14:10:59.808Z [DEBUG] client: allocation updates: added=0 removed=0 updated=2 ignored=1
2019-12-19T14:10:59.808Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=2 ignored=1 errors=0
2019-12-19T14:10:59.916Z [DEBUG] client: updated allocations: index=45 total=3 pulled=1 filtered=2
2019-12-19T14:10:59.916Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
2019-12-19T14:10:59.916Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
2019-12-19T14:10:59.916Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
2019-12-19T14:10:59.979Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1 error="container stopped"
2019-12-19T14:10:59.982Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/gopath/bin/nomad pid=6262
2019-12-19T14:10:59.982Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
2019-12-19T14:10:59.985Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:72e91d8680d853b874d9aedda3a4b61048630d2043dd490ff36f5b0929f69874 references=3
2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1 path=/opt/gopath/bin/nomad pid=6069
2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1
2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1
2019-12-19T14:11:00.062Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1 error="container stopped"
2019-12-19T14:11:00.064Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/gopath/bin/nomad pid=6229
2019-12-19T14:11:00.064Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
2019-12-19T14:11:00.069Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:63130206b0fa808e4545a0cb4a1f14f6d40b8a7e2e6fda0a31fd326c2ac0971c references=2
2019-12-19T14:11:00.069Z [INFO] client.gc: marking allocation for GC: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670
2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1 path=/opt/gopath/bin/nomad pid=6068
2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1
2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1
2019-12-19T14:11:00.120Z [DEBUG] client: updated allocations: index=46 total=3 pulled=1 filtered=2
2019-12-19T14:11:00.120Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
2019-12-19T14:11:00.120Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
2019-12-19T14:11:00.120Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
2019-12-19T14:11:00.316Z [DEBUG] client: updated allocations: index=47 total=3 pulled=1 filtered=2
2019-12-19T14:11:00.316Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
2019-12-19T14:11:00.316Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
2019-12-19T14:11:00.316Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
2019-12-19T14:11:00.337Z [DEBUG] consul.sync: sync complete: registered_services=0 deregistered_services=2 registered_checks=0 deregistered_checks=0
2019-12-19T14:11:00.519Z [DEBUG] client: updated allocations: index=48 total=3 pulled=1 filtered=2
2019-12-19T14:11:00.519Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
2019-12-19T14:11:00.519Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
2019-12-19T14:11:00.519Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
2019-12-19T14:11:00.813Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a duration=189.924µs
2019-12-19T14:11:00.814Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a/allocations duration=88.981µs
This log entry looks promising for figuring out what happened here: client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
With a bit of debugging I've determined at this point that the task runner loop itself seems to be ok, and the scheduler is sending the update w/ a desired status of stop to the allocation as we'd expect.
The "discarding allocation update" log message is coming from alloc_runner.go#L697. When we get an update for an allocation, we check to make sure we don't have one queued up that's more recent. In this case, we find that there's one queued up with the same modify index as the one we have, which is definitely a clue.
What I've found is that when we submit the update for stop, it's never being dequeued, because we never returned from (ar *allocRunner) handleAllocUpdate for the first update. This explains why we only see the behavior after jobfile changes.
With that information in hand and some trace-level debugging I've narrowed it down to the call into (h *groupServiceHook) Update, which never returns. I suspect that this may be related to https://github.com/hashicorp/nomad/issues/6459 in some way. (cc @shoenig as a heads up)
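To make the failure mode concrete, here is a minimal hypothetical Go sketch (again, not Nomad's real code; names are illustrative) of a serial update loop that calls a hook's Update for each dequeued update. If that hook call never returns, the later stop update stays queued forever, which matches the behavior above:

// Hypothetical sketch of a serial update loop wedged by a blocking hook.
package main

import (
    "fmt"
    "time"
)

type update struct{ desired string }

type groupServiceHook struct{}

// Update simulates a hook call that never returns, as described above.
func (h *groupServiceHook) Update(u update) {
    select {} // blocks forever
}

func main() {
    updates := make(chan update, 1)
    hook := &groupServiceHook{}

    // The runner processes updates strictly one at a time.
    go func() {
        for u := range updates {
            fmt.Println("handling update:", u.desired)
            hook.Update(u) // wedges here on the first update
            fmt.Println("finished update:", u.desired)
        }
    }()

    updates <- update{desired: "run"} // first update (the jobfile change)
    time.Sleep(100 * time.Millisecond)
    updates <- update{desired: "stop"} // queued, but never dequeued
    time.Sleep(500 * time.Millisecond)
    fmt.Println("stop update still queued; the allocation keeps running")
}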
After a quick chat w/ @shoenig I've assigned it to him to dig into further while working on #6459
Just chiming in here and seconding this issue. This happens even with the Consul Connect example with the count-api and count-dashboard, and I've had this happen with other jobs as well.
Not only do allocations not always stop, but the alloc directory in Nomad's data directory remains, which seems to make Nomad re-register the services with Consul as well, making it impossible to get rid of them. I also tried stopping Nomad and forcefully rm -rf'ing that alloc folder on the host, but the /secrets folder in that directory is busy and the command exits with the error "Device or resource busy: .../secrets".
We just ran into this, and it's really bad. The declared/reported state should always match the runtime state... isn't that the point of Nomad?
Instead, we're seeing allocations leak, and even Nomad agent restarts do not remove them. These allocations are not reported in nomad status, you cannot query them with nomad alloc status, they persist through kills, and they are kept around until we forcefully remove the corresponding allocation directories(!!). Also, we're not using Connect, so this is a general Nomad bug.
I had a short look at the implementation, and it seems there are several sources of truth for whether an allocation should exist (bad!), and these leaked allocations are not cleaned up from the BoltDB state properly.
So far I've been a fan of Nomad, but leaking processes... we don't have many agents at the moment, so we can clean up the mess manually, but what will happen when we scale to hundreds of nodes?
Small update: clearing the BoltDB entries (while the agent is stopped, of course) also works, which is less pesky than trying to remove the allocation directories with their busy mounted volumes (those will be picked up by GC later).
The BoltDB file is at <agent_dir>/client/state.db
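If you want to see what is actually sitting in that file before deleting anything, a read-only dump using the go.etcd.io/bbolt library is one option. This is only an inspection sketch; the bucket layout inside state.db is a Nomad-internal detail and may differ between versions, and the agent should be stopped first:

// Read-only dump of the top-level buckets and keys in Nomad's client
// state.db. Purely for inspection; the bucket layout is an internal
// detail of Nomad and may change between versions.
package main

import (
    "fmt"
    "log"
    "os"

    bolt "go.etcd.io/bbolt"
)

func main() {
    if len(os.Args) != 2 {
        log.Fatalf("usage: %s <agent_dir>/client/state.db", os.Args[0])
    }
    // Open read-only so we cannot corrupt the agent's state.
    db, err := bolt.Open(os.Args[1], 0600, &bolt.Options{ReadOnly: true})
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = db.View(func(tx *bolt.Tx) error {
        return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
            fmt.Printf("bucket: %s\n", name)
            return b.ForEach(func(k, v []byte) error {
                fmt.Printf("  key: %s (%d bytes)\n", k, len(v))
                return nil
            })
        })
    })
    if err != nil {
        log.Fatal(err)
    }
}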
I have the same problem, but no access to the nodes themselves. Is there any way to solve this without restarting the client?
@cjonesy I'm able to reproduce this with a version as late as v0.10.5, but not with a version v0.11.0 or later (which coincides with the fix that went in for #6459). Are you able to upgrade to a recent version of Nomad and confirm whether the problem persists?
@shoenig We recently upgraded to 0.11.3 and the problem no longer persists. Sounds like #6459 has indeed fixed the issue.
Thank you!