Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)
tl;dr: processes launched by raw_exec tasks are orphaned when, after a restart of the local Nomad agent, a job stop/update terminates allocations on that host that were live before the restart.
We use Nomad to run a simple guest VM supervisor process via raw_exec. After deploying Nomad 0.9 to a limited set of datacenters, a monitor started firing intermittently, indicating an instance of our supervisor was running without a Nomad executor parent process.
After some digging, we found this was happening to allocations started prior to a restart of the local Nomad agent. A while back, we tweaked our configuration management tooling as a workaround for #4413 so Nomad agents running in client mode would simply restart (regardless of whether TLS key material had been updated). Because of this behaviour, we often have allocations running that are older than the current Nomad agent process.
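The workaround itself is trivial; a minimal sketch, assuming a systemd-managed agent (the unit name is a hypothetical example, not necessarily what our tooling uses):

# Invoked by config management on every run, regardless of whether TLS
# key material actually changed: restart the client-mode agent outright
# rather than attempting a reload (#4413). Allocations keep running
# across the restart, which is why they are often older than the agent.
systemctl restart nomad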
Allocations impacted by this issue have an event log much like this:
Recent Events:
Time Type Description
2019-04-18T18:51:17Z Killed Task successfully killed
2019-04-18T18:51:17Z Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin415396184: connect: connection refused\""
2019-04-18T18:51:12Z Killing Sent interrupt
2019-04-17T18:42:49Z Started Task started by client
2019-04-17T18:42:48Z Task Setup Building Task Directory
2019-04-17T18:42:48Z Received Task received by client
Despite Nomad reporting Task successfully killed, the process launched by the raw_exec driver is still running at this point.
When Nomad reports Sent interrupt, the allocation's process does receive the appropriate signal; our tooling kicks off graceful shutdown/eventual forced termination of our guest VMs, so the orphaned process doesn't stick around for long. (This is why our orphaned process monitor was flapping.)
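A monitor along these lines would catch the same condition (a simplified sketch, not our production check; guest-vm-supervisor is a hypothetical process name standing in for our real supervisor binary):

# Flag any supervisor process that has been reparented to init (PPID 1),
# i.e. one running without a Nomad executor parent.
for pid in $(pgrep -f guest-vm-supervisor); do
  ppid=$(ps -o ppid= -p "${pid}" | tr -d ' ')
  if [ "${ppid}" = "1" ]; then
    echo "orphaned supervisor detected: PID ${pid}"
  fi
done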
While submitting this issue, I realized the same failure mode occurs after restarting the Nomad agent and calling nomad stop [job_name] to stop that job's allocations on the agent's host. That's the case documented below.
/tmp/nomad_config.json:
{
  "datacenter": "macbook",
  "data_dir": "/tmp/nomad_data",
  "disable_update_check": true,
  "enable_syslog": true,
  "leave_on_terminate": true,
  "syslog_facility": "LOCAL0",
  "log_level": "DEBUG",
  "server": {
    "bootstrap_expect": 1,
    "enabled": true
  },
  "client": {
    "enabled": true,
    "options": {
      "driver.whitelist": "raw_exec",
      "driver.raw_exec.enable": "1"
    }
  }
}
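With raw_exec whitelisted and enabled in the client options above, the driver should register as healthy once the agent is up; a quick way to confirm (assuming the agent from the reproduction steps below is already running):

# "Driver Status" in the output should list raw_exec.
nomad node status -self | grep -i driver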
/tmp/testjob.sh:
#!/bin/bash

function log {
  echo "$(date) - testjob - ${NOMAD_GROUP_NAME} - $(echo "${NOMAD_ALLOC_ID}" | cut -f1 -d-) - PID $$ - ${1}" \
    | tee -a /tmp/testjob.log
}

function handleint {
  log "received interrupt but ignoring it"
}

trap handleint INT

SLEEPTIME="5"

while true; do
  log "sleeping for ${SLEEPTIME} seconds..."
  sleep "${SLEEPTIME}"
done
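The trap behaviour can be sanity-checked outside Nomad before running the job (the NOMAD_* variables are unset outside Nomad, so those log fields are simply blank):

# Start the script, confirm SIGINT is ignored, then stop it for real.
chmod +x /tmp/testjob.sh
/tmp/testjob.sh &
TESTPID=$!
sleep 1
kill -INT "${TESTPID}"    # logs "received interrupt but ignoring it" and keeps going
kill -TERM "${TESTPID}"   # SIGTERM is not trapped, so this actually stops the script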
Reproduction steps:

1. Start the agent: nomad agent -config /tmp/nomad_config.json
2. Run the job: nomad run /tmp/testjob.hcl
3. tail -f /tmp/testjob.log to confirm the allocation is live and generating output
4. Restart the Nomad agent
5. nomad stop testjob

/tmp/testjob.log should immediately include a "received interrupt but ignoring it" line from the current allocation, and nomad status [alloc_id] should include a "Sent interrupt" event. A few seconds later, the allocation reports a "Task successfully killed" event following an RPC error with "all SubConns are in TransientFailure", and /tmp/testjob.log continues showing output from the still-running bash script (now with a parent PID of 1) launched by the supposedly-complete allocation:

2019-04-22T14:08:28-07:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin954529704: connect: connection refused\""
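Once nomad stop reports the job as dead, the orphan is easy to confirm (using the BSD-style ps available on macOS, where this repro was run):

# The script should still be running, now reparented to PID 1.
for pid in $(pgrep -f /tmp/testjob.sh); do
  ps -o pid=,ppid=,command= -p "${pid}"
done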
/tmp/testjob.hcl:
job "testjob" {
datacenters = ["macbook"]
type = "service"
update {
stagger = "3s"
max_parallel = 1
}
group "group1" {
task "task1" {
kill_timeout = "30s"
driver = "raw_exec"
config {
command = "/tmp/testjob.sh"
}
resources {
cpu = 50
}
}
}
}
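One detail worth noting in the job file: kill_timeout is 30s, yet in the failing case the Killed event arrives about five seconds after Killing (matching the backoff=5s in the client logs below), so the kill path appears to give up on the executor RPC error rather than waiting out the timeout. Watching the allocation's events during the repro makes this visible (reusing the [alloc_id] placeholder from the steps above):

# Poll the allocation's recent events while stopping the job (Ctrl-C to stop).
while true; do
  nomad status [alloc_id] | tail -n 10
  sleep 1
done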
Nomad agent logs after a Nomad client restart and nomad stop testjob:
2019-04-22T14:43:50.921-0700 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=testjob duration=1.000755ms
2019-04-22T14:43:50.924-0700 [DEBUG] http: request complete: method=GET path=/v1/job/testjob duration=446.263µs
2019-04-22T14:43:50.928-0700 [DEBUG] worker: dequeued evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
2019-04-22T14:43:50.928-0700 [DEBUG] http: request complete: method=DELETE path=/v1/job/testjob?purge=false duration=2.704206ms
2019-04-22T14:43:50.928-0700 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c job_id=testjob namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 1)
Deployment Update for ID "9d3f5652-5912-9e71-32b0-ad1499e8155f": Status "cancelled"; Description "Cancelled because job is stopped"
Desired Changes for "group1": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "group2": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
2019-04-22T14:43:50.930-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c duration=289.389µs
2019-04-22T14:43:50.931-0700 [DEBUG] worker: submitted plan for evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
2019-04-22T14:43:50.931-0700 [DEBUG] worker.service_sched: setting eval status: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c job_id=testjob namespace=default status=complete
2019-04-22T14:43:50.931-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c/allocations duration=169.652µs
2019-04-22T14:43:50.933-0700 [DEBUG] worker: updated evaluation: eval="<Eval "8ca24f07-4e90-a5d5-00ea-47ebfd2be47c" JobID: "testjob" Namespace: "default">"
2019-04-22T14:43:50.933-0700 [DEBUG] worker: ack evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
2019-04-22T14:43:50.933-0700 [DEBUG] client: updated allocations: index=1460 total=5 pulled=5 filtered=0
2019-04-22T14:43:50.933-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=5 ignored=0
2019-04-22T14:43:50.937-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=5 ignored=0 errors=0
2019-04-22T14:43:50.939-0700 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 driver=raw_exec task_name=task1 error="rpc error: code = Unavailable desc = transport is closing"
2019-04-22T14:43:50.940-0700 [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1 backoff=5s error="executor Shutdown failed: rpc error: code = Unavailable desc = transport is closing"
2019-04-22T14:43:50.940-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: error fetching stats of task: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1 error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin630219183: connect: connection refused""
2019-04-22T14:43:51.185-0700 [DEBUG] client: updated allocations: index=1462 total=5 pulled=4 filtered=1
2019-04-22T14:43:51.186-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
2019-04-22T14:43:51.188-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
2019-04-22T14:43:51.926-0700 [DEBUG] client.driver_mgr.raw_exec.executor: reattached plugin process exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 driver=raw_exec task_name=task1
2019-04-22T14:43:51.936-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c duration=230.683µs
2019-04-22T14:43:51.937-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c/allocations duration=92.024µs
2019-04-22T14:43:55.944-0700 [INFO ] client.gc: marking allocation for GC: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6
2019-04-22T14:43:56.186-0700 [DEBUG] client: updated allocations: index=1463 total=5 pulled=4 filtered=1
2019-04-22T14:43:56.186-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
2019-04-22T14:43:56.189-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
2019-04-22T14:43:58.931-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: reattached plugin process exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
2019-04-22T14:43:58.931-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
2019-04-22T14:43:58.932-0700 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
2019-04-22T14:43:59.187-0700 [DEBUG] client: updated allocations: index=1464 total=5 pulled=4 filtered=1
2019-04-22T14:43:59.187-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
2019-04-22T14:43:59.190-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
@cheeseprocedure - We just released 0.9.1-rc1, which resolves this issue; it's available at https://releases.hashicorp.com/nomad/0.9.1-rc1/. It would be great if you could try it out and let us know if it fixes what you saw.
Thanks @preetapan - I've been unable to reproduce these failures on 0.9.1-rc1!