Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)
tl;dr: processes launched by raw_exec tasks are orphaned when, after a restart of the local Nomad agent, a job stop/update terminates allocations on that host that were live before the restart.
We use Nomad to run a simple guest VM supervisor process via raw_exec. After deploying Nomad 0.9 to a limited set of datacenters, a monitor started firing intermittently, indicating an instance of our supervisor was running without a Nomad executor parent process.
After some digging, we found this was happening to allocations started prior to a restart of the local Nomad agent. A while back, we tweaked our configuration management tooling as a workaround for #4413 so Nomad agents running in client mode would simply restart (regardless of whether TLS key material had been updated). Because of this behaviour, we often have allocations running that are older than the current Nomad agent process.
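The workaround itself is trivial; a minimal sketch, assuming a systemd-managed agent (the unit name is a hypothetical example, not necessarily what our tooling uses):

# Invoked by config management on every run, regardless of whether TLS
# key material actually changed: restart the client-mode agent outright
# rather than attempting a reload (#4413). Allocations keep running
# across the restart, which is why they are often older than the agent.
systemctl restart nomad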
Allocations impacted by this issue have an event log much like this:
Recent Events:
Time Type Description
2019-04-18T18:51:17Z Killed Task successfully killed
2019-04-18T18:51:17Z Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin415396184: connect: connection refused\""
2019-04-18T18:51:12Z Killing Sent interrupt
2019-04-17T18:42:49Z Started Task started by client
2019-04-17T18:42:48Z Task Setup Building Task Directory
2019-04-17T18:42:48Z Received Task received by client
Despite Nomad reporting Task successfully killed, the process launched by the raw_exec driver is still running at this point.
When Nomad reports Sent interrupt, the allocation's process does receive the appropriate signal; our tooling kicks off graceful shutdown/eventual forced termination of our guest VMs, so the orphaned process doesn't stick around for long. (This is why our orphaned process monitor was flapping.)
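A monitor along these lines would catch the same condition (a simplified sketch, not our production check; guest-vm-supervisor is a hypothetical process name standing in for our real supervisor binary):

# Flag any supervisor process that has been reparented to init (PPID 1),
# i.e. one running without a Nomad executor parent.
for pid in $(pgrep -f guest-vm-supervisor); do
  ppid=$(ps -o ppid= -p "${pid}" | tr -d ' ')
  if [ "${ppid}" = "1" ]; then
    echo "orphaned supervisor detected: PID ${pid}"
  fi
done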
While submitting this issue, I realized the same failure mode occurs after restarting the Nomad agent and calling nomad stop [job_name] to stop that job's allocations on the agent's host. That's the case documented below.
/tmp/nomad_config.json:
{
  "datacenter": "macbook",
  "data_dir": "/tmp/nomad_data",
  "disable_update_check": true,
  "enable_syslog": true,
  "leave_on_terminate": true,
  "syslog_facility": "LOCAL0",
  "log_level": "DEBUG",
  "server": {
    "bootstrap_expect": 1,
    "enabled": true
  },
  "client": {
    "enabled": true,
    "options": {
      "driver.whitelist": "raw_exec",
      "driver.raw_exec.enable": "1"
    }
  }
}
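With raw_exec whitelisted and enabled in the client options above, the driver should register as healthy once the agent is up; a quick way to confirm (assuming the agent from the reproduction steps below is already running):

# "Driver Status" in the output should list raw_exec.
nomad node status -self | grep -i driver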
/tmp/testjob.sh:
#!/bin/bash

function log {
  echo "$(date) - testjob - ${NOMAD_GROUP_NAME} - $(echo "${NOMAD_ALLOC_ID}" | cut -f1 -d-) - PID $$ - ${1}" \
    | tee -a /tmp/testjob.log
}

function handleint {
  log "received interrupt but ignoring it"
}

trap handleint INT

SLEEPTIME="5"

while true; do
  log "sleeping for ${SLEEPTIME} seconds..."
  sleep "${SLEEPTIME}"
done
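The trap behaviour can be sanity-checked outside Nomad before running the job (the NOMAD_* variables are unset outside Nomad, so those log fields are simply blank):

# Start the script, confirm SIGINT is ignored, then stop it for real.
chmod +x /tmp/testjob.sh
/tmp/testjob.sh &
TESTPID=$!
sleep 1
kill -INT "${TESTPID}"    # logs "received interrupt but ignoring it" and keeps going
kill -TERM "${TESTPID}"   # SIGTERM is not trapped, so this actually stops the script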
Reproduction steps:

1. Start the agent: nomad agent -config /tmp/nomad_config.json
2. Run the job: nomad run /tmp/testjob.hcl
3. tail -f /tmp/testjob.log to confirm the allocation is live and generating output
4. Restart the Nomad agent
5. nomad stop testjob

/tmp/testjob.log should immediately include a "received interrupt but ignoring it" line from the current allocation, and nomad status [alloc_id] should include a "Sent interrupt" event. A few seconds later, the allocation reports a "Task successfully killed" event following an RPC error with "all SubConns are in TransientFailure", and /tmp/testjob.log continues showing output from the still-running bash script (now with a parent PID of 1) launched by the supposedly-complete allocation:

2019-04-22T14:08:28-07:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin954529704: connect: connection refused\""
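Once nomad stop reports the job as dead, the orphan is easy to confirm (using the BSD-style ps available on macOS, where this repro was run):

# The script should still be running, now reparented to PID 1.
for pid in $(pgrep -f /tmp/testjob.sh); do
  ps -o pid=,ppid=,command= -p "${pid}"
done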
/tmp/testjob.hcl:
job "testjob" {
datacenters = ["macbook"]
type = "service"
update {
stagger = "3s"
max_parallel = 1
}
group "group1" {
task "task1" {
kill_timeout = "30s"
driver = "raw_exec"
config {
command = "/tmp/testjob.sh"
}
resources {
cpu = 50
}
}
}
}
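One detail worth noting in the job file: kill_timeout is 30s, yet in the failing case the Killed event arrives about five seconds after Killing (matching the backoff=5s in the client logs below), so the kill path appears to give up on the executor RPC error rather than waiting out the timeout. Watching the allocation's events during the repro makes this visible (reusing the [alloc_id] placeholder from the steps above):

# Poll the allocation's recent events while stopping the job (Ctrl-C to stop).
while true; do
  nomad status [alloc_id] | tail -n 10
  sleep 1
done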
Nomad agent logs after a Nomad client restart and nomad stop testjob:
2019-04-22T14:43:50.921-0700 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=testjob duration=1.000755ms
2019-04-22T14:43:50.924-0700 [DEBUG] http: request complete: method=GET path=/v1/job/testjob duration=446.263µs
2019-04-22T14:43:50.928-0700 [DEBUG] worker: dequeued evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
2019-04-22T14:43:50.928-0700 [DEBUG] http: request complete: method=DELETE path=/v1/job/testjob?purge=false duration=2.704206ms
2019-04-22T14:43:50.928-0700 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c job_id=testjob namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 1)
Deployment Update for ID "9d3f5652-5912-9e71-32b0-ad1499e8155f": Status "cancelled"; Description "Cancelled because job is stopped"
Desired Changes for "group1": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "group2": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
2019-04-22T14:43:50.930-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c duration=289.389µs
2019-04-22T14:43:50.931-0700 [DEBUG] worker: submitted plan for evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
2019-04-22T14:43:50.931-0700 [DEBUG] worker.service_sched: setting eval status: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c job_id=testjob namespace=default status=complete
2019-04-22T14:43:50.931-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c/allocations duration=169.652µs
2019-04-22T14:43:50.933-0700 [DEBUG] worker: updated evaluation: eval="<Eval "8ca24f07-4e90-a5d5-00ea-47ebfd2be47c" JobID: "testjob" Namespace: "default">"
2019-04-22T14:43:50.933-0700 [DEBUG] worker: ack evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
2019-04-22T14:43:50.933-0700 [DEBUG] client: updated allocations: index=1460 total=5 pulled=5 filtered=0
2019-04-22T14:43:50.933-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=5 ignored=0
2019-04-22T14:43:50.937-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=5 ignored=0 errors=0
2019-04-22T14:43:50.939-0700 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 driver=raw_exec task_name=task1 error="rpc error: code = Unavailable desc = transport is closing"
2019-04-22T14:43:50.940-0700 [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1 backoff=5s error="executor Shutdown failed: rpc error: code = Unavailable desc = transport is closing"
2019-04-22T14:43:50.940-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: error fetching stats of task: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1 error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin630219183: connect: connection refused""
2019-04-22T14:43:51.185-0700 [DEBUG] client: updated allocations: index=1462 total=5 pulled=4 filtered=1
2019-04-22T14:43:51.186-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
2019-04-22T14:43:51.188-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
2019-04-22T14:43:51.926-0700 [DEBUG] client.driver_mgr.raw_exec.executor: reattached plugin process exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 driver=raw_exec task_name=task1
2019-04-22T14:43:51.936-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c duration=230.683µs
2019-04-22T14:43:51.937-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c/allocations duration=92.024µs
2019-04-22T14:43:55.944-0700 [INFO ] client.gc: marking allocation for GC: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6
2019-04-22T14:43:56.186-0700 [DEBUG] client: updated allocations: index=1463 total=5 pulled=4 filtered=1
2019-04-22T14:43:56.186-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
2019-04-22T14:43:56.189-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
2019-04-22T14:43:58.931-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: reattached plugin process exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
2019-04-22T14:43:58.931-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
2019-04-22T14:43:58.932-0700 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
2019-04-22T14:43:59.187-0700 [DEBUG] client: updated allocations: index=1464 total=5 pulled=4 filtered=1
2019-04-22T14:43:59.187-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
2019-04-22T14:43:59.190-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
@cheeseprocedure - We just released 0.9.1-rc1, which resolves this issue; it's available at https://releases.hashicorp.com/nomad/0.9.1-rc1/. It would be great if you could try it out and let us know if it fixes what you saw.
Thanks @preetapan - I've been unable to reproduce these failures on 0.9.1-rc1!