A job task is stuck in Starting state while it seems to be running fine
Nomad v0.9.5-gc-1
Linux machine-1 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Looking at the status of c-1, I see the following:
root@machine-1:~# nomad status
ID    Type     Priority  Status   Submit Date
c-1   service  100       running  2019-11-17T20:45:13Z
This appears to be running according to Nomad
But when I dig deeper, I see that for some reason it's still in Starting - the single task that exists is stuck in Starting:
root@machine-1:~# nomad job status c-1
ID = c-1
Name = c-1
Submit Date = 2019-11-17T20:45:13Z
Type = service
Priority = 100
Datacenters = gc
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
c-1 0 1 0 0 0 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
a24ac3a0 b1dd5a5b c-1 0 run pending 12h55m ago 10h41m ago
The problem is that the container seems to be running fine.
This completely breaks a process in which I wait for my containers to reach the Running state (using the job/:job_id/summary endpoint).
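For reference, the wait logic is roughly the sketch below (the HTTP address is taken from the agent config further down; the jq filter and the 5-second interval are just for illustration):
# Poll the job summary until the c-1 task group reports a running instance
while true; do
  running=$(curl -s http://172.16.0.1:3636/v1/job/c-1/summary | jq '.Summary["c-1"].Running')
  [ "$running" -gt 0 ] && break
  sleep 5
done
With the allocation stuck in Starting, Running never goes above 0 and the wait never completes.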
The c-1 container was running, and the c-1 job is in the state shown above:
{
"Stop": false,
"Region": "moo",
"Namespace": "default",
"ID": "c-1",
"ParentID": "",
"Name": "c-1",
"Type": "service",
"Priority": 100,
"AllAtOnce": false,
"Datacenters": [
"gc"
],
"Constraints": null,
"Affinities": null,
"Spreads": null,
"TaskGroups": [
{
"Name": "c-1",
"Count": 1,
"Update": null,
"Migrate": {
"MaxParallel": 1,
"HealthCheck": "checks",
"MinHealthyTime": 10000000000,
"HealthyDeadline": 300000000000
},
"Constraints": [
{
"LTarget": "${meta.manager}",
"RTarget": "true",
"Operand": "=="
},
{
"LTarget": "${attr.os.signals}",
"RTarget": "SIGINT",
"Operand": "set_contains"
}
],
"RestartPolicy": {
"Attempts": 5,
"Interval": 300000000000,
"Delay": 15000000000,
"Mode": "fail"
},
"Tasks": [
{
"Name": "c-1",
"Driver": "docker",
"User": "",
"Config": {
"args": [],
"logging": [
{
"type": "json-file"
}
],
"memory_soft_limit": true,
"image": "172.16.0.1:5000/c-1:v31_2",
"hostname": null,
"volumes": [
"/etc/moo:/etc/moo"
],
"privileged": false,
"labels": [
{
"image.name": "c-1"
},
{
"service.name": "c-1"
}
],
"ulimit": [],
"cap_add": [
"SYS_PTRACE"
],
"network_mode": "host",
"entrypoint": null,
"command": null
},
"Env": {
"service_name": "c-1"
},
"Services": [
{
"Name": "c-1",
"PortLabel": "",
"AddressMode": "auto",
"Tags": [
"c-1"
],
"CanaryTags": null,
"Checks": null
}
],
"Vault": null,
"Templates": null,
"Constraints": null,
"Affinities": null,
"Resources": {
"CPU": 20,
"MemoryMB": 120,
"DiskMB": 0,
"IOPS": 0,
"Networks": null,
"Devices": null
},
"DispatchPayload": null,
"Meta": null,
"KillTimeout": 15000000000,
"LogConfig": {
"MaxFiles": 10,
"MaxFileSizeMB": 10
},
"Artifacts": null,
"Leader": false,
"ShutdownDelay": 0,
"KillSignal": "SIGINT"
}
],
"EphemeralDisk": {
"Sticky": false,
"SizeMB": 300,
"Migrate": false
},
"Meta": null,
"ReschedulePolicy": {
"Attempts": 0,
"Interval": 0,
"Delay": 10000000000,
"DelayFunction": "exponential",
"MaxDelay": 300000000000,
"Unlimited": true
},
"Affinities": null,
"Spreads": null
}
],
"Update": {
"Stagger": 0,
"MaxParallel": 0,
"HealthCheck": "",
"MinHealthyTime": 0,
"HealthyDeadline": 0,
"ProgressDeadline": 0,
"AutoRevert": false,
"AutoPromote": false,
"Canary": 0
},
"Periodic": null,
"ParameterizedJob": null,
"Dispatched": false,
"Payload": null,
"Meta": null,
"VaultToken": "",
"Status": "running",
"StatusDescription": "",
"Stable": false,
"Version": 0,
"SubmitTime": 1574023513552310002,
"CreateIndex": 54,
"ModifyIndex": 56,
"JobModifyIndex": 54
}
/var/lib/nomad/ansible/setup_nomad/setup_nomad_client.yml:66 ------------------
+ exec nomad agent '--config=/etc/nomad/client.conf'
==> Loaded configuration from /etc/nomad/client.conf
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: 172.16.0.1:3636
Bind Addrs: HTTP: 0.0.0.0:3636
Client: true
Log Level: DEBUG
Region: global (DC: gc)
Server: false
Version: 0.9.5-gc-1
==> Nomad agent started! Log data will stream in below:
2019-11-17T22:58:17.605Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/storage/nomad/plugins
2019-11-17T22:58:17.606Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/storage/nomad/plugins
2019-11-17T22:58:17.606Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/storage/nomad/plugins
2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
...
2019-11-17T22:58:21.858Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2019-11-17T22:58:21.857Z
2019-11-17T22:58:21.860Z [WARN ] client.driver_mgr.docker: failed to reattach to docker logger process: driver=docker error="failed to reattach to docker logger process: Reattachment process not found"
2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/usr/local/bin/nomad args="[/usr/local/bin/nomad docker_logger]"
2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/usr/local/bin/nomad pid=199
2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/usr/local/bin/nomad
2019-11-17T22:58:21.870Z [DEBUG] client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2
2019-11-17T22:58:21.870Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker @module=docker_logger address=/tmp/plugin251495446 network=unix timestamp=2019-11-17T22:58:21.870Z
2019-11-17T22:58:21.871Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2019-11-17T22:58:21.870Z
2019-11-17T22:58:21.873Z [WARN ] client.driver_mgr.docker: failed to reattach to docker logger process: driver=docker error="failed to reattach to docker logger process: Reattachment process not found"
2019-11-17T22:58:22.477Z [WARN ] client.alloc_runner.task_runner.task_hook: failed to reattach to logmon process: alloc_id=a24ac3a0-4dd0-2112-3f28-00fcda222c24 task=c-1 error="Reattachment process not found"
2019-11-17T22:58:22.477Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=a24ac3a0-4dd0-2112-3f28-00fcda222c24 task=c-1 path=/usr/local/bin/nomad args="[/usr/local/bin/nomad logmon]"
2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/usr/local/bin/nomad args="[/usr/local/bin/nomad docker_logger]"
2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/usr/local/bin/nomad pid=212
2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/usr/local/bin/nomad
2019-11-17T22:58:21.886Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker address=/tmp/plugin933158382 network=unix @module=docker_logger timestamp=2019-11-17T22:58:21.886Z
2019-11-17T22:58:21.886Z [DEBUG] client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2
I don't have those.
Thanks @zackiles for reaching out. That's indeed a bug we need to address.
For the example above, would you mind posting the output of nomad alloc status <alloc_id> as well? Were there more logs associated with that task? I was expecting to see log lines like the following:
2019-11-18T08:40:41.397-0500 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 version=2
2019-11-18T08:40:41.397-0500 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon address=/var/folders/r6/346cfqyn76b_lx1nrcl5278c0000gp/T/plugin090993310 network=unix timestamp=2019-11-18T08:40:41.397-0500
2019-11-18T08:40:41.398-0500 [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon path=/tmp/nomad-1/alloc/fba6564e-1076-6bc4-dda8-b124d8c8deea/alloc/logs/.c-1.stdout.fifo timestamp=2019-11-18T08:40:41.398-0500
2019-11-18T08:40:41.398-0500 [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon path=/tmp/nomad-1/alloc/fba6564e-1076-6bc4-dda8-b124d8c8deea/alloc/logs/.c-1.stderr.fifo timestamp=2019-11-18T08:40:41.398-0500
One issue I note from the logs is that Nomad's child processes seem to have died, possibly killed by Ansible while restarting the task. When spinning up a task, Nomad spawns some auxiliary processes (docker_logger, logmon, executor) that are expected to keep running even when the main Nomad process goes down. If these processes die while the Nomad client is down, recovery may be hindered. That doesn't explain the bug, I'm afraid, but I suspect it's a contributing cause. A quick check for those processes is sketched below.
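As a rough way to check whether those auxiliary processes are still alive on the client (a sketch only - the process names are taken from the plugin names in the logs above):
# List any helper processes spawned by the Nomad client that are still running
pgrep -af 'nomad logmon'
pgrep -af 'nomad docker_logger'
pgrep -af 'nomad executor'
If these return nothing while an allocation is supposedly running, the client will fail to reattach, as seen in the "Reattachment process not found" warnings.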
@notnoop - I will try and get the logs.
Can you suggest some way of avoiding this?
Also, to my understanding, one of the main causes is that some peripheral processes don't stick around while the client is down?
Isn't that similar to a complete VM reboot? Doesn't Nomad recover from that?
@notnoop - I was unable to get the logs, and the machine on which it happened no longer exists 😢
Sorry...
Hi @notnoop - does this suffice?
root@gc-machine1:~# gc-nomad alloc status 88d520b9
ID = 88d520b9
Eval ID = 798b0dc4
Name = c1.c1[0]
Node ID = fcf73cfe
Node Name = nomad-client-172.16.0.1
Job ID = elasticsearch
Job Version = 824638472240
Client Status = pending
Client Description = No tasks have started
Desired Status = run
Desired Description = <none>
Created = 12h27m ago
Modified = 10h23m ago
Task "c1" is "pending"
Task Resources
CPU Memory Disk Addresses
14/20 MHz 1.3 GiB/1.5 GiB 300 MiB service_port: 172.16.0.1:9200
Task Events:
Started At = 2019-11-24T20:33:21Z
Finished At = N/A
Total Restarts = 1
Last Restart = 2019-11-24T22:37:15Z
Recent Events:
Time Type Description
2019-11-24T22:37:15Z Restarting Task restarting in 18.375889653s
2019-11-24T22:37:15Z Task hook failed logmon: Unrecognized remote plugin message:
This usually means that the plugin is either invalid or simply
needs to be recompiled to support the latest protocol.
2019-11-24T20:33:21Z Started Task started by client
2019-11-24T20:33:20Z Task Setup Building Task Directory
2019-11-24T20:33:19Z Received Task received by client
The job status output is as follows (this is a reproduction of the above in a new environment):
root@machine-1:~# gc-nomad job status c1
ID = c1
Name = c1
Submit Date = 2019-11-24T20:33:19Z
Type = service
Priority = 90
Datacenters = gc
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
c1 0 1 0 0 0 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
88d520b9 fcf73cfe c1 0 run pending 12h31m ago 10h27m ago
@zonnie that's very useful indeed - I'm very surprised by the "Unrecognized remote plugin message" event. Let me do some further research and reach out if I have any questions. Thanks a lot.
I've encountered similar behavior around restarting client nodes. I'm uncertain what steps I took to trigger it, as it happened at some point during a larger Docker version migration, and I've been unable to reproduce it.
The behavior is similar in that a task appears to be stuck in "starting", while the job has only allocations whose status is "running" with desired "run".
> nomad job status -all-allocs -evals -verbose recommendation-svc-release-7
ID = recommendation-svc-release-7
Name = recommendation-svc-release-7
Submit Date = 2020-04-14T12:00:08+02:00
Type = service
Priority = 50
Datacenters = dev-eu-central-1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
service 0 1 1 0 7 2
Evaluations
ID Priority Triggered By Status Placement Failures
a2d4f3d3-dab9-d959-2446-cb5f92427b6b 50 node-update complete false
999a2d88-10b3-7e68-c247-faec85620f84 50 queued-allocs canceled false
43e7c710-df17-fa5f-a399-78efd8fe397e 50 node-drain complete true
a82f73d9-9b05-f5b6-e1a8-fffd9c964559 50 node-drain complete false
Allocations
ID Eval ID Node ID Node Name Task Group Version Desired Status Created Modified
4a842a44-5302-0061-736d-9c6a9ba45327 a2d4f3d3-dab9-d959-2446-cb5f92427b6b cc066f7d-f91c-fc99-ca02-3c2d521e84b5 i-0fec4e4281032614f service 3 run running 2020-04-27T11:00:37+02:00 2020-04-27T11:00:59+02:00
391e4ed9-bab1-b6a2-0d7e-87a164b4a382 a82f73d9-9b05-f5b6-e1a8-fffd9c964559 f6423ad0-1de8-509f-3a68-a6beaaff7f54 i-035a6f13756ba563e service 3 stop complete 2020-04-24T11:45:57+02:00 2020-04-27T10:58:42+02:00
In the above output, the job has been restarted twice, with one queued alloc failing due to resource exhaustion after I experimentally drained some nodes. The behavior - one allocation stuck in "starting" in the task group summary - began before then and persisted after. The job is also configured to run only one allocation.
Since all of my allocations show "running" rather than "pending", I'm not sure whether my issue is related to the above at all.
We've found that we see this issue particularly on nodes that host a high number of allocations or that are otherwise fairly busy. Allocation-count-wise, a teammate of mine was able to replicate the issue when there were more than 24 allocations on a given node. Marking the node as ineligible and stopping the particular allocation so that it moves to a node with fewer allocations has been our way of getting the allocation into a better place, roughly as sketched below.
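For reference, that workaround boils down to something like the following (node and allocation IDs are placeholders):
# Prevent new placements on the busy node, then force the stuck alloc to reschedule
nomad node eligibility -disable <node-id>
nomad alloc stop <alloc-id>
# Re-enable scheduling on the node once the alloc has landed elsewhere
nomad node eligibility -enable <node-id>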
@notnoop Any luck looking into the logmon "Unrecognized remote plugin message" error? I have been running into it for the past ~6 weeks as well.