Nomad: [v0.9.5] Job tasks are stuck in Starting but are actually running

Created on 18 Nov 2019 · 8 comments · Source: hashicorp/nomad

A job task is stuck in Starting state while it seems to be running fine

Nomad version

Nomad v0.9.5-gc-1

Operating system and Environment details

Linux machine-1 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Issue

Looking at the status of c-1, I see the following:

root@machine-1:~# nomad status
ID   Type     Priority  Status   Submit Date
c-1  service  100       running  2019-11-17T20:45:13Z

According to Nomad, this job appears to be running.

But when I dig deeper, the single task that exists is, for some reason, still in Starting:

root@machine-1:~# nomad job status c-1
ID            = c-1
Name          = c-1
Submit Date   = 2019-11-17T20:45:13Z
Type          = service
Priority      = 100
Datacenters   = gc
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
c-1         0       1         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
a24ac3a0  b1dd5a5b  c-1         0        run      pending  12h55m ago  10h41m ago

The problem is that the container itself seems to be running fine.
This completely breaks a process of mine that waits for containers to reach the Running state (using the job/<job-id>/summary endpoint).
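
For reference, the kind of wait loop that breaks here looks roughly like the following. This is only an illustrative sketch using the official Go API client (github.com/hashicorp/nomad/api); the function name, counts, and timeout are made up for the example, not taken from the actual process:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

// waitForRunning polls /v1/job/<job-id>/summary until no task group is
// still Starting and the expected number of tasks are Running.
func waitForRunning(jobID string, want int, timeout time.Duration) error {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		return err
	}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		summary, _, err := client.Jobs().Summary(jobID, nil)
		if err != nil {
			return err
		}
		running, starting := 0, 0
		for _, tg := range summary.Summary {
			running += tg.Running
			starting += tg.Starting
		}
		// With the behavior described above, Starting never drops to zero
		// even though the container is up, so this loop times out.
		if starting == 0 && running >= want {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("job %q not fully running after %s", jobID, timeout)
}

func main() {
	if err := waitForRunning("c-1", 1, 10*time.Minute); err != nil {
		log.Fatal(err)
	}
}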

Reproduction steps

  • It seems that c-1 was running
  • Then the Nomad client was restarted (on purpose)
  • From that point onward, c-1 has been in the state shown above

Job file (if appropriate)

{
    "Stop": false,
    "Region": "moo",
    "Namespace": "default",
    "ID": "c-1",
    "ParentID": "",
    "Name": "c-1",
    "Type": "service",
    "Priority": 100,
    "AllAtOnce": false,
    "Datacenters": [
        "gc"
    ],
    "Constraints": null,
    "Affinities": null,
    "Spreads": null,
    "TaskGroups": [
        {
            "Name": "c-1",
            "Count": 1,
            "Update": null,
            "Migrate": {
                "MaxParallel": 1,
                "HealthCheck": "checks",
                "MinHealthyTime": 10000000000,
                "HealthyDeadline": 300000000000
            },
            "Constraints": [
                {
                    "LTarget": "${meta.manager}",
                    "RTarget": "true",
                    "Operand": "=="
                },
                {
                    "LTarget": "${attr.os.signals}",
                    "RTarget": "SIGINT",
                    "Operand": "set_contains"
                }
            ],
            "RestartPolicy": {
                "Attempts": 5,
                "Interval": 300000000000,
                "Delay": 15000000000,
                "Mode": "fail"
            },
            "Tasks": [
                {
                    "Name": "c-1",
                    "Driver": "docker",
                    "User": "",
                    "Config": {
                        "args": [],
                        "logging": [
                            {
                                "type": "json-file"
                            }
                        ],
                        "memory_soft_limit": true,
                        "image": "172.16.0.1:5000/c-1:v31_2",
                        "hostname": null,
                        "volumes": [
                            "/etc/moo:/etc/moo"
                        ],
                        "privileged": false,
                        "labels": [
                            {
                                "image.name": "c-1"
                            },
                            {
                                "service.name": "c-1"
                            }
                        ],
                        "ulimit": [],
                        "cap_add": [
                            "SYS_PTRACE"
                        ],
                        "network_mode": "host",
                        "entrypoint": null,
                        "command": null
                    },
                    "Env": {
                        "service_name": "c-1"
                    },
                    "Services": [
                        {
                            "Name": "c-1",
                            "PortLabel": "",
                            "AddressMode": "auto",
                            "Tags": [
                                "c-1"
                            ],
                            "CanaryTags": null,
                            "Checks": null
                        }
                    ],
                    "Vault": null,
                    "Templates": null,
                    "Constraints": null,
                    "Affinities": null,
                    "Resources": {
                        "CPU": 20,
                        "MemoryMB": 120,
                        "DiskMB": 0,
                        "IOPS": 0,
                        "Networks": null,
                        "Devices": null
                    },
                    "DispatchPayload": null,
                    "Meta": null,
                    "KillTimeout": 15000000000,
                    "LogConfig": {
                        "MaxFiles": 10,
                        "MaxFileSizeMB": 10
                    },
                    "Artifacts": null,
                    "Leader": false,
                    "ShutdownDelay": 0,
                    "KillSignal": "SIGINT"
                }
            ],
            "EphemeralDisk": {
                "Sticky": false,
                "SizeMB": 300,
                "Migrate": false
            },
            "Meta": null,
            "ReschedulePolicy": {
                "Attempts": 0,
                "Interval": 0,
                "Delay": 10000000000,
                "DelayFunction": "exponential",
                "MaxDelay": 300000000000,
                "Unlimited": true
            },
            "Affinities": null,
            "Spreads": null
        }
    ],
    "Update": {
        "Stagger": 0,
        "MaxParallel": 0,
        "HealthCheck": "",
        "MinHealthyTime": 0,
        "HealthyDeadline": 0,
        "ProgressDeadline": 0,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
    },
    "Periodic": null,
    "ParameterizedJob": null,
    "Dispatched": false,
    "Payload": null,
    "Meta": null,
    "VaultToken": "",
    "Status": "running",
    "StatusDescription": "",
    "Stable": false,
    "Version": 0,
    "SubmitTime": 1574023513552310002,
    "CreateIndex": 54,
    "ModifyIndex": 56,
    "JobModifyIndex": 54
}

Nomad Client logs (if appropriate)

/var/lib/nomad/ansible/setup_nomad/setup_nomad_client.yml:66 ------------------
+ exec nomad agent '--config=/etc/nomad/client.conf'
==> Loaded configuration from /etc/nomad/client.conf
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 172.16.0.1:3636
            Bind Addrs: HTTP: 0.0.0.0:3636
                Client: true
             Log Level: DEBUG
                Region: global (DC: gc)
                Server: false
               Version: 0.9.5-gc-1

==> Nomad agent started! Log data will stream in below:

    2019-11-17T22:58:17.605Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/storage/nomad/plugins
    2019-11-17T22:58:17.606Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/storage/nomad/plugins
    2019-11-17T22:58:17.606Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/storage/nomad/plugins
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2019-11-17T22:58:17.606Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0

...

    2019-11-17T22:58:21.858Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2019-11-17T22:58:21.857Z
    2019-11-17T22:58:21.860Z [WARN ] client.driver_mgr.docker: failed to reattach to docker logger process: driver=docker error="failed to reattach to docker logger process: Reattachment process not found"
    2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/usr/local/bin/nomad args="[/usr/local/bin/nomad docker_logger]"
    2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/usr/local/bin/nomad pid=199
    2019-11-17T22:58:21.860Z [DEBUG] client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/usr/local/bin/nomad
    2019-11-17T22:58:21.870Z [DEBUG] client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2
    2019-11-17T22:58:21.870Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker @module=docker_logger address=/tmp/plugin251495446 network=unix timestamp=2019-11-17T22:58:21.870Z
    2019-11-17T22:58:21.871Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2019-11-17T22:58:21.870Z
    2019-11-17T22:58:21.873Z [WARN ] client.driver_mgr.docker: failed to reattach to docker logger process: driver=docker error="failed to reattach to docker logger process: Reattachment process not found"
    2019-11-17T22:58:22.477Z [WARN ] client.alloc_runner.task_runner.task_hook: failed to reattach to logmon process: alloc_id=a24ac3a0-4dd0-2112-3f28-00fcda222c24 task=c-1 error="Reattachment process not found"
    2019-11-17T22:58:22.477Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=a24ac3a0-4dd0-2112-3f28-00fcda222c24 task=c-1 path=/usr/local/bin/nomad args="[/usr/local/bin/nomad logmon]"
    2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/usr/local/bin/nomad args="[/usr/local/bin/nomad docker_logger]"
    2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/usr/local/bin/nomad pid=212
    2019-11-17T22:58:21.873Z [DEBUG] client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/usr/local/bin/nomad
    2019-11-17T22:58:21.886Z [DEBUG] client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker address=/tmp/plugin933158382 network=unix @module=docker_logger timestamp=2019-11-17T22:58:21.886Z
    2019-11-17T22:58:21.886Z [DEBUG] client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2

Nomad Server logs (if appropriate)

Don't have those

Labels: theme/client, theme/client-restart, type/bug

All 8 comments

Thanks @zackiles for reaching out. That's indeed a bug we must address.

For the example above, would you mind posting the output of nomad alloc status <alloc_id> as well? Were there more logs associated with that task? I was expecting some log lines like the following:

    2019-11-18T08:40:41.397-0500 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 version=2
    2019-11-18T08:40:41.397-0500 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon address=/var/folders/r6/346cfqyn76b_lx1nrcl5278c0000gp/T/plugin090993310 network=unix timestamp=2019-11-18T08:40:41.397-0500
    2019-11-18T08:40:41.398-0500 [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon path=/tmp/nomad-1/alloc/fba6564e-1076-6bc4-dda8-b124d8c8deea/alloc/logs/.c-1.stdout.fifo timestamp=2019-11-18T08:40:41.398-0500
    2019-11-18T08:40:41.398-0500 [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=fba6564e-1076-6bc4-dda8-b124d8c8deea task=c-1 @module=logmon path=/tmp/nomad-1/alloc/fba6564e-1076-6bc4-dda8-b124d8c8deea/alloc/logs/.c-1.stderr.fifo timestamp=2019-11-18T08:40:41.398-0500

One issue I note from the logs is that Nomad's child processes seem to have died, possibly killed by Ansible during the restart. When spinning up a task, Nomad starts some auxiliary processes (docker_logger, logmon, executor) that are expected to keep running even when the main Nomad process is not. If these processes die while the Nomad client is down, recovery may be hindered. That doesn't explain the bug, I'm afraid, but I suspect it's a contributing cause.
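
For background, on restart the client tries to reattach to those auxiliary processes via hashicorp/go-plugin reattach information persisted before the restart. Below is a much-simplified sketch of that path (this is not the actual task runner code; the socket path and PID are borrowed from the logs above purely for illustration). When the referenced process is gone, the reattach fails with the "Reattachment process not found" error visible in your logs, and a fresh plugin process has to be started:

package main

import (
	"log"
	"net"

	"github.com/hashicorp/go-plugin"
)

func main() {
	// Reattach information persisted by the client before it was restarted:
	// the unix socket and PID of the still-running docker_logger/logmon.
	addr, err := net.ResolveUnixAddr("unix", "/tmp/plugin933158382")
	if err != nil {
		log.Fatal(err)
	}

	client := plugin.NewClient(&plugin.ClientConfig{
		HandshakeConfig:  plugin.HandshakeConfig{ProtocolVersion: 1},
		Plugins:          map[string]plugin.Plugin{},
		Reattach:         &plugin.ReattachConfig{Protocol: plugin.ProtocolGRPC, Addr: addr, Pid: 212},
		AllowedProtocols: []plugin.Protocol{plugin.ProtocolGRPC},
	})
	defer client.Kill()

	// If the process (here PID 212) died while the agent was down, this
	// fails with "Reattachment process not found" and the caller must start
	// a new plugin process instead of recovering the old one.
	if _, err := client.Client(); err != nil {
		log.Printf("reattach failed: %v", err)
	}
}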

@notnoop - I will try to get the logs.
Can you suggest some way of avoiding this?
Also, to my understanding, one of the main causes is that some peripheral processes don't stick around while the client is down?
Isn't that similar to a complete VM reboot? Don't you recover from that?

@notnoop - I was unable to get the logs, and the machine on which it happened no longer exists, unfortunately.
Sorry...

Hi @notnoop - does this suffice?

root@gc-machine1:~# gc-nomad alloc status 88d520b9
ID                  = 88d520b9
Eval ID             = 798b0dc4
Name                = c1.c1[0]
Node ID             = fcf73cfe
Node Name           = nomad-client-172.16.0.1
Job ID              = elasticsearch
Job Version         = 824638472240
Client Status       = pending
Client Description  = No tasks have started
Desired Status      = run
Desired Description = <none>
Created             = 12h27m ago
Modified            = 10h23m ago

Task "c1" is "pending"
Task Resources
CPU        Memory           Disk     Addresses
14/20 MHz  1.3 GiB/1.5 GiB  300 MiB  service_port: 172.16.0.1:9200

Task Events:
Started At     = 2019-11-24T20:33:21Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2019-11-24T22:37:15Z

Recent Events:
Time                  Type              Description
2019-11-24T22:37:15Z  Restarting        Task restarting in 18.375889653s
2019-11-24T22:37:15Z  Task hook failed  logmon: Unrecognized remote plugin message:

This usually means that the plugin is either invalid or simply
needs to be recompiled to support the latest protocol.
2019-11-24T20:33:21Z  Started           Task started by client
2019-11-24T20:33:20Z  Task Setup        Building Task Directory
2019-11-24T20:33:19Z  Received          Task received by client

The job output is as follows (this is a reproduction of the above in a new environment):

root@machine-1:~# gc-nomad job status c1
ID            = c1
Name          = c1
Submit Date   = 2019-11-24T20:33:19Z
Type          = service
Priority      = 90
Datacenters   = gc
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
c1          0       1         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
88d520b9  fcf73cfe  c1          0        run      pending  12h31m ago  10h27m ago

@zonnie that's very useful indeed - I'm very surprised by the "Unrecognized remote plugin message" event. Let me do some further research and I'll reach out if I have any questions. Thanks a lot.

I've encountered similar behavior around restarting client nodes. I'm uncertain which exact steps led to it, as it happened sometime during a larger Docker version migration, and I've been unable to reproduce it.

The behavior is similar in that a task appears stuck in "starting", while the only allocations it has are of status "running" with desired "run".

> nomad job status -all-allocs -evals -verbose recommendation-svc-release-7
ID            = recommendation-svc-release-7
Name          = recommendation-svc-release-7
Submit Date   = 2020-04-14T12:00:08+02:00
Type          = service
Priority      = 50
Datacenters   = dev-eu-central-1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
service     0       1         1        0       7         2

Evaluations
ID                                    Priority  Triggered By   Status    Placement Failures
a2d4f3d3-dab9-d959-2446-cb5f92427b6b  50        node-update    complete  false
999a2d88-10b3-7e68-c247-faec85620f84  50        queued-allocs  canceled  false
43e7c710-df17-fa5f-a399-78efd8fe397e  50        node-drain     complete  true
a82f73d9-9b05-f5b6-e1a8-fffd9c964559  50        node-drain     complete  false

Allocations
ID                                    Eval ID                               Node ID                               Node Name            Task Group  Version  Desired  Status    Created                    Modified
4a842a44-5302-0061-736d-9c6a9ba45327  a2d4f3d3-dab9-d959-2446-cb5f92427b6b  cc066f7d-f91c-fc99-ca02-3c2d521e84b5  i-0fec4e4281032614f  service     3        run      running   2020-04-27T11:00:37+02:00  2020-04-27T11:00:59+02:00
391e4ed9-bab1-b6a2-0d7e-87a164b4a382  a82f73d9-9b05-f5b6-e1a8-fffd9c964559  f6423ad0-1de8-509f-3a68-a6beaaff7f54  i-035a6f13756ba563e  service     3        stop     complete  2020-04-24T11:45:57+02:00  2020-04-27T10:58:42+02:00

In the above output, the job has been restarted twice, with one queued alloc failing due to resource exhaustion after I experimentally drained some nodes. The behavior - one "starting" in the task group - began before then and persisted after. The job is also configured to run only one allocation.

Since all of my allocations are "running" rather than "pending", I'm not sure whether my issue is related to the above at all.

We've found that we see this issue particularly on nodes with a high allocation count, or when the node itself is fairly busy. Allocation-count-wise, a teammate of mine was able to replicate the issue when there were more than 24 allocations on a given node. Our way of getting the allocation into a better place has been to mark the node as ineligible and stop the particular allocation so that it moves to a node with fewer allocations.
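
For anyone scripting that workaround, it corresponds roughly to nomad node eligibility -disable <node-id> followed by nomad alloc stop <alloc-id>, or to the following sketch with the Go API client (github.com/hashicorp/nomad/api); the IDs are placeholders and this is illustrative only:

package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	nodeID := "<busy-node-id>"    // placeholder: the node with too many allocations
	allocID := "<stuck-alloc-id>" // placeholder: the allocation stuck in Starting

	// Mark the busy node ineligible so the replacement allocation is
	// scheduled onto a less loaded node.
	if _, _, err := client.Nodes().ToggleEligibility(nodeID, false, nil); err != nil {
		log.Fatal(err)
	}

	// Stop the stuck allocation; the scheduler then places a new one on an
	// eligible node.
	alloc, _, err := client.Allocations().Info(allocID, nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := client.Allocations().Stop(alloc, nil); err != nil {
		log.Fatal(err)
	}
}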

@notnoop Any luck looking into the "logmon: Unrecognized remote plugin message" error? I have been running into it for the past ~6 weeks as well.
