Using Nomad 0.9.0-beta3 here.
I have a haproxy task that I'm trying to restart on template changes (rendered by Consul Template), but the entire task/group fails with:
`Template failed to send signals [user defined signal 1]: 1 error(s) occurred: * Task not running`
And the job remains in a failed state.
Job spec looks something like this:
```#!hcl
job "haproxy" {
  datacenters = ["dc1"]
  type        = "system"

  update {
    max_parallel     = 1
    min_healthy_time = "30s"
    auto_revert      = true
  }

  group "haproxy" {
    restart {
      interval = "6m"
      attempts = 10
      delay    = "30s"
      mode     = "delay"
    }

    task "haproxy" {
      driver = "exec"

      artifact {
        source      = "http://packages.local:8000/haproxy-v1.9.4.tar.gz"
        destination = "/"

        options {
          archive = "tar.gz"
        }
      }

      template {
        source        = "/nomad/conf/configs/haproxy/haproxy.cfg.tpl"
        destination   = "/etc/haproxy/haproxy.cfg"
        change_mode   = "signal"
        change_signal = "SIGUSR1"
      }

      config {
        command = "usr/local/sbin/haproxy"
        args    = ["-f", "/etc/haproxy/haproxy.cfg"]
      }

      service {
        name = "haproxy"
        port = "xxx"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 200
        memory = 128

        network {
          mbits = 20

          port "xxx" {
            static = 4443
          }
        }
      }
    }
  }
}
```
Full `nomad alloc status` logs:

Recent Events:

| Time | Type | Description |
|------|------|-------------|
| 2019-03-22T01:13:27Z | Killing | Template failed to send signals [user defined signal 1]: 1 error(s) occurred: * Task not running |
| 2019-03-22T01:13:23Z | Restarting | Task restarting in 31.906118053s |
| 2019-03-22T01:13:23Z | Terminated | Exit Code: 0 |
| 2019-03-22T01:13:23Z | Signaling | Template re-rendered |
| 2019-03-22T01:12:32Z | Started | Task started by client |
| 2019-03-22T01:12:31Z | Downloading Artifacts | Client is downloading artifacts |
| 2019-03-22T01:12:31Z | Task Setup | Building Task Directory |
| 2019-03-22T01:12:31Z | Received | Task received by client |
The overall state also seems wrong; the UI is showing "Running" but all task groups are in a "failed" state and not going anywhere.

In fact, the job and its tasks just continue to fail even after a restart:
ID = 2dd17b65
Eval ID = 61ea2f42
Name = haproxy.haproxy[0]
Node ID = 37cb9dd1
Job ID = haproxy
Job Version = 2
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 3m33s ago
Modified = 3m4s ago
Task "haproxy" is "dead"
Task Resources

| CPU | Memory | Disk | Addresses |
|-----|--------|------|-----------|
| 0/200 MHz | 6.9 MiB/128 MiB | 300 MiB | haproxy_api: 10.0.64.160:4443, haproxy_cloud_admin: 10.0.64.160:7559, haproxy_mcfe: 10.0.64.160:443, haproxy_reporter: 10.0.64.160:5555 |
Task Events:
Started At = 2019-03-22T01:48:15Z
Finished At = 2019-03-22T01:48:44Z
Total Restarts = 1
Last Restart = 2019-03-22T01:48:43Z
Recent Events:

| Time | Type | Description |
|------|------|-------------|
| 2019-03-22T01:48:44Z | Killing | Template failed to send signals [user defined signal 1]: 1 error(s) occurred: * Task not running |
| 2019-03-22T01:48:43Z | Restarting | Task restarting in 34.83377889s |
| 2019-03-22T01:48:43Z | Terminated | Exit Code: 0 |
| 2019-03-22T01:48:43Z | Signaling | Template re-rendered |
| 2019-03-22T01:48:15Z | Started | Task started by client |
| 2019-03-22T01:48:14Z | Downloading Artifacts | Client is downloading artifacts |
| 2019-03-22T01:48:14Z | Task Setup | Building Task Directory |
| 2019-03-22T01:48:14Z | Received | Task received by client |
I don't actually expect the template to have changed in this case (the cluster is stable), but it re-renders anyway?
Changing to change_mode = "restart" seems to work for me as a workaround. I'd still love to dig into why change_mode = "signal" is so brittle here, but alas I don't have the bandwidth today, so I'm filing this here for visibility.
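For reference, the workaround is just a change to the template stanza's change_mode; here's a minimal sketch of the stanza from the spec above with that change applied:

```#!hcl
template {
  source      = "/nomad/conf/configs/haproxy/haproxy.cfg.tpl"
  destination = "/etc/haproxy/haproxy.cfg"

  # Workaround: restart the task on re-render instead of sending SIGUSR1.
  change_mode = "restart"
}
```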
That's quite intriguing; I'm investigating this now. Would you mind posting the haproxy config file as well? I haven't been able to reproduce the "Task not running" part. Also, I assume this is running on CentOS 7 as well?
It's very peculiar that the task was terminated immediately when the template was re-rendered, but the signaling-failure event occurred four seconds later:
| Time | Type | Description |
|------|------|-------------|
| 2019-03-22T01:13:27Z | Killing | Template failed to send signals [user defined signal 1]: 1 error(s) occurred: * Task not running |
| 2019-03-22T01:13:23Z | Restarting | Task restarting in 31.906118053s |
| 2019-03-22T01:13:23Z | Terminated | Exit Code: 0 |
| 2019-03-22T01:13:23Z | Signaling | Template re-rendered |
> The overall state also seems wrong; the UI is showing "Running" but all task groups are in a "failed" state and not going anywhere.
This is the confusing UX problem raised in https://github.com/hashicorp/nomad/issues/5408#issuecomment-475083537. When an alloc fails and is expected to be restarted (note the "Task restarting in..." event), the overall job is marked as running.
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:
Nomad 0.9.3, same problem:

`Killing | Template failed to send signals [hangup]: 1 error(s) occurred: * Task not running`

After a few failures the task does start, but it's annoying that it fails with this error several times.
Same issue here using Nomad 0.9.3:

`Template failed to send signals [hangup]: 1 error(s) occurred: * Task not running`

Two out of three allocations fail with this error; one runs.
The relevant code is here: https://github.com/hashicorp/nomad/blob/ffb83e1ef182e04b8f625112cfe5cbaf1f314e08/client/allocrunner/taskrunner/template/template.go#L462-L465
If the signal fails to send for any reason (including because the task isn't running yet), the task is marked as failed.
Maybe a new issue should be opened for this.
We are using nginx and consul-template for service discovery and ran into the same problem: while the nginx Nomad job is starting, the service index in Consul updates, which results in several signals being sent. This makes the nginx Nomad job impossible to start.
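For illustration only, a minimal sketch of the kind of stanza this describes (hypothetical service and file names, not the commenter's actual config): the template watches a Consul service, so every service-index change while nginx is still starting triggers another re-render and signal.

```#!hcl
template {
  # Hypothetical example: any change to the "app" service's Consul index
  # re-renders this file and sends SIGHUP to nginx, even during startup.
  data = <<EOT
upstream app {
{{- range service "app" }}
  server {{ .Address }}:{{ .Port }};
{{- end }}
}
EOT

  destination   = "local/upstream.conf"
  change_mode   = "signal"
  change_signal = "SIGHUP"
}
```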
Ran into this with many consul templates (nginx configs) on 0.11.3.
Hi, we encountered this same problem. It seems that if the template makes the task fail (some formatting problem or something similar kills the process), fixing the template doesn't make the allocation start again, because Nomad can't apply the configuration change while the allocation isn't running. Could it be a collision between the template signaling and the restart behaviour?

When the template re-render kicks in, the task is set to be killed (it already is), and Nomad doesn't try to restart it again.