Using Nomad 0.9.0-beta3 here.
I have a haproxy task that I'm trying to restart on template changes (rendered by Consul Template), but the entire task/group fails with:
`Template failed to send signals [user defined signal 1]: 1 error(s) occurred: * Task not running`
And the job remains in a failed state.
Job spec looks something like this:
```#!hcl
job "haproxy" {
  datacenters = ["dc1"]
  type        = "system"

  update {
    max_parallel     = 1
    min_healthy_time = "30s"
    auto_revert      = true
  }

  group "haproxy" {
    restart {
      interval = "6m"
      attempts = 10
      delay    = "30s"
      mode     = "delay"
    }

    task "haproxy" {
      driver = "exec"

      artifact {
        source      = "http://packages.local:8000/haproxy-v1.9.4.tar.gz"
        destination = "/"

        options {
          archive = "tar.gz"
        }
      }

      template {
        source        = "/nomad/conf/configs/haproxy/haproxy.cfg.tpl"
        destination   = "/etc/haproxy/haproxy.cfg"
        change_mode   = "signal"
        change_signal = "SIGUSR1"
      }

      config {
        command = "usr/local/sbin/haproxy"
        args    = ["-f", "/etc/haproxy/haproxy.cfg"]
      }

      service {
        name = "haproxy"
        port = "xxx"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 200
        memory = 128

        network {
          mbits = 20

          port "xxx" {
            static = 4443
          }
        }
      }
    }
  }
}
```
Full `nomad alloc status` logs:

Recent Events:

| Time | Type | Description |
|------|------|-------------|
| 2019-03-22T01:13:27Z | Killing | Template failed to send signals [user defined signal 1]: 1 error(s) occurred: * Task not running |
| 2019-03-22T01:13:23Z | Restarting | Task restarting in 31.906118053s |
| 2019-03-22T01:13:23Z | Terminated | Exit Code: 0 |
| 2019-03-22T01:13:23Z | Signaling | Template re-rendered |
| 2019-03-22T01:12:32Z | Started | Task started by client |
| 2019-03-22T01:12:31Z | Downloading Artifacts | Client is downloading artifacts |
| 2019-03-22T01:12:31Z | Task Setup | Building Task Directory |
| 2019-03-22T01:12:31Z | Received | Task received by client |
The overall state also seems wrong; the UI is showing "Running" but all task groups are in a "failed" state and not going anywhere.

In fact, the job and its tasks just continue to fail even after a restart:
ID = 2dd17b65
Eval ID = 61ea2f42
Name = haproxy.haproxy[0]
Node ID = 37cb9dd1
Job ID = haproxy
Job Version = 2
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 3m33s ago
Modified = 3m4s ago
Task "haproxy" is "dead"
Task Resources

| CPU | Memory | Disk | Addresses |
|-----|--------|------|-----------|
| 0/200 MHz | 6.9 MiB/128 MiB | 300 MiB | haproxy_api: 10.0.64.160:4443, haproxy_cloud_admin: 10.0.64.160:7559, haproxy_mcfe: 10.0.64.160:443, haproxy_reporter: 10.0.64.160:5555 |
Task Events:
Started At = 2019-03-22T01:48:15Z
Finished At = 2019-03-22T01:48:44Z
Total Restarts = 1
Last Restart = 2019-03-22T01:48:43Z
Recent Events:

| Time | Type | Description |
|------|------|-------------|
| 2019-03-22T01:48:44Z | Killing | Template failed to send signals [user defined signal 1]: 1 error(s) occurred: * Task not running |
| 2019-03-22T01:48:43Z | Restarting | Task restarting in 34.83377889s |
| 2019-03-22T01:48:43Z | Terminated | Exit Code: 0 |
| 2019-03-22T01:48:43Z | Signaling | Template re-rendered |
| 2019-03-22T01:48:15Z | Started | Task started by client |
| 2019-03-22T01:48:14Z | Downloading Artifacts | Client is downloading artifacts |
| 2019-03-22T01:48:14Z | Task Setup | Building Task Directory |
| 2019-03-22T01:48:14Z | Received | Task received by client |
I don't actually expect the template to have changed in this case (the cluster is stable), but it re-renders anyway?
Changing to change_mode = "restart" seems to work for me as a workaround. I'd still love to dig into why change_mode = "signal" is so brittle here, but alas I don't have the bandwidth today, so I'm filing this here for visibility.
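For reference, the workaround is just a change to the template stanza's change_mode; here's a minimal sketch of the stanza from the spec above with that change applied:

```#!hcl
template {
  source      = "/nomad/conf/configs/haproxy/haproxy.cfg.tpl"
  destination = "/etc/haproxy/haproxy.cfg"

  # Workaround: restart the task on re-render instead of sending SIGUSR1.
  change_mode = "restart"
}
```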
That's quite intriguing; I'm investigating this now. Would you mind posting the haproxy config file as well? I haven't been able to reproduce the "Task not running" part. Also, I assume this is running on CentOS 7 as well?
It's very peculiar that the task was terminated immediately when the template was re-rendered, but the signaling-failure event occurred four seconds later:
| Time | Type | Description |
|------|------|-------------|
| 2019-03-22T01:13:27Z | Killing | Template failed to send signals [user defined signal 1]: 1 error(s) occurred: * Task not running |
| 2019-03-22T01:13:23Z | Restarting | Task restarting in 31.906118053s |
| 2019-03-22T01:13:23Z | Terminated | Exit Code: 0 |
| 2019-03-22T01:13:23Z | Signaling | Template re-rendered |
> The overall state also seems wrong; the UI is showing "Running" but all task groups are in a "failed" state and not going anywhere.
This is the confusing UX problem raised in https://github.com/hashicorp/nomad/issues/5408#issuecomment-475083537. When an alloc fails and is expected to be restarted (note the "Task restarting in..." event), the overall job is marked as running.
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:
Nomad 0.9.3, same problem:

`Killing | Template failed to send signals [hangup]: 1 error(s) occurred: * Task not running`

After a few failures the task does start, but it's annoying that it fails with this error several times.
Same issue here using Nomad 0.9.3:

`Template failed to send signals [hangup]: 1 error(s) occurred: * Task not running`

Two out of three allocations fail with this error; one runs.
The relevant code is here: https://github.com/hashicorp/nomad/blob/ffb83e1ef182e04b8f625112cfe5cbaf1f314e08/client/allocrunner/taskrunner/template/template.go#L462-L465
If the signal fails to send for any reason (including because the task isn't running yet), the task is marked as failed.
Maybe a new issue should be opened for this.
We are using nginx and consul-template for service discovery and ran into the same problem: while the nginx Nomad job is starting, the service index in Consul updates, which results in several signals being sent. This makes the nginx Nomad job impossible to start.
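For illustration only, a minimal sketch of the kind of stanza this describes (hypothetical service and file names, not the commenter's actual config): the template watches a Consul service, so every service-index change while nginx is still starting triggers another re-render and signal.

```#!hcl
template {
  # Hypothetical example: any change to the "app" service's Consul index
  # re-renders this file and sends SIGHUP to nginx, even during startup.
  data = <<EOT
upstream app {
{{- range service "app" }}
  server {{ .Address }}:{{ .Port }};
{{- end }}
}
EOT

  destination   = "local/upstream.conf"
  change_mode   = "signal"
  change_signal = "SIGHUP"
}
```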
Ran into this with many consul templates (nginx configs) on 0.11.3.
Hi, we encountered this same problem. It seems that if the template makes the task fail (some formatting problem or something similar kills the process), fixing the template doesn't make the allocation start again, because Nomad can't apply the configuration change while the allocation isn't running. Could it be a collision between the template signaling and the restart behaviour?

When the template re-render kicks in, the task is set to be killed (it already is), and Nomad doesn't try to restart it again.