Nomad v0.3.2
Ubuntu 14.04
The Nomad client writes some of its temporary files to /tmp. When alloc_dir is not erased on reboot, the client restarts all allocations that were previously running on the node and then terminates them once it receives a command from the server (note the Started, Started, Killed event sequence in the allocation status below). Because the socket in /tmp is gone after the reboot, the client also produces an error log (see below).
In our setup we worked around this by pointing alloc_dir at /tmp.
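For reference, the workaround looks roughly like this in the client configuration (a sketch only; the exact subdirectory under /tmp is our choice, not a Nomad default):

client {
  enabled = true
  # Workaround: keep allocation data under /tmp, which Ubuntu clears on boot,
  # so the client has nothing to restore (and then kill) after a reboot.
  alloc_dir = "/tmp/nomad/alloc"
}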
ID = infra-cluster-broccoli
Name = infra-cluster-broccoli
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
==> Evaluations
ID        Priority  Triggered By  Status
194b03f7  50        node-update   complete
36b8d462  50        node-update   complete
319c009c  50        node-update   complete
==> Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status
6d9c70aa  36b8d462  3df25888  server      run      running
c130ac62  319c009c  449c9ba8  server      stop     complete
ID = c130ac62
Eval ID = 319c009c
Name = infra-cluster-broccoli.server[0]
Node ID = 449c9ba8
Job ID = infra-cluster-broccoli
Client Status = complete
==> Task Resources
Task: "server"
CPU  Memory MB  Disk MB  IOPS  Addresses
500  1024       300      0     http: 10.250.18.29:9000
==> Task "server" is "dead"
Recent Events:
Time                    Type                   Description
06/10/16 14:12:30 CEST  Killed                 Task successfully killed
06/10/16 14:12:23 CEST  Started                Task started by client
04/10/16 16:59:45 CEST  Started                Task started by client
04/10/16 16:59:44 CEST  Downloading Artifacts  Client is downloading artifacts
04/10/16 16:59:44 CEST  Received               Task received by client
log_level = "INFO"
datacenter = "dc1"
data_dir = "/var/lib/nomad"
bind_addr = "0.0.0.0"
advertise {
  http = "10.250.18.28:4646"
  rpc = "10.250.18.28:4647"
  serf = "10.250.18.28:4648"
}
client {
  enabled = true
  servers = ["10.250.18.27"]
  options {
    "driver.raw_exec.enable" = "1"
    "driver.exec.enable" = "1"
    "driver.docker.enable" = "1"
  }
}
==> Caught signal: terminated
2016/10/06 14:10:43 [INFO] agent: requesting shutdown
2016/10/06 14:10:43 [INFO] client: shutting down
2016/10/06 14:10:43 [INFO] agent: shutdown complete
Loaded configuration from /etc/nomad.d/client/config.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Atlas: <disabled>
Client: true
Log Level: INFO
Region: global (DC: dc1)
Server: false
==> Nomad agent started! Log data will stream in below:
2016/10/06 14:12:19 [INFO] client: using state directory /var/lib/nomad/client
2016/10/06 14:12:19 [INFO] client: using alloc directory /var/lib/nomad/alloc
2016/10/06 14:12:19 [INFO] fingerprint.cgroups: cgroups are available
2016/10/06 14:12:23 [WARN] fingerprint.env_gce: Could not read value for attribute "machine-type"
2016/10/06 14:12:23 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
2016/10/06 14:12:23 [WARN] fingerprint.network: Unable to read link speed from /sys/class/net/eth0/speed
2016/10/06 14:12:23 [WARN] client: port not specified, using default port
2016/10/06 14:12:23 [INFO] client: setting server address list: [10.250.18.27:4647]
2016/10/06 14:12:23 [ERR] driver.raw_exec: error connecting to plugin so destroying plugin pid and user pid
2016/10/06 14:12:23 [ERR] driver.raw_exec: error destroying plugin and userpid: 2 error(s) occurred:
* os: process already finished
* os: process already finished
2016/10/06 14:12:23 [ERR] client: failed to open handle to task 'server' for alloc 'c130ac62-268f-3ae8-3aac-95d315f37b99': error connecting to plugin: error creating rpc client for executor plugin: Reattachment process not found
Hey @Gerrrr,
This is actually expected behavior: the client is attempting to re-attach to anything that was already running. This can be useful if, for example, you kill the Nomad client to do an in-place upgrade, then start it up again and have it find all the processes it was managing (see the sketch below).
In your case there is nothing to connect to anymore because the tasks are dead, so it is just cleaning up.
Let me know if that made sense!
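As a rough illustration of the in-place restart case described above (a sketch only, assuming the agent is supervised as a service named "nomad" and runs with the client config shown earlier):

# Restart only the Nomad agent; executor/task processes keep running.
sudo service nomad restart
# On startup the client re-attaches to the still-running executors, so the
# allocation should still report its task as "running" rather than restarting it.
nomad alloc-status 6d9c70aa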
Hi @dadgar,
Thanks for the explanation; it makes sense to me for in-place upgrades or when you simply restart the Nomad client.
In our case the problem was that, after a VM reboot, the Nomad client started allocations that had already been rescheduled elsewhere, so we briefly ended up with jobs running multiple times. The rebooted client then immediately sent SIGKILL to the jobs it had just started.
However, SIGKILL cannot be caught, so the jobs cannot do any cleanup (see the small sketch below). Arguably they should not have been started in the first place, should they? We worked around it by pointing alloc_dir at /tmp so that the client does not try to run previously allocated jobs after a reboot.
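As a small aside on the cleanup point (an illustrative sketch, not taken from the issue): a task can trap SIGTERM and run a cleanup handler, but SIGKILL can never be trapped, so no handler runs when it is used.

#!/bin/sh
# Illustrative task script: a TERM handler performs cleanup, but a KILL
# signal bypasses the trap entirely and the cleanup never runs.
trap 'echo "cleaning up"; exit 0' TERM
sleep 3600 &
wait $!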
Ah, thanks for the clarification. Will re-open.