Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
We're provisioning Nomad with a Chef recipe that sends a reload on config updates and a restart on binary updates:
hashistack_bring_binary 'nomad' do
  url node['nomad']['download_url']
  unzip true
  notifies :restart, 'systemd_unit[nomad.service]', :delayed
end

template '/etc/nomad.d/agent.hcl' do
  source 'nomad.hcl.erb'
  owner 'nomad'
  group 'root'
  mode '0644'
  sensitive true
  notifies :reload_or_try_restart, 'systemd_unit[nomad.service]', :delayed
end

systemd_unit 'nomad.service' do
  content <<-EOU.gsub(/^\s+/, '')
    [Unit]
    Description=Nomad
    Documentation=https://nomadproject.io/docs/
    Requires=network-online.target
    After=network-online.target

    [Service]
    LimitNOFILE=65536
    Restart=on-failure
    ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
    ExecReload=/bin/kill -HUP $MAINPID
    RestartSec=5s

    [Install]
    WantedBy=multi-user.target
  EOU
  action [ :create, :enable ]
  notifies :restart, 'systemd_unit[nomad.service]', :delayed
end
Chef triggers a restart and, immediately after it, a reload.
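On a new machine, where the Nomad data directory is not yet initialized, Nomad just exits when it receives the HUP. systemd treats this as a clean exit (the unit ends up inactive rather than failed), so Restart=on-failure does not bring it back: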
# systemctl status nomad
● nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2018-02-19 09:08:02 UTC; 38s ago
     Docs: https://nomadproject.io/docs/
 Main PID: 23627 (code=killed, signal=HUP)
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Started Nomad.
Feb 19 09:08:02 ip-x-x-x-x nomad[23627]: Loaded configuration from /etc/nomad.d/agent.hcl
Feb 19 09:08:02 ip-x-x-x-x nomad[23627]: ==> Starting Nomad agent...
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Reloading Nomad.
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Reloaded Nomad.
On a machine where the data directory already exists, Nomad starts and reloads successfully.
To reproduce, send a HUP signal while Nomad is still initializing the node (the window is very short).
Hi, thanks for opening this issue. We've added this to our team's near-term roadmap.
I just ran into this issue as well: a restart needed to update the Vault token (#4593), combined with a reload triggered when the Nomad certificates were updated, resulted in Nomad unexpectedly stopping.
We have hit this a few times and it's always an expensive one to hunt down :/
For us, it happens when we're refreshing trust anchors (root CAs) that are rendered w/ consul-template. Downstream triggers cause a nomad cert rotation at the same time we restart nomad in order to refresh nomad's trust anchors (so we have a restart + sighup shortly after).
I was recently looking in this area of the code base and it's a bit of a tricky situation. It takes a certain amount of time for the agent to get itself set up to a point where it's safe to handle SIGHUP, and if we were to accept the signal before we could handle it, we'd end up dropping the reload signal in the situation described here instead (which would be worse).
We probably can set up the signal handler almost immediately but have it emit messages that we pick up in a loop that isn't started until later. This would minimize but not entirely get rid of the race condition in signals on startup -- there's just a certain amount of work that needs to be done before we can even know if we're starting up an agent or not.
Thanks for the feedback, @tgross. The idea of sending the signal to a channel, but starting the reader of that channel only once Nomad is ready for it, sounds really good!