Nomad: Nomad dies if HUP is sent during agent initialization

Created on 20 Feb 2018 · 5 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Operating system and Environment details

Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04

Issue

We provision Nomad with a Chef recipe that sends a reload on config updates and a restart on binary updates:

hashistack_bring_binary 'nomad' do
  url node['nomad']['download_url']
  unzip true
  notifies :restart, 'systemd_unit[nomad.service]', :delayed
end

template '/etc/nomad.d/agent.hcl' do
  source 'nomad.hcl.erb'
  owner 'nomad'
  group 'root'
  mode '0644'
  sensitive true
  notifies :reload_or_try_restart, 'systemd_unit[nomad.service]', :delayed
end

systemd_unit 'nomad.service' do
  content <<-EOU.gsub(/^\s+/, '')
[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/
Requires=network-online.target
After=network-online.target

[Service]
LimitNOFILE=65536
Restart=on-failure
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
ExecReload=/bin/kill -HUP $MAINPID
RestartSec=5s

[Install]
WantedBy=multi-user.target
  EOU
  action [ :create, :enable ]
  notifies :restart, 'systemd_unit[nomad.service]', :delayed
end

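Chef triggers a restart and, immediately afterwards, a reload.
On a new machine, if the Nomad data directory is not yet initialized and Nomad receives HUP, it just exits. systemd reports the main process as killed by SIGHUP and the unit as inactive (dead), so Restart=on-failure does not bring it back: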

# systemctl status nomad
โ— nomad.service - Nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2018-02-19 09:08:02 UTC; 38s ago
     Docs: https://nomadproject.io/docs/
 Main PID: 23627 (code=killed, signal=HUP)

Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Started Nomad.
Feb 19 09:08:02 ip-x-x-x-x nomad[23627]:     Loaded configuration from /etc/nomad.d/agent.hcl
Feb 19 09:08:02 ip-x-x-x-x nomad[23627]: ==> Starting Nomad agent...
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Reloading Nomad.
Feb 19 09:08:02 ip-x-x-x-x systemd[1]: Reloaded Nomad.

On a machine where the data directory already exists, Nomad starts and reloads successfully.

Reproduction steps

Send a HUP signal while Nomad is still initializing the node (the window is very short).

Labels: stage/accepted, theme/core, type/bug

All 5 comments

Hi, thanks for opening this issue. We've added this to our team's near-term roadmap.

I just ran into this issue as well: a restart needed to update the Vault token (#4593), followed by a reload triggered when the Nomad certificates were updated, resulted in Nomad unexpectedly stopping.

We have hit this a few times and it's always an expensive one to hunt down :/

For us, it happens when we're refreshing trust anchors (root CAs) that are rendered w/ consul-template. Downstream triggers cause a nomad cert rotation at the same time we restart nomad in order to refresh nomad's trust anchors (so we have a restart + sighup shortly after).

I was recently looking in this area of the code base and it's a bit of a tricky situation. It takes a certain amount of time for the agent to get itself set up to a point where it's safe to handle SIGHUP, and if we were to accept the signal before we could handle it, we'd end up dropping the reload signal in the situation described here instead (which would be worse).

We probably can set up the signal handler almost immediately but have it emit messages that we pick up in a loop that isn't started until later. This would minimize but not entirely get rid of the race condition in signals on startup -- there's just a certain amount of work that needs to be done before we can even know if we're starting up an agent or not.

Thanks for the feedback, @tgross. The idea of sending the signal to a channel but starting the reader of the signals only when Nomad is ready for it sounds really good!
