Nomad: Need a way to cleanly shut down nodes

Created on 1 Dec 2016 · 13 comments · Source: hashicorp/nomad

Nomad v0.5.0

There doesn't appear to be a way to cleanly shut down a client node that both moves allocations to other nodes and migrates the data in their sticky ephemeral disks. I wrote a script to help my systemd service delay stopping the service until allocations have been moved, but there doesn't appear to be a way to monitor the status of the migrated data. If the data isn't moved quickly, it could be lost when the node shuts down.

Something like nomad shutdown that blocks until the agent is completely idle would be ideal.
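For reference, a sketch of the shape of that script, written here with the drain flags Nomad later grew (0.8+) for clarity; the deadline and service name are assumptions, not the exact file:

# Approximate a "nomad shutdown": drain this node, wait for allocations
# (including sticky ephemeral disk data) to move, then stop the agent.
nomad node drain -self -enable -yes -deadline 30m
# The drain command blocks and reports progress until it completes,
# so only stop the agent afterwards.
systemctl stop nomad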

Labels: stage/needs-discussion, theme/client, type/enhancement

All 13 comments

Consul has the leave command. Would be nice to have a similar command in nomad, which would trigger a node drain, wait for it to complete, and then gracefully leave the cluster.

@groggemans Nomad 0.8 added advanced node draining features. Some useful links (a usage sketch follows them):

Node drain command
Blog post that explains node draining features
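For example (the node ID and deadline here are illustrative):

# Drain a node, blocking until allocations have migrated or the deadline expires:
nomad node drain -enable -yes -deadline 1h 4995dacd

# Or mark it ineligible for new placements without draining it yet:
nomad node eligibility -disable 4995dacd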

I know, and it solves/implements the draining part, but then the node should still gracefully leave the cluster. And I think the only way to do this now is by stopping/interrupting the service (with leave_on_terminate = true or leave_on_interrupt = true).

Setting leave_on_interrupt or leave_on_terminate to true isn't always desirable, but it should still be possible to do a graceful leave from the CLI even when both options are false (the default).

For servers there's the force-leave option, but for clients there's no command to do a graceful leave. A universal leave command that works for both servers and clients and also triggers a node drain seems to be missing.
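For reference, this is roughly how those options are set today (the config path is an assumption; both default to false, and neither triggers a drain by itself):

cat <<'EOF' >> /etc/nomad.d/agent.hcl
leave_on_interrupt = true   # gossip leave on SIGINT
leave_on_terminate = true   # gossip leave on SIGTERM
EOF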

A relevant discussion happened in #4305. @insanejudge, @schmichael.

Indeed, draining the node on shutdown is the best approach, and the service file could be adjusted to do that. However, a graceful restart cannot be implemented in a systemd service, because systemd cannot distinguish a shutdown from a restart. Regardless, KillMode=control-group (the default) is better than KillMode=process because the latter does not guarantee cleanup; it is important to leave no unmanaged processes behind.
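A sketch of such a service-file adjustment as a drop-in, assuming Nomad 0.8+ and hypothetical paths and deadline (note that systemd also runs ExecStop during a restart, which is exactly the limitation above):

cat <<'EOF' > /etc/systemd/system/nomad.service.d/drain.conf
[Service]
# Drain allocations off this node before the agent receives the stop signal.
ExecStop=/usr/local/bin/nomad node drain -self -enable -yes -deadline 15m
# Kill everything in the unit's cgroup so no unmanaged processes remain.
KillMode=control-group
# Give the drain time to finish before systemd gives up on the unit.
TimeoutStopSec=20min
EOF
systemctl daemon-reload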

Not sure if this is relevant or related - but even when I have leave_on_terminate set in my config - it doesn't seem to fully leave. I've been doing some testing with the stuff above, and I can see in my logs the node is cleanly shutting down:

nomad: ==> Caught signal: terminated
nomad: ==> Gracefully shutting down agent...
nomad[2525]: agent: requesting shutdown
nomad: 2018/09/12 16:11:52.306399 [INFO] agent: requesting shutdown
nomad: 2018/09/12 16:11:52.306468 [INFO] client: shutting down
nomad[2525]: client: shutting down
nomad: 2018/09/12 16:11:52.320998 [INFO] agent: shutdown complete
nomad[2525]: agent: shutdown complete

but when I check nomad node status it still shows as down:

$ nomad node status
ID        DC    Name    Class   Drain  Eligibility  Status
4995dacd  east  agent1  <none>  false  ineligible   down

Is that expected behavior? I would expect that once the node leaves the cluster it no longer appears in the status output.

@onlyjob Does systemd allow configuring different signals for reloads, restarts, and shutdowns? If so we could use SIGHUP, SIGINT, and SIGTERM respectively to separate the shutdown behaviors. Adding APIs+CLI commands would also be useful. This is definitely something we're hoping to do, but I don't know if it will make it into 0.9.0.
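For illustration, these are the pieces systemd does expose today (paths are assumptions; note there is no separate restart signal, which is the sticking point below):

cat <<'EOF' > /etc/systemd/system/nomad.service.d/signals.conf
[Service]
# SIGHUP for configuration reloads:
ExecReload=/bin/kill -HUP $MAINPID
# Signal sent on "systemctl stop" (and during the stop phase of a restart):
KillSignal=SIGTERM
EOF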

@dcparker88 Unfortunately leave_on_terminate is not implemented for clients, so yes, that is expected.

No, it doesn't... There is ExecStop but no ExecRestart... Anyway, IMHO it is _wrong_ to distinguish. The node should be drained on restart as well, because that is the only safe approach. If the updated executable fails to start, the system will end up with dangling, unaccounted-for services.

@onlyjob Nomad will continue to support in-place upgrades (restarting the agent without draining) for at least a couple of reasons:

  1. Some jobs are expensive to restart/migrate (QEMU VMs)
  2. We do not want to tie the lifetime/stability of the Nomad client agent to all of the tasks it runs. We try to isolate defects in our code from affecting user services.

That being said, we've definitely come close to dropping support for in-place upgrades. I could see it happening someday, but for now we intend to support restarts that don't affect tasks.

It is OK if you are committed to supporting restarts without draining. However, this is unsafe and therefore should be configurable. Moreover, draining the node on restart should be the default behaviour. It is not OK to leave dangling VMs, precisely because they are not cheap to restart.
It is a classic "speed over safety" dilemma.

Betting on perfect stability of the Nomad client is a strategy for a perfect world, like saying that defects in your code will never happen.
One day something unforeseen will happen on an architecture that your CI does not cover, and the client will fail to start for whatever reason - a low-memory condition, for example. How do you know there will be enough memory available to start Nomad if it doesn't terminate its jobs?

http://thecodelesscode.com/case/96

@schmichael ah thanks - that makes sense then. Do you know if that's a planned feature, or should I just continue to use GC to clean out down nodes?

@dcparker88: Do you know if that's a planned feature, or should I just continue to use GC to clean out down nodes?

We hoped shutdown improvements would land in 0.9.0, but some larger features (e.g. plugins) take priority, so you may want to continue using GC for the time being. If they don't make it into 0.9.0, hopefully we'll get them out in a patch release.
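In the meantime, a GC can be triggered manually; a sketch (the address is an assumption, and the system gc subcommand needs a reasonably recent CLI):

# Ask the servers to garbage-collect terminal objects, including down nodes:
nomad system gc

# Equivalent HTTP API call:
curl -X PUT "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/system/gc"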

@onlyjob: ...therefore should be configurable. Moreover draining node on restart must be default behaviour.

This is the plan!

@onlyjob: Betting on perfect stability of the Nomad client is a strategy for the perfect world, like saying that defects in your code (will) never happen.

This is precisely reason 2 I gave above for supporting in-place upgrades. A guiding principle in Nomad's design is: in the face of errors, do not stop user services! Nomad downtime should prevent further scheduling, but it should avoid causing service downtime as much as possible.

Thanks. :) I think there is a flaw in this reasoning... We need to separate two issues: avoiding stopping services during normal operations, and the case when the Nomad client itself is restarting.
It violates principles of integrity and common sense to leave scheduled jobs running when the Nomad client has exited...
Service downtime is _necessary_ when the manager/dispatcher is restarting, because it is the only safe mode of operation.

What if the updated Nomad disagrees with the running Docker over the API version?

Basically, what we want is: if Nomad itself runs into unexpected issues, leave the task runtime alone and confine a Nomad issue to being just a Nomad issue as much as possible (the smallest blast radius possible). On the other hand, if it's an intentional shutdown of the Nomad client, provide a way to trigger a clean shutdown of the task runtimes.

I think there might be a fine line here, @schmichael.

Ideally, if the Nomad client itself crashes or shuts down for reasons not initiated by an operator, it should not trigger task shutdown. Only if it's an operator-initiated shutdown should it trigger (and wait for the completion of) a clean shutdown of all tasks.

Would a client.drain_shutdown = true agent configuration parameter fit your use case? The idea being that when the Nomad client receives the signal to shut down, it blocks exiting until it has drained all running allocations.

So would a signal be a good way to indicate it's an intentional shutdown? Like, instead of having client.drain_shutdown = true, how about client.drain_shutdown_signal = SIGINT, something along those lines.
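To make the two proposals concrete, a purely hypothetical sketch (neither knob exists in Nomad at this point; the config path is also an assumption):

cat <<'EOF' >> /etc/nomad.d/client.hcl
client {
  # Proposed: block agent exit on shutdown until all allocations drain.
  drain_shutdown = true

  # Counter-proposal: only this signal marks an intentional shutdown
  # and triggers (and waits for) a clean drain of all tasks.
  # drain_shutdown_signal = "SIGINT"
}
EOF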
