I believe that for true rolling updates of jobs, the updated allocation's service endpoints should be removed from Consul first, then Nomad should wait for a grace period so active connections can drain, and only then restart the task.
@bsphere Nomad can't decide what that grace period is, as it varies per job. The correct way to handle this is that Nomad sends a signal indicating the application is being shut down. The application should then fail its health check, which will cause Consul to stop routing traffic to that instance while it drains connections/work, and then it should exit.
The service still exists and thus stays registered in Consul; the only thing changing is its status, which is reflected by its checks.
That seems like a possible solution, but it requires support on the task side.
What about having the grace period in the job settings? This way "legacy" code is still supported.
I think this feature is really important, and counting on the application to handle it is not always a feasible solution.
The first part of this solution (deregistering the consul service first, before initiating the kill sequence) was achieved in #2596.
I believe the next important step is to introduce a delay between the service deregistration and the kill, configurable as part of the Nomad job spec, with the intent of giving other services in a distributed system (like a load balancer) ample time to stop interacting with the service before it is killed.
Please see relevant discussion in #2607 and #2596. I think I've made some important arguments there that haven't been raised here in this ticket yet.
Agree with everything @jemc posted. Ideally Nomad would put the related Consul service into maintenance mode with a configurable timeout (a default of 1 second would already be enough in most cases) before initiating deregistration and SIGTERM. This is especially troublesome right now in combination with github.com/eBay/fabio: it takes a few tens of milliseconds before Fabio removes the route, which leads to client-side 503s. This is fairly problematic, and I don't see a nice solution for it except introducing extra logic in all of our services.
This seems like a fairly trivial thing for Nomad to provide, compared to the amount of development required to make every service handle a SIGTERM by first failing its health check, waiting, and then shutting down.
Also consider the fact that not every service we run with Nomad is under our control (Nginx is one of them).
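In the meantime, the maintenance-mode step can be triggered by hand, or from a deploy script, with the Consul CLI. A sketch, where the service ID `web` and the sleep duration are illustrative:

```sh
# Flag this service instance as under maintenance; its checks go
# critical and Consul stops returning it from DNS/API queries.
consul maint -enable -service=web -reason="deploying"

# Give routers such as Fabio time to drop the route before the
# task is stopped.
sleep 5

# After the task is replaced, clear maintenance mode (deregistering
# the service also clears it).
consul maint -disable -service=web
```

This is roughly what the comments below describe doing during deploys, but it only works when something external drives the deploy; it doesn't cover Nomad-initiated stops.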
I think this issue should be retitled to something like "Graceful shutdown", as this applies to all variations of stopping allocations (drain, stop job, deploy).
@dropje86 Thank you for posting this; I actually have a half-written issue that I was about to post today for exactly the same thing. This also particularly affects the Consul integration with regard to templates and change_signal, and the deploy use case as well. It seems like Nomad should have all the information it needs to trigger a Consul maint or deregister and THEN kill/signal the alloc. This is going to be a big problem, as we can't have client connections simply dropped while we run deploys or change Consul values.
For deploys there is a fairly straightforward workaround of triggering a consul maint during the process, but the case where we'd need Nomad to do it is during a Consul KV update.
This is really important for us. Right now we ignore the soft-kill signal so that the Consul service gets deregistered, then wait out kill_timeout, after which the container is brutally killed. Providing a delay config would help us handle everything gracefully. @dadgar
Proposal:

```hcl
job "docs" {
  group "example" {
    task "server" {
      # ...

      # Delay between deregister and kill signal
      shutdown_delay = "5s"
    }
  }
}
```

Where `shutdown_delay` is the duration between deregistering services from Consul and sending the task the shutdown signal. Defaults to `0` for backward compat.
@schmichael This is just insanely awesome. Thanks :heart: :100:
Thanks for the input everyone! 0.6.1 should be coming out soon with this feature.
@schmichael thank you for the attention on this, this will help with draining services a ton!
Thanks @schmichael, very helpful!