I believe that for true rolling updates of jobs, the updated allocation's service endpoints should be removed from Consul first, then Nomad should wait for a grace period so active connections can drain, and only then restart the task.
@bsphere Nomad can't decide what that grace period is, as it varies per job. The correct way to handle this is that Nomad sends a signal indicating the application is being shut down. The application should then fail its health check, which will cause Consul to stop routing traffic to that instance while it drains connections/work, and then it should exit.
The service still exists and thus stays registered in Consul; the only thing changing is its status, which is reflected by its checks.
That seems like a possible solution, but it requires support on the task side.
What about having the grace period in the job settings? This way "legacy" code is still supported.
I think this feature is really important, and counting on the application to handle it is not always a feasible solution.
The first part of this solution (deregistering the consul service first, before initiating the kill sequence) was achieved in #2596.
I believe the next important step is to introduce a delay between the service deregistration and the kill, configurable as part of the Nomad job spec, with the intent of giving other services in a distributed system (like a load balancer) ample time to stop interacting with the service before it is killed.
Please see relevant discussion in #2607 and #2596. I think I've made some important arguments there that haven't been raised here in this ticket yet.
Agree with everything @jemc posted. Ideally Nomad would put the related Consul service into maintenance mode with a configurable timeout (a default of 1 second would already be enough in most cases) before initiating deregistration and SIGTERM. This is especially troublesome right now in combination with github.com/eBay/fabio: it takes a few tens of milliseconds before Fabio removes the route, which leads to client-side 503s. This is fairly problematic, and I don't see a nice solution for it except introducing extra logic in all of our services.
This seems like a fairly trivial thing for Nomad to provide, compared to the amount of development required to make every service handle a SIGTERM by first failing its health check, waiting, and then shutting down.
Also consider the fact that not every service we run with Nomad is under our control (Nginx is one of them).
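In the meantime, the maintenance-mode step can be triggered by hand, or from a deploy script, with the Consul CLI. A sketch, where the service ID `web` and the sleep duration are illustrative:

```sh
# Flag this service instance as under maintenance; its checks go
# critical and Consul stops returning it from DNS/API queries.
consul maint -enable -service=web -reason="deploying"

# Give routers such as Fabio time to drop the route before the
# task is stopped.
sleep 5

# After the task is replaced, clear maintenance mode (deregistering
# the service also clears it).
consul maint -disable -service=web
```

This is roughly what the comments below describe doing during deploys, but it only works when something external drives the deploy; it doesn't cover Nomad-initiated stops.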
I think this issue should be retitled to something like "Graceful shutdown", as this applies to all variations of stopping allocations (drain, stop job, deploy).
@dropje86 Thank you for posting this; I actually have a half-written issue that I was about to post today for exactly the same thing. This also particularly affects the Consul integration with regard to templates and change_signal, and the deploy use case as well. It seems like Nomad should have all the information it needs to trigger a Consul maint or deregister and THEN kill/signal the alloc. This is going to be a big problem, as we can't have client connections simply dropped while we run deploys or change Consul values.
For deploys there is a fairly straightforward workaround of triggering a consul maint during the process, but the case where we'd need Nomad to do it is during a Consul KV update.
This is really important for us. Right now we ignore the soft-kill signal so that the Consul service gets deregistered, then wait out kill_timeout, after which the container is brutally killed. Providing a delay config would help us handle everything gracefully. @dadgar
Proposal:

```hcl
job "docs" {
  group "example" {
    task "server" {
      # ...

      # Delay between deregister and kill signal
      shutdown_delay = "5s"
    }
  }
}
```

Where `shutdown_delay` is the duration between deregistering services from Consul and sending the task the shutdown signal. Defaults to `0` for backward compat.
@schmichael This is just insanely awesome. Thanks :heart: :100:
Thanks for the input everyone! 0.6.1 should be coming out soon with this feature.
@schmichael thank you for the attention on this, this will help with draining services a ton!
Thanks @schmichael, very helpful!