Nomad: shutdown_delay not considered with group-defined services

Created on 14 Nov 2019 · 8 Comments · Source: hashicorp/nomad

Nomad version

Output from nomad version

$ nomad -v
Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)

Operating system and Environment details

Amazon Linux 2

Issue

Allocations do not seem to respect shutdown_delay on shutdown. I believe this is because, when services were mapped to a task, there was a 1:N correlation between a task and the Consul services to deregister before sending the kill signal. Now that the task does not have a service defined (since it's at the group level), I believe shutdown_delay is being ignored completely.

We see this happening in our production environment: on an allocation shutdown, a kill signal is sent and the service terminates almost immediately, even though we have shutdown_delay set to 10s for the tasks within the group. This results in problematic 502s.

Is this a known issue/regression from upgrading to network namespaces? Should there be a group-level shutdown_delay field introduced?

I see that shutdown_delay is included in the sidecar_task stanza; should it have been included in the more generic group stanza?

Labels: theme/client, theme/consul/connect, type/bug

Source

djenriquez

👍1

All 8 comments

Thanks for the bug report @djenriquez! Shutdown delay was only implemented for task services, but should apply when using group services as well. The 2 implementation options I can think of are:

1. Each task respects its own shutdown_delay.
2. New group level shutdown_delay.

schmichael on 14 Nov 2019

@schmichael Thank you very much for the quick response! Not sure how difficult this work would be, sounds like it would require some struct changes, but would this be a quick one?

I think there is patience internally since our apps are mostly fault-tolerant, but if not, we may need to revert out of network namespaces as the 502s are not pretty to see in our highly dynamic environment.

djenriquez on 14 Nov 2019

Also regarding option 1, I'm not sure how that would work since the services being registered would represent all tasks in the group. You'd probably have to introduce logic to use the greatest shutdown delay of all the tasks.

djenriquez on 14 Nov 2019

Option 1 wouldn't require any struct changes but is arguably the least user-friendly: when an allocation is killed, each task would wait its own shutdown_delay between deregistering services and sending the signal. So if you have 3 tasks in a group and only 1 sets `shutdown_delay`, 2 of the tasks would be killed immediately.

schmichael on 14 Nov 2019

👍1

Ah I see, shutdown signal would be handled differently for each task. That makes a lot of sense, thanks for clarifying.

djenriquez on 15 Nov 2019

Another scenario for shutdown_delay is for sidecar jobs.

In my specific case, I have batch periodic jobs running every hour... they run really fast and generate some logs.

I have filebeat running in the same task group to send the logs to logstash, but what I noticed is that the leader task finishes before filebeat has had a chance to push the logs.

I have shutdown_delay = "30s" set in the filebeat task, but it is not applied when the leader finishes and the filebeat task is instructed to exit.

danlsgiga on 15 Nov 2019

👍2

@drewbailey should this issue have been closed by https://github.com/hashicorp/nomad/pull/6746?