Output from nomad version
$ nomad -v
Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)
Amazon Linux 2
Allocations do not seem to respect shutdown_delay on shutdown. I believe this is because previously, when services were mapped to a task, there was a 1:N correlation between a task and the Consul services to deregister before sending the kill signal. Now that the task does not have a service defined (since it's at the group level), I believe shutdown_delay is being ignored completely.
We see this happening in our production environment: on an allocation shutdown, a kill signal is sent and the service terminates almost immediately, even though we have shutdown_delay set to 10s for the tasks within the group. This results in problematic 502s.
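A minimal sketch of the kind of job spec described above, assuming a bridge-mode group network with a group-level service. The job, group, task, service, and image names are hypothetical placeholders, not the reporter's actual configuration.

```hcl
job "web" {
  datacenters = ["dc1"]

  group "app" {
    # Group-level network namespace, introduced with Nomad 0.10.
    network {
      mode = "bridge"

      port "http" {
        to = 8080
      }
    }

    # Service registered at the group level rather than inside a task.
    service {
      name = "web"
      port = "http"
    }

    task "server" {
      driver = "docker"

      # Task-level delay. Per this report, it is not honored when the
      # service lives at the group level.
      shutdown_delay = "10s"

      config {
        image = "example/web:latest" # hypothetical image
      }
    }
  }
}
```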
Is this a known issue/regression from upgrading to network namespaces? Should there be a group-level shutdown_delay field introduced?
I see that shutdown_delay is included in the sidecar_task stanza. Should it have been included in the more generic group stanza instead?
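To make the question concrete, a group-level field might look like the sketch below. This is hypothetical syntax modeled on the existing task-level shutdown_delay, not a feature of v0.10.0; all names are placeholders.

```hcl
job "web" {
  datacenters = ["dc1"]

  group "app" {
    # Hypothetical group-level field (an assumption, not available in
    # v0.10.0): wait this long after deregistering the group's services
    # before killing its tasks.
    shutdown_delay = "10s"

    network {
      mode = "bridge"
    }

    service {
      name = "web"
      port = "8080"
    }

    task "server" {
      driver = "docker"

      config {
        image = "example/web:latest" # hypothetical image
      }
    }
  }
}
```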
Thanks for the bug report @djenriquez! Shutdown delay was only implemented for task services, but should apply when using group services as well. The 2 implementation options I can think of are:
@schmichael Thank you very much for the quick response! Not sure how difficult this work would be, sounds like it would require some struct changes, but would this be a quick one?
I think there is patience internally since our apps are mostly fault-tolerant, but if not, we may need to revert out of network namespaces as the 502s are not pretty to see in our highly dynamic environment.
Also regarding option 1, I'm not sure how that would work since the services being registered would represent all tasks in the group. You'd probably have to introduce logic to use the greatest shutdown delay of all the tasks.
shutdown_delay -- 2 of the tasks would be killed immediately.
Ah I see, shutdown signal would be handled differently for each task. That makes a lot of sense, thanks for clarifying.
Another scenario for shutdown_delay is for sidecar jobs.
In my specific case, I have periodic batch jobs that run every hour. They run really fast and generate some logs.
I have filebeat running in the same task group to send the logs to logstash, but I noticed that when the leader task finishes, filebeat has not yet had a chance to push the logs.
I have shutdown_delay = "30s" set on the filebeat task, but it is not applied when the leader finishes and the filebeat task is instructed to exit.
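A sketch of the setup described here, assuming an hourly periodic batch job with a leader task and a filebeat log shipper in the same group. The job, group, and task names, the cron expression, and the images are hypothetical.

```hcl
job "hourly-report" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "0 * * * *" # assumed: top of every hour
    prohibit_overlap = true
  }

  group "report" {
    task "generate" {
      driver = "docker"

      # When the leader exits, the other tasks in the group are stopped.
      leader = true

      config {
        image = "example/report:latest" # hypothetical image
      }
    }

    task "filebeat" {
      driver = "docker"

      # Intended to give filebeat time to flush logs. Per this comment,
      # it is not applied when the leader task finishes.
      shutdown_delay = "30s"

      config {
        image = "example/filebeat:latest" # hypothetical image
      }
    }
  }
}
```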
@drewbailey should this issue have been closed by https://github.com/hashicorp/nomad/pull/6746?
Yes thanks, not sure why it didn't auto-close :(