Containers-roadmap: [ECS] [request]: Tasks scheduled with the DAEMON strategy should not be stopped when container instance is placed into DRAINING

Created on 23 Jan 2019 · 27 Comments · Source: aws/containers-roadmap

Tell us about your request
A typical use case of the DAEMON scheduling strategy is to run daemons that monitor containers, collect logs, etc. Imagine the following scenario:

1. Schedule a DAEMON service on a cluster that runs a daemon responsible for collecting logs from all other tasks (logstash, fluentd, etc.)
2. Put a container instance into DRAINING
3. The DAEMON service potentially gets killed before other tasks (per the documentation)
4. Logs from the remaining tasks never get shipped because ECS has killed logstash, fluentd, etc.

Kubernetes solves this by not evicting pods managed by a DaemonSet when a node is drained. Similar functionality is desired in ECS. Otherwise, in this scenario, logs can be lost.
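For readers who want to reproduce the scenario, putting a container instance into DRAINING can be done through the ECS `UpdateContainerInstancesState` API. The sketch below is an editorial illustration, not part of the original request. It assumes a boto3 ECS client is passed in; the function name, cluster name, and ARN are placeholders.

```python
def drain_container_instance(ecs_client, cluster, container_instance_arn):
    """Ask ECS to move one container instance into the DRAINING state.

    `ecs_client` is expected to be a boto3 ECS client, e.g. boto3.client("ecs").
    Returns the list of failures reported by the API (empty on success).
    """
    response = ecs_client.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    # Per-instance problems are reported in the response body.
    return response.get("failures", [])


# Usage (placeholder values, requires AWS credentials):
#   import boto3
#   failures = drain_container_instance(
#       boto3.client("ecs"), "my-cluster", "<container-instance-arn>")
```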

Which service(s) is this request for?
ECS

Are you currently working around this issue?
We run logstash/fluentd outside of ECS so that it remains running at all times.

ECS Proposed

joshuabaird

👍95 👀1

Most helpful comment

Hi Everyone,

Thank you so much for all the feedback regarding this issue.
We are currently working on a feature (coming soon) for the Daemon Scheduler that will be available out of the box for all customers:
ECS will now ensure that Daemon tasks are the last task to drain from an Instance - this will allow the monitoring daemons to pick up trailing logs or metrics. This should help resolve the issue as defined above. I understand that there was also an ask to not drain/stop the daemon at all; however, we believe that certain customers use Daemon tasks for application containers as well, hence the decision to gracefully shut them down, but only after the replica (application) tasks have been stopped.

Hope this helps!

pavneeta on 2 Nov 2020

🎉10 🚀3 ❤3 👍1

All 27 comments

+1 to this, but please put it behind some kind of flag at the task level.

Most of our Daemon tasks fall under this requirement: they're metrics/logging/etc containers so they have to be running all the time on every instance and would benefit from not being stopped after a drain.

However we also have clusters dedicated to running some heavyweight applications, where we want only one task per instance. Daemon tasks are great for this, we scale on the EC2 instances instead of ECS and rely on ECS to manage everything else. But we want those containers stopped after a drain, leaving only the metrics/logging/etc ones running.

xose on 5 Feb 2019

👍17

Yes... +1 but with the flag at the task level..

nunofernandes on 6 Feb 2019

👍2

Even just killing REPLICA tasks before DAEMON tasks would be nice.

eedwards-sk on 12 Apr 2019

👍16

Would also be great if this is applied to scheduled tasks (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduling_tasks.html). Thanks!

dsouzajude on 3 Jun 2019

+1 with a flag to say if they remain in DRAINING or not

CpuID on 19 Aug 2019

While the draining state is being reworked, the following issue might be worth considering for implementation at the same time:
[ecs] - de-register from Cloud Map / R53 when instance is draining
https://github.com/aws/containers-roadmap/issues/473

jespersoderlund on 16 Sep 2019

This is also a problem for DAEMON tasks that are connected to a Load Balancer. I would expect the ECS scheduler to respect the draining timeout set for the task's target group.

What happens is that ECS tries to do all of the following at the same time:

1. Stop the task
2. Deregister the task from the target group
3. Start the draining process

2019-11-06 11:38:54 +0100
service tracker has reached a steady state.
658253d0-09ba-417c-9d09-278ab036e37b
2019-11-06 11:38:44 +0100
service tracker has begun draining connections on 1 tasks.
cfc8396e-c0d1-4aea-abd3-77f7495a63e5
2019-11-06 11:38:44 +0100
service tracker deregistered 1 targets in target-group X
55806ae5-4492-4b9b-bc8f-128b7a758e33
2019-11-06 11:38:44 +0100
service tracker has stopped 1 running tasks: task 0f9375327d974fadb36b14203669ed55.
ab25021d-1e9d-4dd4-b456-15bda44a248b
2019-11-06 11:38:44 +0100
(daemon service tracker) task 0f9375327d974fadb36b14203669ed55 no longer satisfies placement constraints.
85b57279-59d1-402b-a106-348ef82cf3f4
2019-11-06 11:38:44 +0100
(daemon service tracker) updated desired count to 2.

toredash on 6 Nov 2019

👍5

We worked around this issue for now by blocking the SIGTERM that is sent to the DAEMON container. Then we are able to drain connections.
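The commenter did not share code, so the following is only a minimal sketch of what absorbing SIGTERM could look like, assuming the daemon is wrapped by a Python entrypoint. The class name and grace period are hypothetical. ECS follows SIGTERM with SIGKILL once the container's stop timeout expires (30 seconds by default), so the grace period should fit inside that window.

```python
import signal
import threading
import time


class DeferredShutdown:
    """Absorb SIGTERM, then keep running for a grace period before exiting."""

    def __init__(self, grace_seconds, sleep=time.sleep):
        self.grace_seconds = grace_seconds
        self._sleep = sleep
        self.term_received = threading.Event()

    def install(self):
        # Replace the default SIGTERM action (immediate termination).
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only record the request; the caller decides when to exit.
        self.term_received.set()

    def wait_for_shutdown(self, poll_seconds=1.0):
        # Block until SIGTERM has been seen, then stay alive for the grace
        # period so in-flight connections or trailing logs can be handled.
        while not self.term_received.wait(poll_seconds):
            pass
        self._sleep(self.grace_seconds)


# Usage (hypothetical entrypoint):
#   shutdown = DeferredShutdown(grace_seconds=20)
#   shutdown.install()
#   start_daemon_in_background()   # placeholder for the real work
#   shutdown.wait_for_shutdown()
```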

toredash on 16 Nov 2019

👍2

Hi @xose, we are currently in the design phase of this feature to improve the Daemon Service Scheduler for ECS. Could you please help us understand the first use case a bit more? You want the Daemon tasks to never be killed on the instance, even when it has been drained of the application tasks? Are you constantly gathering logs/metrics on empty EC2 instances as well? If so, could you share the reason?

Based on the feedback here, my understanding is that the Daemon service needs to satisfy the conditions below:

1. ECS should ensure that there is one daemon task per instance
2. It should be the first task to be scheduled on any given instance
3. It should be the last task to be killed when an instance is drained/stopped

If the ECS Daemon service satisfies these conditions, does it solve your use case?

Would love to get some feedback from everyone on this issue. Thanks.
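As an editorial illustration of the last condition above (not ECS's actual scheduler code), the stop order during a drain can be sketched as a stable sort that moves DAEMON tasks to the end. The `Task` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    strategy: str  # "REPLICA" or "DAEMON"


def drain_order(tasks):
    """Return tasks in stop order: REPLICA tasks first, DAEMON tasks last.

    sorted() is stable, so tasks with the same strategy keep their
    original relative order.
    """
    return sorted(tasks, key=lambda task: task.strategy == "DAEMON")
```

Note that this says nothing about ordering among several DAEMON tasks; they simply stay in their incoming order.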

pavneeta on 21 Feb 2020

👍8

For my use case where we have logging, x-ray and other metric collecting daemon services, this would be perfect. As long as these are the last containers to stop, then we won't be losing monitoring as the containers are drained.

RyanFrench on 21 Feb 2020

I would be fine with 1 and 2 as stated.

3 is good, or it would also be fine (maybe preferred) to leave the tasks running even after draining. That would make it possible to pick up non-container logs/metrics during shutdown.

rothgar on 21 Feb 2020

I worked around this using dependsOn in the container definition.

https://github.com/waneal/ecs-daemon-protector

waneal on 29 Feb 2020

👍4

Hi @rothgar, could you please help me understand this more? If you have drained the host EC2 instance, it has no replica tasks running, so why would you want to run a Daemon task on it? Wouldn't you prefer that the instance be scaled in?

pavneeta on 3 May 2020

Sometimes instances are drained but left running for troubleshooting or testing. The daemon tasks are used for logging and metrics from the host no matter if other ECS tasks are running on it or not.

rothgar on 3 May 2020

👍2

@pavneeta Could you please provide an update? Will there be an AWS native solution to this problem soon?

joshuabaird on 18 Aug 2020

👍1

@pavneeta are you able to provide an update if this is coming shortly on the roadmap...?

CpuID on 17 Sep 2020

@pavneeta Any update on this?

joshuamoore on 18 Sep 2020

Just hit this again today :(

CpuID on 27 Oct 2020

Hi @pavneeta

Some extra feedback for you re your questions, yes to all points 1/2/3.

One consideration: if you have multiple DAEMON services, the order of startup/shutdown may need to be considered. For example, you have something like Consul Agent (for service discovery) + Filebeat (for log aggregation) running on each node, and you want to ensure they both stay up until all non-DAEMON ECS tasks are terminated, then take down Consul Agent, then take down Filebeat (or vice versa).

I think just taking down all the DAEMON services together at the end is likely fine as a first pass of this feature though. It's hard to decide which ones should stay running longer than others after non-DAEMON tasks are terminated.

CpuID on 27 Oct 2020

"ECS will now ensure that Daemon tasks are the last task to drain from an Instance - this will allow the monitoring daemons to pick up trailing logs or metrics."

amazing, can't wait @pavneeta thank you!

CpuID on 2 Nov 2020

Hi @pavneeta ,

This is great news indeed, thanks for sharing. I do have one question though - I don't believe there is currently a mechanism to specify dependencies between Daemon services? Can you explain a bit about what the termination ordering would be _within the daemon services running_ on each ECS Container Instance?

Many thanks,

Edd

eddgrant on 2 Nov 2020

Hi Everyone, this feature has shipped. All customers will see this behavior by default, with no opt-in required.

pavneeta on 30 Nov 2020

🎉2

@eddgrant - ECS will not follow a specific order of draining between multiple Daemon services. At this time there is no way for you to define dependencies between daemon services. Can you please share more information regarding your use case for that?

pavneeta on 30 Nov 2020

Hi @pavneeta

Regarding this statement...

"I understand that there was also an ask to not drain/stop the daemon at all, however we believe that certain customers use Daemon tasks for application containers as well, hence the decision to gracefully shut them down but only after the replica (application) tasks have been stopped."

Is that the end of it, or is there anything on the roadmap to fulfil the requirement?

We are trying to move as many of our "agent" type services as possible out of the host and into privileged daemon containers. For example, running a fluentd container with a mounted volume of /var/log rather than fluentd on the host.

This is impossible if you're always going to kill the daemons though, as we'll stop {DOING X} for whatever host is in draining mode. This could be logs, host metrics, security scanning, etc.

wimnat on 1 Dec 2020

Hi Wimnat - can you please unpack that a little more for me? If all replica tasks have been shut down and the instance is in draining mode, why would you not want the daemon tasks to be stopped before the instance is terminated? Is it to collect host-level metrics/logs?

The reason we did not want to build that in is that some customers use the Daemon scheduler type to run applications, so eliminating draining altogether is risky for them.

pavneeta on 18 Dec 2020

Yes... I'm one such user of Daemon tasks that run applications. Draining means draining, so all tasks should be removed when the draining process is requested. I don't mind that the daemon tasks are the last ones to go down, but they must be removed at some point in the lifecycle.

I would suggest that if a feature like that is needed, a new RFE should be submitted to request a new mode besides draining. Call it userapps_off or something like that.

nunofernandes on 18 Dec 2020
