Request for comment: please share your thoughts on this proposed improvement to ECS!
With this improvement, ECS will automate instance and task draining.
Customers can opt in to automated instance draining for each of their clusters, using the AWS CLI, API, SDK, CloudFormation, or Console.
For ECS cluster instances that are members of an EC2 Auto Scaling Group (ASG), ECS will set the instance to the DRAINING state whenever the ASG initiates termination of the instance. (The existing behavior of DRAINING is that ECS prevents new tasks from being scheduled for placement on the container instance. Service tasks on the draining container instance that are in the PENDING state are stopped immediately. If there are container instances in the cluster that are available, replacement service tasks are started on them. Service tasks on the container instance that are in the RUNNING state are stopped and replaced according to the service's deployment configuration parameters, minimumHealthyPercent and maximumPercent.)
ECS will prevent the container instance from terminating until all service tasks on the instance have stopped (up to a maximum of 48 hours, based on how ASG instance lifecycle hooks work).
The functionality is similar to what was published in this blog. The main difference is that ECS will automate it for you.
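For anyone unfamiliar with the mechanics being automated here: today, putting an instance into the DRAINING state is a manual API call. A minimal boto3 sketch of the step the proposal would perform for you (the cluster name and container instance ARN are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Manually put a container instance into the DRAINING state -- the step this
# proposal would trigger automatically when the ASG starts terminating the
# instance. The cluster name and container instance ARN are placeholders.
ecs.update_container_instance_state(
    cluster="my-cluster",
    containerInstances=[
        "arn:aws:ecs:us-east-1:123456789012:container-instance/0123456789abcdef"
    ],
    status="DRAINING",
)
```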
Yes, this sounds great. Although we can manage removing instances from an ASG ourselves, it is a painful process, especially when the ASG tries to re-balance itself across AZs while terminating instances in bulk (typically during scale-down actions) and the placement policies are set up to do that.
This would be great. I have a slightly gnarly Lambda function, triggered by autoscaling termination hooks, that loops via SNS over and over until the instance no longer has any tasks running on it. While it works fine, it makes me a little uncomfortable every time I look at it.
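For readers who haven't seen that pattern: the lifecycle hook publishes to an SNS topic that triggers the Lambda, and the Lambda re-publishes the same message to re-invoke itself until the instance is empty. A minimal sketch of the idea (not the blog's exact code; names are placeholders, pagination and error handling omitted):

```python
import json

import boto3

ecs = boto3.client("ecs")
sns = boto3.client("sns")
autoscaling = boto3.client("autoscaling")

CLUSTER = "my-cluster"  # placeholder


def container_instance_for(ec2_instance_id):
    # Find the container instance backing this EC2 instance (pagination omitted).
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return None
    for ci in ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]:
        if ci["ec2InstanceId"] == ec2_instance_id:
            return ci
    return None


def handler(event, context):
    record = event["Records"][0]["Sns"]
    message = json.loads(record["Message"])
    if message.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return

    instance = container_instance_for(message["EC2InstanceId"])
    if instance is None:
        return

    if instance["status"] != "DRAINING":
        ecs.update_container_instance_state(
            cluster=CLUSTER,
            containerInstances=[instance["containerInstanceArn"]],
            status="DRAINING",
        )

    if instance["runningTasksCount"] > 0:
        # Still draining: re-publish the hook message so this function runs
        # again, rather than sleeping inside the Lambda.
        sns.publish(TopicArn=record["TopicArn"], Message=json.dumps(message))
    else:
        # Instance is empty: let the ASG finish terminating it.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=message["LifecycleHookName"],
            AutoScalingGroupName=message["AutoScalingGroupName"],
            LifecycleActionToken=message["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE",
        )
```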
Probably worth raising as a separate feature request, but in a slightly related vein, I'd also love to see daemonset services be the last to be evicted from a container instance. Right now our log shipper runs as a daemonset and is one of the first things to be stopped when a container instance is drained, meaning logs for the tasks still being connection-drained from the ALB are dropped.
We actually use a slightly modified version of the Lambda solution to do this; we would love to see this become standard functionality.
I think we used the blog post to set it up. There were some bugs in the solution that we had to fix. It has been stable and works well, but it would be great to have ECS manage this instead of needing the bolted-on Lambda solution.
Take it one step further: imagine being able to drain tasks from Fargate onto an ECS instance. This unlocks a number of different opportunities for both cost and density stories.
We have had to implement our own draining logic since we run a handful of "system" containers that collect metrics, logs, etc. from the host. Without some kind of support to kill these containers last, the native draining mode isn't usable for us.
Here are my observations:
ECS service (in a homemade blue/green deployment scheme) runs on EC2 Spot instances and has a 50/200 deployment policy (minimumHealthyPercent/maximumPercent). A dynamic port is registered in the target group. Once I get a Spot termination notice (120 seconds before instance termination), I run a Lambda (a heavily modified version of the one covered in https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/) that starts draining that particular target (instance plus dynamic port). Because the deregistration delay is set to 30 seconds, draining happens, and then (30 seconds later) the same ECS service task is running on the same EC2 Spot machine again and registered in the same target group. Obviously, setting the whole EC2 node to the DRAINING state would solve the issue for this particular ECS service.
ECS service with a 100/200 policy, three nodes on EC2 Spot. When a Spot node stops, I cannot simply set the node into the DRAINING state, because the service's task will not be deregistered from the target group - I need to force-deregister it using a Lambda (sketched below). This "force deregister" is not good because it does not drain connections. Ideally, I'd like to have something called "force draining": proper draining should happen in this scenario too.
Also, monitoring containers (run using the DAEMON service type) should leave the instance last.
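For reference, the "force deregister" in the second scenario above maps to a single ELBv2 call; a sketch with placeholder values (how gracefully connections drain depends on the target group's deregistration delay and on what happens to the task in the meantime, which is the gap being described):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Force-deregister a specific instance/port target from the target group.
# Target group ARN, instance id, and dynamic host port are placeholders.
elbv2.deregister_targets(
    TargetGroupArn=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "targetgroup/my-tg/0123456789abcdef"
    ),
    Targets=[{"Id": "i-0123456789abcdef0", "Port": 32768}],
)
```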
I would also like to see logging/monitoring agents able to run in ECS with higher priority, so they are evicted last or excluded from draining entirely - similar to priorityClassName and critical pods in Kubernetes:
https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
Thanks everyone for your comments! Regarding the scheduling of DAEMON services, that is being tracked under a separate issue: https://github.com/aws/containers-roadmap/issues/128
I'd also like to see automatic DRAINING support for spot instances, i.e. if an instance gets a spot termination notice, automatically set it to DRAINING.
We wrote a Lambda which does exactly this: https://github.com/getsocial-rnd/ecs-drain-lambda
It supports both the Spot Instance Interruption Notice and Auto Scaling lifecycle terminate hooks.
And yes, this is a complete rewrite of aws-samples/ecs-cid-sample but with more features.
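A minimal Python sketch of the Spot-notice half of that idea, assuming an EventBridge (CloudWatch Events) rule on the "EC2 Spot Instance Interruption Warning" detail type triggers the function; the cluster name is a placeholder and pagination is omitted:

```python
import boto3

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"  # placeholder


def handler(event, context):
    # The interruption warning arrives ~2 minutes before the Spot instance is
    # reclaimed, so drain the matching container instance immediately.
    ec2_instance_id = event["detail"]["instance-id"]
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return
    for ci in ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]:
        if ci["ec2InstanceId"] == ec2_instance_id:
            ecs.update_container_instance_state(
                cluster=CLUSTER,
                containerInstances=[ci["containerInstanceArn"]],
                status="DRAINING",
            )
            break
```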
I'd also like to see automatic DRAINING support for spot instances, i.e. if an instance gets a spot termination notice, automatically set it to DRAINING.
+1
We recently migrated to a mixed ASG per this blog, and we're also using the lifecycle hook to set the ECS host to DRAINING when the ASG scales in an instance. My big question right now is whether the ASG will be able to detect the autoscaling:EC2_INSTANCE_TERMINATING event for Spot instances within this mixed group.
The answer is yes - it will handle both ASG lifecycle hooks and Spot termination notices.
This is very useful. One additional comment: the ability to set a termination policy on the ASG such that it prioritizes terminating EC2 instances that have no tasks running. Even though setting the instance to DRAINING ensures that ECS schedules new tasks to meet the service's Minimum Healthy Percent / Desired Count, stopping and starting a task causes retries and timeouts. Moreover, freshly started tasks may not perform as well as old ones, which benefit from warm local caches, JIT compilation (on the JVM), etc.
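There is no built-in "terminate empty instances first" termination policy today; one way to approximate it (and roughly what managed termination protection with capacity providers automates) is to toggle instance scale-in protection based on task counts. A hedged sketch with placeholder names; note that runningTasksCount also counts DAEMON tasks, which you may want to subtract:

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "my-cluster"   # placeholders
ASG_NAME = "my-ecs-asg"


def sync_scale_in_protection():
    # Protect busy instances from scale-in so the ASG prefers idle ones.
    # runningTasksCount includes DAEMON tasks; subtract them in real code.
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return
    busy, idle = [], []
    for ci in ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]:
        (busy if ci["runningTasksCount"] > 0 else idle).append(ci["ec2InstanceId"])
    if busy:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=ASG_NAME, InstanceIds=busy, ProtectedFromScaleIn=True
        )
    if idle:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=ASG_NAME, InstanceIds=idle, ProtectedFromScaleIn=False
        )
```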
Yes! This is my most desired feature for ASG and ECS.
Would a DRAINING instance prevent other scaling activities on the ASG?
Say the ASG is scaling in because the desired capacity has decreased: a lifecycle termination hook is triggered, and part of that process is to DRAIN the ECS container instance. The instance's state in the ASG is TERMINATING:WAIT, and it waits until the instance has drained. However, if there is a new change and a scale-out rule fires, the ASG would not scale out because the lifecycle hook has not yet completed. Am I right?
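For reference, the mechanics in play: while the hook is in TERMINATING:WAIT, the instance stays alive until something completes the lifecycle action (or the heartbeat timeout expires, which can be extended with heartbeats). A sketch with placeholder names and token:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Extend the TERMINATING:WAIT period while tasks are still draining...
autoscaling.record_lifecycle_action_heartbeat(
    LifecycleHookName="drain-hook",
    AutoScalingGroupName="my-ecs-asg",
    LifecycleActionToken="11111111-2222-3333-4444-555555555555",  # from the hook notification
)

# ...and release the instance for termination once draining is done.
autoscaling.complete_lifecycle_action(
    LifecycleHookName="drain-hook",
    AutoScalingGroupName="my-ecs-asg",
    LifecycleActionToken="11111111-2222-3333-4444-555555555555",
    LifecycleActionResult="CONTINUE",
)
```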
Any update on this feature? Auto scaling alone does not help enough in the case of an ECS cluster.
This item should also consider service discovery / Cloud Map, so that tasks are deregistered from Cloud Map discovery when their instance is being drained.
Hi @coultn ,
We've recently run into this; we've sticky-taped a workaround together using a Lambda and a Step Functions state machine. Having done so, I'm now keen to find out when the feature is likely to become available, to try and minimise the lifetime of our homegrown approach.
I know you probably aren't able to give dates, but could you perhaps give us an indication of what stage this work is at?
Cheers!
How would this play out with cluster Capacity Providers? Is the plan (if any) to add this as a Capacity Provider feature, an Autoscaling Group feature, or something else?
@briancurt Have you taken a look at the "managed termination protection" feature when using capacity providers/managed scaling?
@joshuabaird Yeah, I'm using it in fact.
To be honest, I asked because Capacity Providers have been kinda underwhelming. In theory the concept is great, but I found the implementation very restrictive and overly complicated. So if automatic instance draining ends up being an ASG feature (without the need for Capacity Providers in between), I may consider moving off Capacity Providers and back to managing the cluster scaling myself. With automatic instance draining, doing that oneself would be much easier.
Ok - yeah, I agree that the CP implementation seems a bit overly complicated. The main reason we're considering using it is to allow us to scale based on network resources (ENI), etc.
Note: now that https://github.com/aws/containers-roadmap/issues/128 is live, the Python lambda in https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/ will likely always wait for the 900 second lifecycle hook timeout, rather than completing the ASG lifecycle hook once all tasks are terminated. Specifically since DAEMON tasks don't exit on DRAINING anymore.
@pavneeta I assume you are already onto it, based on assigning yourself yesterday :)
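If the blog Lambda's wait loop were updated for that behavior, it would need to treat an instance as drained once only DAEMON tasks remain, rather than waiting for zero tasks. A hedged sketch of such a count (placeholder cluster name; the per-task describe_services calls should be cached in real code):

```python
import boto3

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"  # placeholder


def non_daemon_task_count(container_instance_arn):
    # Count tasks on the instance whose owning service is not a DAEMON, so the
    # lifecycle hook can be completed once only daemons are left.
    task_arns = ecs.list_tasks(
        cluster=CLUSTER, containerInstance=container_instance_arn
    )["taskArns"]
    if not task_arns:
        return 0
    count = 0
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]:
        group = task.get("group", "")
        if group.startswith("service:"):
            service = ecs.describe_services(
                cluster=CLUSTER, services=[group[len("service:"):]]
            )["services"][0]
            if service.get("schedulingStrategy") == "DAEMON":
                continue  # daemons are expected to linger until last
        count += 1
    return count
```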
@CpuID, This doesn't seem to be the case, since according to this comment https://github.com/aws/containers-roadmap/issues/128#issuecomment-720262219 DAEMON will be stopped on DRAINING after all other tasks stopped.
ECS will now ensure that Daemon tasks are the last task to drain from an Instance - this will allow the monitoring daemons to pick up trailing logs or metrics
Aha, I read it wrong. Thanks for clarifying :)