Request for comment: please share your thoughts on this proposed improvement to ECS!
With this improvement, ECS will automate instance and task draining.
Customers can opt in to automated instance draining for each of their clusters, using the AWS CLI, API, SDK, CloudFormation, or Console.
For ECS cluster instances that are members of an EC2 Auto Scaling Group (ASG), ECS will set the instance to the DRAINING state whenever the ASG initiates termination of the instance. (The existing behavior of DRAINING is that ECS prevents new tasks from being scheduled for placement on the container instance. Service tasks on the draining container instance that are in the PENDING state are stopped immediately. If there are container instances in the cluster that are available, replacement service tasks are started on them. Service tasks on the container instance that are in the RUNNING state are stopped and replaced according to the service's deployment configuration parameters, minimumHealthyPercent and maximumPercent.)
ECS will prevent the container instance from terminating until all service tasks on the instance have stopped (up to a maximum of 48 hours, based on how ASG instance lifecycle hooks work).
The functionality is similar to what was published in this blog. The main difference is that ECS will automate it for you.
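For anyone unfamiliar with the mechanics being automated here: today, putting an instance into the DRAINING state is a manual API call. A minimal boto3 sketch of the step the proposal would perform for you (the cluster name and container instance ARN are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Manually put a container instance into the DRAINING state -- the step this
# proposal would trigger automatically when the ASG starts terminating the
# instance. The cluster name and container instance ARN are placeholders.
ecs.update_container_instance_state(
    cluster="my-cluster",
    containerInstances=[
        "arn:aws:ecs:us-east-1:123456789012:container-instance/0123456789abcdef"
    ],
    status="DRAINING",
)
```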
Yes, this sounds great. Although we can manage removing instances from an ASG ourselves, it is a painful process, especially when the ASG tries to re-balance itself across AZs while terminating instances in bulk (typically during scale-down actions) and the placement policies are set up to do that.
This would be great. I have a slightly gnarly Lambda function, triggered by autoscaling termination hooks, that loops via SNS over and over until the instance no longer has any tasks running on it. While it works fine, it makes me a little uncomfortable every time I look at it.
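For readers who haven't seen that pattern: the lifecycle hook publishes to an SNS topic that triggers the Lambda, and the Lambda re-publishes the same message to re-invoke itself until the instance is empty. A minimal sketch of the idea (not the blog's exact code; names are placeholders, pagination and error handling omitted):

```python
import json

import boto3

ecs = boto3.client("ecs")
sns = boto3.client("sns")
autoscaling = boto3.client("autoscaling")

CLUSTER = "my-cluster"  # placeholder


def container_instance_for(ec2_instance_id):
    # Find the container instance backing this EC2 instance (pagination omitted).
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return None
    for ci in ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]:
        if ci["ec2InstanceId"] == ec2_instance_id:
            return ci
    return None


def handler(event, context):
    record = event["Records"][0]["Sns"]
    message = json.loads(record["Message"])
    if message.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return

    instance = container_instance_for(message["EC2InstanceId"])
    if instance is None:
        return

    if instance["status"] != "DRAINING":
        ecs.update_container_instance_state(
            cluster=CLUSTER,
            containerInstances=[instance["containerInstanceArn"]],
            status="DRAINING",
        )

    if instance["runningTasksCount"] > 0:
        # Still draining: re-publish the hook message so this function runs
        # again, rather than sleeping inside the Lambda.
        sns.publish(TopicArn=record["TopicArn"], Message=json.dumps(message))
    else:
        # Instance is empty: let the ASG finish terminating it.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=message["LifecycleHookName"],
            AutoScalingGroupName=message["AutoScalingGroupName"],
            LifecycleActionToken=message["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE",
        )
```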
Probably worth raising as a separate feature request, but in a slightly related vein, I'd also love to see daemonset services be the last to be evicted from a container instance. Right now our log shipper runs as a daemonset and is one of the first things to be stopped when a container instance is drained, meaning logs for the tasks still being connection-drained from the ALB are dropped.
We actually use a slightly modified version of the Lambda solution to do this; we would love to see this become standard functionality.
I think we used the blog post to set it up. There were some bugs in the solution that we had to fix. It has been stable and works well, but it would be great to have ECS manage this instead of needing the bolted-on Lambda solution.
Take it one step further: imagine being able to drain tasks from Fargate onto an ECS instance. This unlocks a number of different opportunities for both cost and density stories.
We have had to implement our own draining logic since we run a handful of "system" containers that collect metrics, logs, etc. from the host. Without some kind of support to kill these containers last, the native draining mode isn't usable for us.
Here are my observations:
ECS service (in a homemade blue/green deployment scheme) runs on EC2 Spot instances and has a 50/200 deployment policy (minimumHealthyPercent/maximumPercent). A dynamic port is registered in the target group. Once I get a Spot termination notice (120 seconds before instance termination), I run a Lambda (a heavily modified version of the one covered in https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/) that starts draining that particular target (instance plus dynamic port). Because the deregistration delay is set to 30 seconds, draining happens, and then (30 seconds later) the same ECS service task is running on the same EC2 Spot machine again and registered in the same target group. Obviously, setting the whole EC2 node to the DRAINING state would solve the issue for this particular ECS service.
ECS service with a 100/200 policy, three nodes on EC2 Spot. When a Spot node stops, I cannot simply set the node into the DRAINING state, because the service's task will not be deregistered from the target group - I need to force-deregister it using a Lambda (sketched below). This "force deregister" is not good because it does not drain connections. Ideally, I'd like to have something called "force draining": proper draining should happen in this scenario too.
Also, monitoring containers (run using the DAEMON service type) should leave the instance last.
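For reference, the "force deregister" in the second scenario above maps to a single ELBv2 call; a sketch with placeholder values (how gracefully connections drain depends on the target group's deregistration delay and on what happens to the task in the meantime, which is the gap being described):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Force-deregister a specific instance/port target from the target group.
# Target group ARN, instance id, and dynamic host port are placeholders.
elbv2.deregister_targets(
    TargetGroupArn=(
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "targetgroup/my-tg/0123456789abcdef"
    ),
    Targets=[{"Id": "i-0123456789abcdef0", "Port": 32768}],
)
```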
I would also like to see logging/monitoring agents able to run in ECS with higher priority, so they are evicted last or excluded from draining entirely - similar to priorityClassName and critical pods in Kubernetes:
https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
Thanks everyone for your comments! Regarding the scheduling of DAEMON services, that is being tracked under a separate issue: https://github.com/aws/containers-roadmap/issues/128
I'd also like to see automatic DRAINING support for spot instances, i.e. if an instance gets a spot termination notice, automatically set it to DRAINING.
We wrote a Lambda which does exactly this: https://github.com/getsocial-rnd/ecs-drain-lambda
It supports both the Spot Instance Interruption Notice and Auto Scaling lifecycle terminate hooks.
And yes, this is a complete rewrite of aws-samples/ecs-cid-sample but with more features.
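A minimal Python sketch of the Spot-notice half of that idea, assuming an EventBridge (CloudWatch Events) rule on the "EC2 Spot Instance Interruption Warning" detail type triggers the function; the cluster name is a placeholder and pagination is omitted:

```python
import boto3

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"  # placeholder


def handler(event, context):
    # The interruption warning arrives ~2 minutes before the Spot instance is
    # reclaimed, so drain the matching container instance immediately.
    ec2_instance_id = event["detail"]["instance-id"]
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return
    for ci in ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]:
        if ci["ec2InstanceId"] == ec2_instance_id:
            ecs.update_container_instance_state(
                cluster=CLUSTER,
                containerInstances=[ci["containerInstanceArn"]],
                status="DRAINING",
            )
            break
```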
I'd also like to see automatic DRAINING support for spot instances, i.e. if an instance gets a spot termination notice, automatically set it to DRAINING.
+1
We recently migrated to a mixed ASG per this blog, and we're also using the lifecycle hook to set the ECS host to DRAINING when the ASG scales in an instance. My big question right now is whether the ASG will be able to detect the autoscaling:EC2_INSTANCE_TERMINATING event for Spot instances within this mixed group.
The answer is yes - it will handle both ASG lifecycle hooks and Spot termination notices.
This is very useful. One additional comment: the ability to set a termination policy on the ASG such that it prioritizes terminating EC2 instances that have no tasks running. Even though setting the instance to DRAINING ensures that ECS schedules new tasks to meet the service's Minimum Healthy Percent / Desired Count, stopping and starting a task causes retries and timeouts. Moreover, freshly started tasks may not perform as well as old ones, which benefit from warm local caches, JIT compilation (on the JVM), etc.
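There is no built-in "terminate empty instances first" termination policy today; one way to approximate it (and roughly what managed termination protection with capacity providers automates) is to toggle instance scale-in protection based on task counts. A hedged sketch with placeholder names; note that runningTasksCount also counts DAEMON tasks, which you may want to subtract:

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "my-cluster"   # placeholders
ASG_NAME = "my-ecs-asg"


def sync_scale_in_protection():
    # Protect busy instances from scale-in so the ASG prefers idle ones.
    # runningTasksCount includes DAEMON tasks; subtract them in real code.
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return
    busy, idle = [], []
    for ci in ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]:
        (busy if ci["runningTasksCount"] > 0 else idle).append(ci["ec2InstanceId"])
    if busy:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=ASG_NAME, InstanceIds=busy, ProtectedFromScaleIn=True
        )
    if idle:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=ASG_NAME, InstanceIds=idle, ProtectedFromScaleIn=False
        )
```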
Yes! This is my most desired feature for ASG and ECS.
Would a DRAINING instance prevent other scaling activities on the ASG?
Say the ASG is scaling in because the desired capacity has decreased: a lifecycle termination hook is triggered, and part of that process is to DRAIN the ECS container instance. The instance's state in the ASG is TERMINATING:WAIT, and it waits until the instance has drained. However, if there is a new change and a scale-out rule fires, the ASG would not scale out because the lifecycle hook has not yet completed. Am I right?
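For reference, the mechanics in play: while the hook is in TERMINATING:WAIT, the instance stays alive until something completes the lifecycle action (or the heartbeat timeout expires, which can be extended with heartbeats). A sketch with placeholder names and token:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Extend the TERMINATING:WAIT period while tasks are still draining...
autoscaling.record_lifecycle_action_heartbeat(
    LifecycleHookName="drain-hook",
    AutoScalingGroupName="my-ecs-asg",
    LifecycleActionToken="11111111-2222-3333-4444-555555555555",  # from the hook notification
)

# ...and release the instance for termination once draining is done.
autoscaling.complete_lifecycle_action(
    LifecycleHookName="drain-hook",
    AutoScalingGroupName="my-ecs-asg",
    LifecycleActionToken="11111111-2222-3333-4444-555555555555",
    LifecycleActionResult="CONTINUE",
)
```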
Any update on this feature? Auto scaling alone does not help enough in the case of an ECS cluster.
This item should also consider service discovery / Cloud Map, so that tasks are deregistered from Cloud Map discovery when their instance is being drained.
Hi @coultn ,
We've recently run into this; we've sticky-taped a workaround together using a Lambda and a Step Functions state machine. Having done so, I'm now keen to find out when the feature is likely to become available, to try and minimise the lifetime of our homegrown approach.
I know you probably aren't able to give dates, but could you perhaps give us an indication of what stage this work is at?
Cheers!
How would this play out with cluster Capacity Providers? Is the plan (if any) to add this as a Capacity Provider feature, an Autoscaling Group feature, or something else?
@briancurt Have you taken a look at the "managed termination protection" feature when using capacity providers/managed scaling?
@joshuabaird Yeah, I'm using it in fact.
To be honest, I asked because Capacity Providers have been kinda underwhelming. In theory the concept is great, but I found the implementation very restrictive and overly complicated. So if automatic instance draining ends up being an ASG feature (without the need for Capacity Providers in between), I may consider moving off Capacity Providers and back to managing the cluster scaling myself. With automatic instance draining, doing that oneself would be much easier.
Ok - yeah, I agree that the CP implementation seems a bit overly complicated. The main reason we're considering using it is to allow us to scale based on network resources (ENI), etc.
Note: now that https://github.com/aws/containers-roadmap/issues/128 is live, the Python lambda in https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/ will likely always wait for the 900 second lifecycle hook timeout, rather than completing the ASG lifecycle hook once all tasks are terminated. Specifically since DAEMON tasks don't exit on DRAINING anymore.
@pavneeta I assume you are already onto it, based on assigning yourself yesterday :)
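If the blog Lambda's wait loop were updated for that behavior, it would need to treat an instance as drained once only DAEMON tasks remain, rather than waiting for zero tasks. A hedged sketch of such a count (placeholder cluster name; the per-task describe_services calls should be cached in real code):

```python
import boto3

ecs = boto3.client("ecs")

CLUSTER = "my-cluster"  # placeholder


def non_daemon_task_count(container_instance_arn):
    # Count tasks on the instance whose owning service is not a DAEMON, so the
    # lifecycle hook can be completed once only daemons are left.
    task_arns = ecs.list_tasks(
        cluster=CLUSTER, containerInstance=container_instance_arn
    )["taskArns"]
    if not task_arns:
        return 0
    count = 0
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]:
        group = task.get("group", "")
        if group.startswith("service:"):
            service = ecs.describe_services(
                cluster=CLUSTER, services=[group[len("service:"):]]
            )["services"][0]
            if service.get("schedulingStrategy") == "DAEMON":
                continue  # daemons are expected to linger until last
        count += 1
    return count
```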
@CpuID, This doesn't seem to be the case, since according to this comment https://github.com/aws/containers-roadmap/issues/128#issuecomment-720262219 DAEMON will be stopped on DRAINING after all other tasks stopped.
ECS will now ensure that Daemon tasks are the last task to drain from an Instance - this will allow the monitoring daemons to pick up trailing logs or metrics
Aha, I read it wrong. Thanks for clarifying :)