Containers-roadmap: [ECS] [request]: Control which containers are terminated on scale in

Created on 21 Jan 2019 · 26 Comments · Source: aws/containers-roadmap

We use ECS for auto-scaling build agents for buildkite.com. We are using custom metrics and several Lambdas for scaling the ECS service that runs our agent based on pending CI jobs. Presently, when we scale in the DesiredCount on the service, the choice of which running containers get killed appears to be random. It would be great to have more control over this, either a customizable timeframe to wait for containers to stop after being signaled, or something similar to EC2 Lifecycle Hooks.

We're presently working around this by handling termination as gracefully as possible, but it often means cancelling an in-flight CI build, which we'd prefer not to do if other idle containers could be selected.
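
For context, the graceful handling we do looks roughly like the sketch below (illustrative Python, not our actual agent code): trap SIGTERM, stop accepting new jobs, and let any in-flight job finish before exiting.

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM when this task is picked during scale-in;
    # stop taking new jobs but let the current one finish.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def poll_for_job():
    # Placeholder: poll the CI queue for a pending job, or return None.
    return None

def run_job(job):
    # Placeholder: run a CI job to completion.
    pass

while not shutting_down:
    job = poll_for_job()
    if job is not None:
        run_job(job)   # an in-flight job still completes after SIGTERM arrives
    else:
        time.sleep(5)

sys.exit(0)  # exit before the stop timeout elapses, or the container gets SIGKILL
```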

ECS Proposed

Most helpful comment

The stop timeout does provide an ECS equivalent for EC2's termination lifecycle hook. However ECS is still missing an equivalent of EC2's instance protection, which would allow solving exactly the problem in this issue's title.

Using EC2 instance protection, you can mark some instances in an autoscaling group as protected from scale in. When scaling in, EC2 will only consider instances without scaling protection enabled for termination. By manipulating the instance protection flag, an application can control exactly which EC2 instances are terminated during scale-in. If ECS would add an equivalent "task protection" flag for ECS tasks, problems like the one @lox described would have a straightforward solution. You'd simply set the protection flag to on for tasks that are busy, and turn it off when a task is idle. When an ECS service was told to scale in, it would only be allowed to kill tasks with protection turned off.

I've been wrestling with a similar problem recently, and it would be very helpful if AWS would add a "task protection" feature.

All 26 comments

What we do is use StepScaling (instead of SimpleScaling), because once the ASG triggers the termination process, it doesn't block any further scaling activities. In addition, we have a lifecycle hook which sets the instance to DRAINING (in ECS) and waits until all tasks are gone (or the timeout is reached). It's based on this blog post: https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
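
Roughly, the hook works like the sketch below (simplified; not the exact code from the blog post, and the function and parameter names are illustrative): a Lambda puts the instance into DRAINING and only completes the lifecycle action once no tasks are left running.

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

def drain_and_release(cluster, container_instance_arn, hook_name, asg_name, instance_id):
    """Set the ECS container instance to DRAINING, then complete the
    lifecycle action once no tasks remain (the real version re-invokes
    itself until the task count reaches zero or the hook times out)."""
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )

    instance = ecs.describe_container_instances(
        cluster=cluster,
        containerInstances=[container_instance_arn],
    )["containerInstances"][0]

    if instance["runningTasksCount"] == 0:
        # Nothing left on the host: let the ASG proceed with termination.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            LifecycleActionResult="CONTINUE",
            InstanceId=instance_id,
        )
```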

Thanks @pgarbe, I'm not sure how that helps! I'm talking about scaling in ECS tasks when an ECS service gets a decreased DesiredCount.

What I understood is that you want to keep the EC2 hosts running as long as some tasks are running on them, right? Even when an instance is marked to be terminated by the Auto Scaling Group. Actually, you can't really control which EC2 instance gets terminated. But with the lifecycle hook I mentioned above, you can delay the termination until all tasks are gone.

Apologies if I've done a bad job of explaining myself @pgarbe, that is not at all what I mean. The autoscaling I am talking about is the autoscaling of ECS Tasks in an ECS Service, not the EC2 instances underneath them. As you say, there are a heap of tools for controlling the scale in and out of the underlying instances, but what I'm after are similar mechanisms for the ECS services.

Imagine you have 100 "jobs" that need processing, and you run "agents" to process those jobs as ECS tasks in an ECS service which is controlled by auto-scaling the DesiredCount. The specific problem I am trying to solve is how to intelligently scale in only the ECS tasks that aren't running jobs. Currently, lowering the DesiredCount on the ECS service seems to basically pick tasks at random to kill. I would like some control (like lifecycle hooks provide for EC2) to make sure that tasks finish their work before being randomly terminated.

Ok, got it. Unfortunately, in that case, I can't help you much.

I have this same issue. I am using _Target Tracking_ as my scaling policy, tracking CPU Utilization. So whenever it does a scale-in, it kills tasks for that service even if there are clients connected to them. I would love to know if there's a way to implement some kind of lifecycle hook or a draining status, so it will only kill a task once all connections are drained.

I think there are two things in ECS which can help with connection/job draining before the ECS task is stopped.

  • __ELB connection draining__: If the ECS service is attached to an ELB target group, ECS will ensure the target is drained in the ELB before stopping the task.
  • __Task stopTimeout__: ECS won't directly hard kill the container. Instead, it sends a stop signal and waits a configurable amount of time before forcefully killing it. The application can gracefully drain in-flight jobs during the shutdown process (see the sketch after this list).
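
For illustration, here is a minimal sketch of setting stopTimeout when registering a task definition with boto3 (names and values are illustrative; the agent/platform version requirements discussed below still apply):

```python
import boto3

ecs = boto3.client("ecs")

# Illustrative task definition: give the agent container up to 120 seconds
# after SIGTERM to drain in-flight jobs before ECS sends SIGKILL.
ecs.register_task_definition(
    family="ci-agent",
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "agent",
            "image": "buildkite/agent:latest",
            "memory": 512,
            "stopTimeout": 120,  # seconds between SIGTERM and SIGKILL
        }
    ],
)
```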

Are they able to handle your case? @lox @travis-south

Thanks @wbingli. Is there an option for ELB connection draining for ALBs? I can't seem to find it.

As for the _stopTimeout_, I'll try this and will give feedback.

Thanks.

Yeah, stopTimeout looks interesting for my use case too! I was in the process of moving away from Services to ad-hoc Tasks, but that might work.

I don't think _stopTimeout_ is in CloudFormation yet, or am I missing something?

I certainly hadn't heard of it before!

@lox @travis-south The documentation says startTimeout and stopTimeout are only available for tasks using Fargate in us-east-2. That's pretty narrow availability!

This parameter is available for tasks using the Fargate launch type in the Ohio (us-east-2) region only and the task or service requires platform version 1.3.0 or later.

I see, well, I think I'll resort to ECS_CONTAINER_STOP_TIMEOUT for now to test it.

@travis-south I think this is the documentation for configuring ELB connection draining: ELB Deregistration Delay. There is no need for any configuration on the ECS service side; it will always respect ELB target draining and stop the task once target draining has completed.
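
For reference, the deregistration delay is a target group attribute, so it can be tuned without touching the ECS service. A minimal sketch (the target group ARN is a placeholder):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Give in-flight requests up to 60 seconds to complete after the target
# is deregistered, before ECS stops the task.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-tg/123456",  # placeholder
    Attributes=[
        {"Key": "deregistration_delay.timeout_seconds", "Value": "60"},
    ],
)
```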

The stopTimeout feature is pretty new; it launched on March 7.

As for availability, it should be available in all regions when using the EC2 launch type (agent version 1.26.0+ required). The documentation is kind of misleading when it says "This parameter is available for tasks using the Fargate launch type in the Ohio (us-east-2) region only"; it actually means "For tasks using the Fargate launch type, it's only available in the Ohio (us-east-2) region and requires platform version 1.3.0 or later".

@wbingli thanks for the explanation, I'll try this. At the moment my _deregistration delay_ is 10 secs; I'll try increasing it and see what happens.

Hi everyone, the stopTimeout parameter is available for both ECS and Fargate task definitions. It controls how long the delay is between SIGTERM and SIGKILL. Additionally, if you combine this with the container ordering feature (also available on both ECS and Fargate), you can control the order of termination of your containers, and the time each container is allowed to take to shut down.
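
To illustrate, here is a sketch of combining the two in one task definition (names, images and timings are illustrative): the app container declares a dependsOn on the proxy, so the proxy starts first and, on shutdown, the app is stopped (within its own stopTimeout) before the proxy.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="web-with-proxy",
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "proxy",
            "image": "envoyproxy/envoy:v1.14.1",
            "memory": 256,
            "essential": True,
            "stopTimeout": 30,
        },
        {
            "name": "app",
            "image": "my-org/web-app:latest",   # placeholder image
            "memory": 512,
            "essential": True,
            "stopTimeout": 120,   # app gets up to 2 minutes to drain before SIGKILL
            # "app" starts only after "proxy" has started; shutdown order is reversed.
            "dependsOn": [{"containerName": "proxy", "condition": "START"}],
        },
    ],
)
```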

We are in the process of updating ECS/Fargate and CloudFormation docs to reflect the fact that these features are available in all regions where those services are available.

How would one disable SIGKILL entirely @coultn? Sometimes tasks might take 30+ minutes to finish.

You can't disable SIGKILL entirely, but you can set the value to a very large number (on ECS; there is a 2 minute limit on Fargate).

I tried increasing my _deregistration delay_ to 100 secs and it made things worse for my case. I received a lot of 5xx errors during deployments.

update: the stop timeout parameter is now documented in CloudFormation (see release history here: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/ReleaseHistory.html). However, the original issue is about controlling which tasks get selected for termination when a service is scaling in due to a scaling policy action. Per-container stop timeouts can help with that but won't provide a complete solution.

This basically brings things up to parity with lifecycle hooks on EC2, so I'd say this pretty much addresses my original issue. Happy to close this out, thanks for your help @coultn.

The stop timeout does provide an ECS equivalent for EC2's termination lifecycle hook. However ECS is still missing an equivalent of EC2's instance protection, which would allow solving exactly the problem in this issue's title.

Using EC2 instance protection, you can mark some instances in an autoscaling group as protected from scale in. When scaling in, EC2 will only consider instances without scaling protection enabled for termination. By manipulating the instance protection flag, an application can control exactly which EC2 instances are terminated during scale-in. If ECS would add an equivalent "task protection" flag for ECS tasks, problems like the one @lox described would have a straightforward solution. You'd simply set the protection flag to on for tasks that are busy, and turn it off when a task is idle. When an ECS service was told to scale in, it would only be allowed to kill tasks with protection turned off.

I've been wrestling with a similar problem recently, and it would be very helpful if AWS would add a "task protection" feature.
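
For comparison, here is what toggling that flag looks like on the EC2 side today (a minimal boto3 sketch; the group name is a placeholder). The idea is that ECS tasks would get an equivalent switch.

```python
import boto3

autoscaling = boto3.client("autoscaling")

def set_busy(instance_id, busy):
    """Protect this worker from scale-in while it is busy, and release
    the protection once it goes idle again."""
    autoscaling.set_instance_protection(
        AutoScalingGroupName="ci-workers",   # placeholder ASG name
        InstanceIds=[instance_id],
        ProtectedFromScaleIn=busy,
    )
```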

Is there a maximum value for stopTimeout?

In my case I would prefer stopTimeout to be 0, so the container is killed immediately, but apparently the minimum value is 2 seconds.

What is the reason for not allowing values of less than 2 seconds?
Where can I see documentation on the limits?

This would be a great value addition, since ECS-on-EC2 tasks are usually run for processes that need to be always up and running. In scenarios where the process cannot be stopped for hours because work is still running when SIGTERM is sent, this can mean incomplete tasks. It would be wonderful to see a managed solution to this, instead of us having to build an entire architecture around it and maintain the lifecycle of the process ourselves.

@ajenkins-cargometrics I 100% agree with your suggestion; ECS tasks running a job should be able to enable "task protection".
Can I ask how you would use that "instance protection" for EC2? In ECS you cannot know where the task will be placed.
Or are you enabling "instance protection" from inside the Docker container with the AWS CLI?
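
Not from this thread, but one possible approach is sketched below: the container discovers its own instance ID from the instance metadata service and flips the ASG protection flag itself. This assumes the task role allows autoscaling:SetInstanceProtection, the ASG name is passed in as an environment variable, and the container can reach 169.254.169.254 (IMDSv2 needs a hop limit of 2 or more for bridge-mode containers).

```python
import os

import boto3
import requests

ASG_NAME = os.environ["ASG_NAME"]  # assumed to be injected into the container

def my_instance_id():
    # IMDSv1-style call for brevity; IMDSv2 additionally requires a session token.
    return requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text

def protect(busy):
    # Mark the instance this task landed on as protected while work is running.
    boto3.client("autoscaling").set_instance_protection(
        AutoScalingGroupName=ASG_NAME,
        InstanceIds=[my_instance_id()],
        ProtectedFromScaleIn=busy,
    )
```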
