Containers-roadmap: [ECS] One task prevents all others on an instance from changing from PENDING to RUNNING

Created on 1 Nov 2018 · 36 comments · Source: aws/containers-roadmap

Hello there.
We've lately been facing strange behavior in ECS, where stopping tasks prevent new tasks from running on an instance.

A little about our case -
We have some tasks that need to finish their current work and then exit on their own after a StopTask command. In other words, we have a graceful-shutdown process that sometimes takes a while to complete (more than a few seconds, and occasionally several minutes).

However, when a StopTask is sent to these tasks, they no longer appear in the task list in the ECS console (which is fine), but they also block all other tasks on the same instance that are trying to change their state from PENDING to RUNNING.

Here is an example of one instance's tasks when it happens:
[image: task list for a single container instance]

Why does this behavior happen? Why should one task prevent others from running next to it until it is done? This is bad resource management (we don't use the full capacity of our instances while tasks are pending).

The best solution would be for the stopping task to keep appearing in the console until it has really stopped on the instance, and for the PENDING-to-RUNNING transition not to be affected by other tasks on the same instance.

I hope you can fix that behavior,

Thanks!

ECS Proposed


All 36 comments

This is working by design. Our scheduler assumes that while a task is stopping (or exiting gracefully) it will still use its requested resources. For example, a webserver would still need its resource pool while waiting for all connections to terminate. The alternative would be to allocate fewer resources when a task transitions to stopping, but that's not a safe assumption to make across all workloads.

Would configurable behavior help your use case? Or is it sufficient to be more clear about this behavior in the console?

Hey @petderek.
Thank you for your response.

I understand that it works like that by design. However, I wonder why one task should prevent all the others from running.

The best configurable behavior for us would be task-by-task handling: keep accounting for the resources held by the stopping task (which is fine and reasonable), but don't prevent other tasks from running on the instance while it still has resources to give.

In our use case, the tasks run a long-polling workload as a service, not a web client. This behavior keeps our instances from being filled in time and can also stall a new deployment, because instances wait for one task to end before the other tasks are allowed to run.

So the instance is effectively in a kind of "disabled" or "draining" state until the long workload is done (and that can take some time).

What can we do so that our use case is supported in ECS?

Thanks

Hi!
Is there any update on this?
Is there any solution or workaround we could use to get the task-by-task behavior?

Thank you in advance!

Hi everyone,

we are facing the exact same issue. @Alonreznik is right, one task is blocking all other tasks and in my opinion this does not make sense. Let me illustrate:

Assume we have one task with 10 GB memory reservation running on a container instance that has registered with 30 GB. The container instance shows 20 GB of RAM available, and that is correct. Now this task is stopped (the ECS agent will have Docker send a SIGTERM) but the container keeps running to finish its calculations (it now shows under stopped tasks as "desired status = STOPPED" and "last status = RUNNING"). The container instance will now show 30 GB available in the AWS ECS console, which is nonsense; it should still be 20 GB, since the container is still using resources as @petderek mentioned. Even worse, if we try to launch three new tasks with 10 GB memory reservation each, they will all be pending until the still-running task transitions to "last status = STOPPED". Expected behavior would be that two out of the three tasks can launch immediately.

I hope my example was understandable, else feel free to ask.
And thanks for looking into this :)
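
To make the mismatch described above easy to observe outside the console, here is a minimal boto3 sketch (the cluster name and container instance ARN are placeholders, not values from this thread): it lists tasks whose desired status is already STOPPED but whose last status is still RUNNING, and prints the remaining resources ECS reports for the instance, which per the behavior above no longer account for those tasks.

import boto3

ecs = boto3.client("ecs")
CLUSTER = "my-cluster"  # placeholder cluster name
INSTANCE = "arn:aws:ecs:region:account:container-instance/example"  # placeholder ARN

# Tasks the scheduler has already "released" but that may still be running.
stopping = ecs.list_tasks(cluster=CLUSTER, containerInstance=INSTANCE,
                          desiredStatus="STOPPED")["taskArns"]
if stopping:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=stopping)["tasks"]:
        print(task["taskArn"], task["desiredStatus"], task["lastStatus"])

# Remaining resources as ECS reports them for the instance.
instance = ecs.describe_container_instances(
    cluster=CLUSTER, containerInstances=[INSTANCE])["containerInstances"][0]
for res in instance["remainingResources"]:
    if res["name"] in ("CPU", "MEMORY"):
        print(res["name"], res["integerValue"])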

Hey! As a workaround, you can set ECS_CONTAINER_STOP_TIMEOUT to a smaller number. This configures the 'Time to wait for the container to exit normally before being forcibly killed'. By default, it is set to 30s. More information can be found here. I marked this issue as a feature request and we will work on it soon.

Hi @yumex93
Thank you for your response.
We will be very happy to have that feature as soon as it's out :)

About your workaround - in most cases, we need our containers to perform a graceful shutdown before they die. Decreasing ECS_CONTAINER_STOP_TIMEOUT would cause our workers to be killed before the shutdown completes, so the feature is very much needed :)

Thank you again for your help; we're waiting for updates on this.

Alon

@Alonreznik, @yumex93 We have the same situation: some workers even take a few hours to complete their task, and we've leveraged ECS_CONTAINER_STOP_TIMEOUT to shut those down gracefully as well. Since ECS differentiates between a "desired status" and a "last status" for tasks, I believe it should be possible to handle tasks that are in the process of shutting down a bit better than how it works today. For an illustration of what I mean, see this screenshot:

[image: ecs-bug screenshot]

The tasks are still running and still consume resources, but the container instance does not seem to keep track of those resources. If this is more than just a display issue, I expect it to cause problems, e.g. like the one above.

Hi @yumex93, any update on this issue?

Thanks

Alon

We are aware of this issue and are working on prioritizing it. We will keep it open for tracking and will provide an update when we have more solid plans.

Hi @yunhee-l.
Thank you for your last response.
We're still facing this issue, which forces us to launch more servers than we need in our deployments and leaves our workloads stuck.
Any update on this?

Thanks

Hi @yunhee-l @FlorianWendel
any update?

We don't have any new updates at this point. We will update when we have more solid plans.

Hi,

Just wanted to add our experience in the hope that this can be bumped up in priority.

We need to run tasks that can be long-running. With this behaviour as it stands, it essentially locks up the EC2 instance so that it cannot take any more tasks until the first task has shut down (which could be a few hours). It wouldn't be quite so bad if ECS marked the host as unusable and placed tasks on other hosts, but it doesn't; it still sends them to the host that cannot start them. This has the potential to cause us a service outage, in that we cannot create tasks to handle workload (we tell the service to create tasks, but it can't due to the lock-up).

Thanks.

@petderek @yumex93
This really makes us pay for more resources than we need on each deployment. As you can see, more than one user suffers from this design decision.

Do you have any ETA for implementing it or deploying it? This is a real blocker for our ongoing processes.

Thank you

Alon

@Alonreznik: Thanks for following up again and communicating the importance of getting this resolved. This helps us prioritize our tasks.

We don't have an ETA right now, but we have identified the exact issue and have a path forward that requires changes to our scheduling system and the ECS agent. To give you some more context, as @petderek said earlier:

This is working by design. Our scheduler assumes that while a task is stopping (or exiting gracefully) it will still use its requested resources.

Changing this behavior will be a departure from our existing way of accounting for resources when we schedule tasks. Considering that the current approach has been in place since the beginning of ECS, the risks involved in changing it are significant, as there could be subtle rippling effects in the system. We plan to explore ways to validate this change and ensure we do not introduce regressions.

The original design made a trade-off towards oversubscribing resources for placement by releasing resources on the instance when tasks were stopped, but the side effect of that is the behavior you are describing. Additionally, now that we've added granular SIGKILL timeouts for containers with #1849, we can see this problem being exacerbated.

All that is to say: we're working on this issue and will update this thread as we work towards deploying the changes.
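
For reference, the container-level stop timeout mentioned above (#1849) is set per container in the task definition. A minimal boto3 sketch, with a hypothetical family, container name, and image (not values from this thread; limits on the timeout may apply depending on the platform):

import boto3

ecs = boto3.client("ecs")
ecs.register_task_definition(
    family="long-draining-worker",          # hypothetical family name
    requiresCompatibilities=["EC2"],
    containerDefinitions=[{
        "name": "worker",                   # hypothetical container name
        "image": "example.com/worker:latest",
        "memoryReservation": 1024,          # MiB
        "essential": True,
        # Seconds to wait between SIGTERM and SIGKILL for this container.
        "stopTimeout": 900,
    }],
)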

@adnxn
Thank you for your detailed explanation. It helps a lot in understanding the context of the situation.

We of course understand that this is built into the design, and we accept that.

However, I don't think our request calls for such a radical change to the core system (which is great!!). Our request concerns the ecs-agent's assumption that all of the resources held by the previous tasks must be released on the instance before new ones start; we're just asking for that to be handled per task (and also for some indication that a task is still running on the instance after it got the SIGTERM).

As it looks today, resource handling and releasing are based on the entire instance, not on the individual tasks running on it. If a task releases its resources, the ecs-agent should allow those resources to be scheduled for new tasks (if they meet the resource requirements).

Thank you for your help!
Much appreciated!

Please keep us posted,

Alon

Hello,
we are affected by exactly the same issue. We have an ECS service deploying long-polling workers with stopTimeout set to 2 hours. A task in RUNNING state with desired status STOPPED blocks all new tasks scheduled on the same instance, even when there are free resources available.

Adding new instances to the cluster helped us work around this situation, but it can be really costly when there are multiple deploys each day.

Are there any new updates about this issue, or possible workarounds?

It could definitely be solved by removing the long-polling service and switching to just calling ECS RunTask (process one job and terminate) without waiting for the result, but that would require more changes to our application architecture and would also couple us more tightly to ECS.

thanks
Martin
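
For reference, the "process one job and terminate" pattern Martin mentions above boils down to a plain RunTask call per job. A minimal boto3 sketch with hypothetical names (the cluster, task definition, container name, and command are placeholders, not values from this thread):

import boto3

ecs = boto3.client("ecs")
ecs.run_task(
    cluster="my-cluster",                   # hypothetical cluster name
    taskDefinition="job-worker",            # hypothetical task definition
    launchType="EC2",
    count=1,
    overrides={"containerOverrides": [{
        "name": "worker",                   # hypothetical container name
        "command": ["python", "process_job.py", "--job-id", "1234"],
    }]},
)
# The task runs the single job and exits; nothing long-lived is left to drain
# when a deployment or scale-in happens.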

Hi @coultn @adnxn
Any update or ETA about that?

Thank you

Alon

Hi guys,
Can somebody take a look at this?
This is harming our business, because we have a problem deploying new versions to our prod. It is really problematic, and it casts a dark shadow over our continued use of ECS.

Thanks

@coultn

Hi, thank you for your feedback on this issue. We are aware of this behavior and are researching solutions. We will keep the github issue up to date as the status changes.

Hi @coultn .
Thanks for your reply.

We must say this prevents our workloads from growing in line with our tasks, and there are situations where this behavior actually gets our production servers stuck. Again, this can be a no-go (or a no-continue, in our case) for using ECS in prod.

For example, you can see a typical production workload's desired/running gap below.
[image: graph of desired vs. running vs. pending tasks]

The green layer is the gap between the desired tasks and the running tasks (orange layer). The blue layer is the PENDING tasks in the cluster. You can see a constant gap between these two parameters. No deployment was made today; this is something we're encountering in the scale-up mechanism itself.

Think about the situation we're encountering. We have new jobs in our queue (SQS), and therefore we ask ECS to run new tasks (meaning the desired task count increases).
Each workload is a task in ECS, and all of them are split between the servers.
When some workload takes a while to complete (and there are many of these, because we ask each workload to finish its current job before it dies), one workload blocks the entire instance from receiving new workloads, even when there are free resources on the instance.

The ECS agent schedules new workloads onto that instance and then hits the one task that is still working. As far as the scheduler is concerned, it did its job: it placed new tasks. But those tasks are stuck in the PENDING state, for hours in some cases, which makes the instance unusable because they're simply not running yet. Now consider that you need to launch 100 more tasks within a few hours to work through a queue of quick jobs, and you have 5-6 instances each blocked by one task; it becomes a mess.

We must also say that we've only encountered this in the last year, after an agent upgrade a year or a year and a half ago.

Every day we need to request more instances for our workloads just to unblock them. This is not how a production service on AWS should have to be maintained, and we're facing this again and again, every day.

Please help us continue using ECS as our production orchestrator. We love this product and want it to succeed, but as it stands, it doesn't fit long-running tasks.

Your help in expediting this with your team would be much appreciated.

Thank you

Alon

I've discussed this with Nathan; he told me that they plan to fix this, but unfortunately there is no quick fix. We have similar issues with deployment and scaling, and because of them a lot of unnecessary rolls of new instances.

Meanwhile we are experimenting with EKS (also for multi-cloud deployment), where this issue isn't present.

Hi @Halama .

Thanks for the reply and the update.

I understand this is not something that can be solved quickly, but in the meantime the ECS team could provide workarounds, such as a binpack placement method that prefers the newest instances, or a limit on how long a task may sit in the PENDING state before it is tried on another instance. This issue deserves a response, since many users are encountering it. It has been open for more than a year and they can't give any reasonable ETA (even 3 months would be good for us); it only moved to "researching" in the last week.

Can you please share more about your migration process from ECS to EKS?

Thanks again

Alon

@Alonreznik, would you be willing to share more details about your specific configuration via email? ncoult AT amazon.com.
Thanks

Hi @coultn,
Thanks for the offer.
I've just sent a detailed email with our architecture and configuration, and the problem we're facing.

Thanks, and appreciated!

Alon

Hi Guys.
We've actually faced this again today in prod, where there was a huge gap for almost an hour between the desired and the running task counts. This is not something we can rely on, and it casts a big shadow over using ECS in production for our main system, because it means the PENDING tasks just get the entire instance stuck, and we need to refresh the entire task placement every time it happens.

In the next graph, you can see the gaps:

[image: graph of desired vs. running task counts]

The green line is the desired task count per minute, and the orange line is the actual number of running tasks. At one point, more than 70 tasks were asked to start and got stuck because of one (!!!) task still running on each instance. We also don't have the ability to place new tasks on new instances only, so there is nothing we can do about it.

This is a big shortcoming of the service and makes it unstable in our view.
Please fix it as soon as possible,

Alon

@Alonreznik Yes, this is a known issue (as we have discussed previously). Here is one solution that we are considering implementing soon:

  1. Introduce a new subject field for the ECS cluster query language, called stoppingTasksCount. This would be similar to the existing field called runningTasksCount, except that it would be the count of tasks on an instance that are in the STOPPING state.
  2. With this new subject field, you could use a placement constraint of this form:
"placementConstraints": [
    {
        "expression": "stoppingTasksCount == 0",
        "type": "memberOf"
    }
]

This placement constraint would prevent the task from being placed on any instance that has a stopping task. So, new tasks would only be placed on instances with no stopping tasks. Please let us know if you have questions/comments about this proposed solution.
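
For illustration only: if the proposed stoppingTasksCount field shipped, the constraint above could be attached at RunTask time (or on a service) roughly like the boto3 sketch below. Note that stoppingTasksCount is only the proposal above, not an attribute that exists in the cluster query language today, and the other names are placeholders.

import boto3

ecs = boto3.client("ecs")
ecs.run_task(
    cluster="my-cluster",                   # placeholder cluster name
    taskDefinition="job-worker",            # placeholder task definition
    launchType="EC2",
    count=1,
    placementConstraints=[{
        "type": "memberOf",
        # Proposed cluster query language field; not available today.
        "expression": "stoppingTasksCount == 0",
    }],
)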

Hi @coultn, thanks for the update on this. How would this work in the following example scenario?

I have 1 ECS cluster with 10 EC2 instances, and ECS_CONTAINER_STOP_TIMEOUT is set to 6 hours. I have a service with 10 tasks that are distributed evenly over all 10 nodes (1 task per node). I tell the service to scale in to 0 tasks, but the tasks are currently busy, so they stay in a stopping state until they have finished their work (or the 6 hours expire). I then try to create another service on the same cluster while these long-running tasks are still completing, but, despite there potentially being lots of CPU and memory available, the other service is not able to start? In my mind this still means the whole cluster is essentially locked for up to 6 hours. Or am I misunderstanding the proposed solution here?

I don't know the full details of how ECS works, but it seems more sensible to me to base the allocation of tasks to hosts on the actual free CPU and memory of each host (total CPU/RAM minus the resources of running and stopping tasks). If there are tasks stopping on a host, you should still be able to start tasks on that host if there is sufficient resource; when the tasks finally stop, you tell ECS the CPU/RAM has become available.
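
A rough sketch of the accounting idea above, assuming (per the behavior described in this thread) that ECS has already added stopping tasks' memory back into remainingResources. The cluster name and instance ARN are placeholders, and only task-level memory reservations are counted here:

import boto3

ecs = boto3.client("ecs")
CLUSTER = "my-cluster"  # placeholder
INSTANCE = "arn:aws:ecs:region:account:container-instance/example"  # placeholder

instance = ecs.describe_container_instances(
    cluster=CLUSTER, containerInstances=[INSTANCE])["containerInstances"][0]
remaining_mem = next(r["integerValue"] for r in instance["remainingResources"]
                     if r["name"] == "MEMORY")

# Memory of tasks that were told to stop but are still running on the instance.
stopping = ecs.list_tasks(cluster=CLUSTER, containerInstance=INSTANCE,
                          desiredStatus="STOPPED")["taskArns"]
stopping_mem = 0
if stopping:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=stopping)["tasks"]:
        if task["lastStatus"] == "RUNNING" and task.get("memory"):
            stopping_mem += int(task["memory"])  # task-level memory, in MiB

# "Actual free" in the sense proposed above: don't hand the stopping tasks'
# memory to new tasks until those containers have really exited.
print("effective free memory (MiB):", remaining_mem - stopping_mem)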

@tom22222 In your specific scenario, I would recommend (1) sizing the instances as close as possible to the task size (since you are only running 1 task per instance) and (2) enabling cluster auto scaling - this will cause your cluster to scale out to accommodate the new service automatically, assuming the placement constraint feature is implemented as I proposed above.

However, we may also implement additional changes to account for stopping tasks’ resources as you proposed, but at the present time the placement constraint approach will be quicker to launch. The reason for this is that the idea you proposed above would change the default behavior of task placement for all customers, even if they are not having the problem described in this issue. We typically approach changing default behavior for all customers more slowly than a feature that doesn’t change default behavior but which can be opted into for those customers that need it (such as the placement constraint approach I proposed above).

Hi @coultn

Thanks for the response. So on your suggestions:

I would recommend (1) sizing the instances as close as possible to the task size (since you are only running 1 task per instance)

I just gave that as an example of how one service could be configured; we could be running lots of other services with tasks of various sizes on the cluster at the same time, and the issue would still be the same.

(2) enabling cluster auto scaling

Yes, I agree, using the new Capacity Provider functionality along with your proposed change here could work and should allow new services to be started. The drawbacks are that it would take some time to spin up additional EC2 instances to handle the new task requirements, and all of the existing EC2 instances remain locked despite having plenty of CPU/RAM available (so we would have poor scaling performance across the whole cluster while wasting resources). Don't get me wrong, I do think this is better than what we have now, but it feels more like a workaround with some drawbacks.

We typically approach changing default behavior for all customers more slowly than a feature that doesn’t change default behavior but which can be opted into for those customers that need it

I completely agree there would be a risk in changing the default behaviour, so I would suggest having it as another option where the default is the current behaviour:

ECS_CONTAINER_STOPPING_TASK_PROCESSING=completed (this would be the current way it works)
ECS_CONTAINER_STOPPING_TASK_PROCESSING=immediate (this would allow other tasks to be started even if there were tasks in a stopping state)

(I'm sure you can think of better option names but hopefully you get my point)

Thanks

The issue also affects DAEMON services. Daemon containers don't start on a new instance when there are RUNNING daemon containers with desired status STOPPED on DRAINING instances.
I think this is because the automatic desired count of a DAEMON service also counts RUNNING containers with desired status STOPPED.

I'm also trying to take advantage of a long stopTimeout. Preventing new tasks from running only makes sense when the stopTimeout is only a few seconds. Now that stopTimeout can be very long, having stuck ECS hosts is silly.

I propose that tasks should not deregister from the service until after the container has actually exited. The task should enter a new state called STOPPING once the initial stop has been sent to the container. That way the resources are still tracked and new tasks can be scheduled appropriately. After the container exits, the task can move to STOPPED and be deregistered.

Any update on this one?

@coultn any update here?
Is there anything that might be an option?

This is a 2-year-old issue!
