Amazon-ecs-agent: Feature request: Instance affinity/Container rebalance

Created on 11 Mar 2016 · 19 Comments · Source: aws/amazon-ecs-agent

Hello,

Having the ability to spread out containers over a cluster as best as possible would be awesome for HA.

Currently, it seems that ECS allocates tasks to a random instance and sometimes puts all tasks of a specific task definition on one instance. If that instance were ever to fail, which we all know they do more often than not in the cloud, there are a few minutes of downtime. ECS takes some time to realize that it needs to spin up the failed tasks on a different instance, and the time to pull the image (if it's not pulled already) and run it on that other instance adds to the outage.

If services had an option to have their tasks distributed over the available instances, or did so natively, this would greatly increase HA for those services.

kind/feature request

djenriquez

👍8

Most helpful comment

Hi all, can we get any information on this feature request? Services being HA is a non-functional requirement that, in my opinion, should be handled with the highest priority. If I can't guarantee service uptime with ECS, it defeats the whole purpose of me using it for production.

Please advise.

djenriquez on 11 Aug 2016

👍4 😕1

All 19 comments

The scheduler is already supposed to spread tasks for availability. According to the docs, the instance is chosen this way:

1. First determine which instances can support your service's task definition in terms of resources.
2. Then sort instances by fewest number of running tasks in the same Availability Zone.
3. Among these instances favor instances with the fewest number of running tasks for this service.
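
The three steps above can be sketched as follows. This is only an illustration of the documented ordering, not the actual ECS scheduler: the `Instance` class, the `pick_instance` function, and the choice to count only this service's tasks per Availability Zone are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Instance:
    # Hypothetical model of a container instance, for illustration only.
    instance_id: str
    az: str
    free_cpu: int
    free_memory: int
    # service name -> number of running tasks of that service on this instance
    tasks: Counter = field(default_factory=Counter)


def pick_instance(instances, service, cpu, memory):
    """Return the instance a spread-style scheduler might choose, or None."""
    # Step 1: keep only instances with enough free resources for the task.
    candidates = [i for i in instances if i.free_cpu >= cpu and i.free_memory >= memory]
    if not candidates:
        return None

    # Step 2: count this service's running tasks per Availability Zone
    # (assumption: only tasks of this service are counted).
    az_load = Counter()
    for inst in instances:
        az_load[inst.az] += inst.tasks[service]

    # Step 3: prefer the least-loaded AZ, then the instance in it running the
    # fewest tasks of this service; the instance id is only a tie-breaker.
    return min(candidates, key=lambda i: (az_load[i.az], i.tasks[service], i.instance_id))
```

Note that a function like this only runs at placement time, which matches the behaviour discussed in this thread: nothing re-evaluates the choice when instances later join or leave the cluster.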

So this should do the trick, but I agree with @djenriquez that we often end up in a situation where instances have no tasks running on them while some have 3 or 4.

This might just be a result of instances coming in and out of the cluster, since the placement logic only runs when the task first starts. It would be nice to encourage the spread of containers across newly arriving instances in the cluster while maintaining availability.
One workaround is to temporarily increase the container count and then lower it back to normal; I assume the leftover containers would be set up in the most spread-out way across the cluster.

Or are we missing something about the way containers are scheduled?

bpascard on 11 Mar 2016

Nope, you're right @bpascard, my request was unclear. What I was asking for was actually what you described, a feature where ECS will rebalance whenever the state of the cluster changes.

Also, according to your docs, it appears ECS sorts by AZs and ends there. I believe that it should distribute amongst instances within the AZs to handle HA from instance failure as well.

djenriquez on 11 Mar 2016

It does distribute across instances within the AZ (see the 3rd point). But it only does so once, when the task runs, and it doesn't follow the evolution of your cluster.
What you are talking about is referenced in issue #225.

bpascard on 11 Mar 2016

I agree "smarter allocation" is key, but there is a difference between the two issues: #225 is focused solely on resource allocation, while the purpose of this issue is service availability.

Also, the third point does not take into account task-definition, rather simply which instances in the AZ have fewer running tasks. Even if a second or third instance has no running tasks, and the first one does, it would be good to try to run the task on the first instance to spread the service HA.

djenriquez on 11 Mar 2016

The doc says it "favors instances with the fewest number of running tasks for this service" so it does take into account the task definition as it spreads the containers of the same service across instances. It's just not done dynamically after you change the cluster size which would be good to have.

bpascard on 11 Mar 2016

Ah, gotcha, thanks for clarifying @bpascard

djenriquez on 11 Mar 2016

👍1

You can confuse the agent, though, by making an instance fail: it doesn't replace dead containers (in the 45 mins I watched).

Try it on a dev cluster: go to an instance, run sudo shutdown, and watch as 1/N of your containers are shut down and not put on another instance.

MaerF0x0 on 14 Apr 2016

@MaerF0x0 well, that's a problem haha.

Just following up, any updates on this feature? Our main goal is increasing HA.

djenriquez on 16 May 2016

@djenriquez you'll probably get their canned response that AWS does not comment on what may or may not be built in the future, though the thing I reported is definitely a bug.

MaerF0x0 on 16 May 2016

Any updates on this feature request?

Because rebalance doesn't exist, we're forced to run 3 instances at minimum to maintain a 1-server failure availability. We are literally paying for an instance that does nothing just to enable HA, for every cluster; a bit troublesome.

djenriquez on 12 Jul 2016

👍3 😕1

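Hi all, can we get any information on this feature request? Services being HA is a non-functional requirement that, in my opinion, should be handled with the highest priority. If I can't guarantee service uptime with ECS, it defeats the whole purpose of me using it for production.

Please advise.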

djenriquez on 11 Aug 2016

👍4 😕1

Still looking forward to this feature. I've just noticed that 5 out of my 20 instances are not running any services, while the other 15 are running very hot. What's my actionable here?

mikhail on 18 Oct 2016

Same problem as @mikhail, especially with ALB now live: we can have M containers running on N nodes (M > N), but often one node gets 1 container while most others have 3, for example.

mcansky on 3 Nov 2016

Also an issue for us in our production workloads. We've actually had situations where we're not in HA w/ 2 containers because they ended up on the same box.

I'm pretty sure these are the steps to reproduce:

1. Get yourself a fairly loaded cluster (say 2 boxes running at 90% reservations).
2. Add a box to the cluster (now you have 2 boxes that can't take many more containers and 1 that can take a lot more).
3. Scale up a service such that most of its tasks can only go on 1 of the instances.
3a. Note we're already in a bad place because the ecs-agent/scheduler doesn't rebalance the containers.
4. Continue this incremental EC2 instance + scale up of new services and you'll end up with most of your containers concentrated.
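
A toy simulation of the steps above shows why the tasks pile up. It is a rough sketch, not the real scheduler: the `place` function, the box names, and the use of a single "percent of capacity" number per instance are assumptions made for illustration.

```python
def place(free, counts, need):
    """Place one task on the fitting instance that runs the fewest copies."""
    # Only instances with enough unreserved capacity are eligible.
    fits = [name for name, capacity in free.items() if capacity >= need]
    if not fits:
        return None
    # Among eligible instances, spread by current task count (name breaks ties).
    target = min(fits, key=lambda name: (counts[name], name))
    free[target] -= need
    counts[target] += 1
    return target


# Two boxes at 90% reservation plus one freshly added, empty box.
free = {"box-1": 10, "box-2": 10, "box-3": 100}  # percent of capacity unreserved
counts = {name: 0 for name in free}

# Scale a service up by four tasks that each need 20% of a box.
placements = [place(free, counts, need=20) for _ in range(4)]
print(placements)  # ['box-3', 'box-3', 'box-3', 'box-3']
```

Even though the placement rule tries to spread, the resource filter leaves only the new box eligible, so every new task lands there, and nothing moves them later when capacity frees up elsewhere.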

My best workarounds are to either A) Scale down and then back up services (good for asynchronous workers) or B) Add a large number of instances (ex: double your cluster), scale the services up a lot (they will balance across the new instances) and then deregister the old instances + scale down to the right size. Both are either sketchy or labor intensive.
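
Workaround A could be scripted roughly like this. It is a minimal sketch that assumes a boto3 ECS client is passed in as `ecs`; `bounce_service` and the cluster and service names in the usage comment are hypothetical. `update_service` and the `services_stable` waiter are regular boto3 ECS client calls, but whether the tasks that remain afterwards are actually well spread is not guaranteed.

```python
def bounce_service(ecs, cluster, service, normal_count, temporary_count):
    """Temporarily change a service's desired count, then restore it."""
    for count in (temporary_count, normal_count):
        # Ask ECS to run `count` copies of the service's task.
        ecs.update_service(cluster=cluster, service=service, desiredCount=count)
        # Block until the service has settled at the requested count.
        ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])


# Example usage (hypothetical names, requires boto3 and AWS credentials):
#   import boto3
#   bounce_service(boto3.client("ecs"), "my-cluster", "my-workers",
#                  normal_count=4, temporary_count=0)
```

Passing a `temporary_count` lower than `normal_count` gives the scale-down-then-up variant; a higher value gives the scale-up-then-down variant mentioned earlier in the thread.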

MaerF0x0 on 3 Nov 2016

Despite all the "bugfixes" released in the past 8 months, this is our highest risk at the moment and why we're looking to move away to a Kubernetes option.

Routinely we find 2 of 2 instances of the same task running on 1 ECS Instance and thus not actually giving us the HA we need. We have no way of enforcing HA of tasks across multiple instances, leaving us with no option but to move away. It's a shame.

MaerF0x0 on 23 Nov 2016

+1. We frequently see many tasks stacked on one instance while others in the fleet lie completely unused. This leads to degraded performance and spurious alerts as the overloaded instance hits high CPU or memory usage thresholds.

egoldschmidt on 6 Dec 2016

Very big +1 here.

vpal on 16 Dec 2016

Werner announced at re:Invent that ECS will support customizable task placement policies. We will update this issue when it is delivered.

cbbarclay on 16 Dec 2016

🎉1

Customizable task placement policies are available now. @cbbarclay wrote a blog post describing how to use them, and more detailed information is available in the ECS developer guide.
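
As a rough illustration of what such a policy can look like for the HA case discussed in this thread, the sketch below builds the placement arguments for a service. The helper name `spread_placement` is hypothetical; the `placementStrategy` and `placementConstraints` keys and the `spread` and `distinctInstance` types are taken from the ECS API, and the ECS developer guide remains the authoritative reference.

```python
def spread_placement(distinct_instance=False):
    """Build placement arguments that spread tasks across AZs, then instances."""
    placement = {
        "placementStrategy": [
            # First balance tasks across Availability Zones...
            {"type": "spread", "field": "attribute:ecs.availability-zone"},
            # ...then across individual container instances.
            {"type": "spread", "field": "instanceId"},
        ]
    }
    if distinct_instance:
        # Hard rule: never run two tasks of the service on the same instance.
        placement["placementConstraints"] = [{"type": "distinctInstance"}]
    return placement


# Example usage (hypothetical names, requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("ecs").create_service(
#       cluster="my-cluster", serviceName="web", taskDefinition="web:1",
#       desiredCount=2, **spread_placement(distinct_instance=True))
```

The strategy is best effort, while the `distinctInstance` constraint is enforced at placement time, so tasks that cannot satisfy it are not placed. Neither one rebalances tasks that are already running.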

I'm going to close this in favor of https://github.com/aws/amazon-ecs-agent/issues/225, which is still tracking the re-balance feature request.