Hello,
Having the ability to spread out containers over a cluster as best as possible would be awesome for HA.
Currently, it seems that ECS allocates tasks to a random instance and sometimes puts all tasks of a specific task definition on one instance. If that instance were ever to fail, which we all know happens regularly in the cloud, there are a few minutes of downtime. ECS takes some time to realize that it needs to start the failed tasks on a different instance, and the time to pull the image on that other instance (if it's not pulled already) and run it adds to the outage.
If services had an option to distribute their tasks over the available instances, or did so natively, this would greatly increase HA for those services.
The scheduler is already supposed to spread tasks for availability. According to the doc, the choice of instance is done this way:
So this should do the trick but I agree with @djenriquez that we often end up in a situation where instances have no tasks running on them while some have 3 or 4.
This might just be a result of instances coming in and out of the cluster, since the placement logic only runs when the task first starts. It would be nice to encourage the spread of containers across newly arriving instances in the cluster while maintaining availability.
One workaround is to temporarily increase the container count and then lower it back to normal. I assume the leftover containers would be set up in the most spread-out way across the cluster.
Or are we missing something about the way containers are scheduled?
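For anyone who wants to script the scale-up-then-down workaround described above, here is a minimal sketch. It assumes `ecs` behaves like `boto3.client("ecs")` and only uses `update_service` and the `services_stable` waiter; the function name and the `extra_tasks` parameter are made up for illustration. It only automates the two steps. Nothing here guarantees which tasks ECS stops when the count is lowered, so the resulting spread is still an assumption.

```python
def bounce_desired_count(ecs, cluster, service, normal_count, extra_tasks=None):
    """Temporarily raise a service's desired count, then restore it.

    `ecs` is assumed to be a boto3 ECS client (or anything with the same
    update_service / get_waiter methods). Returns the temporary count used.
    """
    # Hypothetical default: double the service while it is "bounced".
    if extra_tasks is None:
        extra_tasks = normal_count
    surge_count = normal_count + extra_tasks

    # Step 1: scale up so new tasks get placed on the current cluster.
    ecs.update_service(cluster=cluster, service=service, desiredCount=surge_count)
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])

    # Step 2: scale back down to the normal size.
    ecs.update_service(cluster=cluster, service=service, desiredCount=normal_count)
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
    return surge_count
```

Usage would look something like `bounce_desired_count(boto3.client("ecs"), "dev-cluster", "web", 3)`, where the cluster and service names are placeholders.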
Nope, you're right @bpascard, my request was unclear. What I was asking for was actually what you described, a feature where ECS will rebalance whenever the state of the cluster changes.
Also, according to your docs, it appears ECS sorts by AZs and ends there. I believe that it should distribute amongst instances within the AZs to handle HA from instance failure as well.
It does distribute across instances within the AZ (see the 3rd point). But it only does so once, when the task runs, and it doesn't follow the evolution of your cluster.
What you are talking about is referenced in issue #225.
I agree "smarter allocation" is key, but there is a difference between the two issues: #225 is focused solely on resource allocation, whereas the purpose of this issue is service availability.
Also, the third point does not take the task definition into account, only which instances in the AZ have fewer running tasks. Even if a second or third instance has no running tasks, and the first one does, it would be good to try to run the task on the first instance to spread the service for HA.
The doc says it "favors instances with the fewest number of running tasks for this service", so it does take the task definition into account as it spreads the containers of the same service across instances. It's just not done dynamically after you change the cluster size, which would be good to have.
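To make the "only at launch" point concrete, here is a toy model of the placement rule as described in this thread: prefer the AZ with the fewest tasks of the service, then the instance with the fewest tasks of that service. This is a sketch of the documented behaviour, not ECS's actual code; the function name, the data shapes, and the tie-break by instance id are assumptions, and resource checks (CPU, memory, ports) are ignored.

```python
from collections import Counter


def place_task(instances, placements, service):
    """Pick an instance for one new task of `service`.

    instances:  dict of instance id -> availability zone
    placements: list of (instance id, service) pairs for running tasks
    """
    # Count running tasks of this service per instance and per zone.
    per_instance = Counter(
        inst for inst, svc in placements if svc == service and inst in instances
    )
    per_zone = Counter()
    for inst, count in per_instance.items():
        per_zone[instances[inst]] += count

    # Fewest tasks in the zone first, then fewest on the instance.
    # The instance id is only a deterministic tie-break (an assumption).
    return min(
        instances,
        key=lambda inst: (per_zone[instances[inst]], per_instance[inst], inst),
    )


instances = {"i-a": "az1", "i-b": "az2"}
placements = []
for _ in range(4):
    placements.append((place_task(instances, placements, "web"), "web"))

# A new instance joins later: the four existing tasks stay where they are,
# and only the next task to be launched lands on the new instance.
instances["i-c"] = "az1"
next_instance = place_task(instances, placements, "web")
```

In this model the four tasks end up two per instance, and after `i-c` joins nothing moves until another task is launched, which is the gap this issue is about.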
Ah, gotcha, thanks for clarifying @bpascard
You can confuse the agent, though, by making an instance fail: it doesn't replace dead containers (at least not in the 45 minutes I watched).
Try it on a dev cluster: go to an instance, run sudo shutdown, and watch as 1/N of your containers are shut down and not placed on another instance.
@MaerF0x0 well, that's a problem haha.
Just following up, any updates on this feature? Our main goal is increasing HA.
@djenriquez you'll probably get their canned response that AWS does not comment on what may or may not be built in the future, though the thing I reported is definitely a bug.
Any updates on this feature request?
Because rebalance doesn't exist, we're forced to run 3 instances at minimum to maintain a 1-server failure availability. We are literally paying for an instance that does nothing just to enable HA, for every cluster; a bit troublesome.
Hi all, can we get any information on this feature request? Services being HA is a non-functional requirement that, in my opinion, should be handled with the highest priority. If I can't guarantee service uptime with ECS, it defeats the whole purpose of me using it for production.
Please advise.
Still looking forward to this feature. I've just noticed that 5 out of my 20 instances are not running any services, while the other 15 are running very hot. What's my actionable here?
Same problem as @mikhail, especially now that ALB is live and we can have M containers running on N nodes (M > N), but often one node gets 1 container while most others have 3, for example.
Also an issue for us in our production workloads. We've actually had situations where we're not in HA with 2 containers because they ended up on the same box.
I'm pretty sure these are the steps to reproduce:
My best workarounds are to either A) scale services down and then back up (good for asynchronous workers) or B) add a large number of instances (e.g. double your cluster), scale the services up a lot (they will balance across the new instances), and then deregister the old instances and scale down to the right size. Both are either sketchy or labor intensive.
Despite all the "bugfixes" released in the past 8 months, this is our highest risk at the moment and why we're looking to move away to a Kubernetes option.
Routinely we find 2 of 2 instances of the same task running on 1 ECS instance, which does not give us the HA we need. We have no way of enforcing HA of tasks across multiple instances, leaving us with no option but to move away. It's a shame.
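Until placement can be enforced, one way to at least notice the "2 of 2 on one instance" situation is to check where the tasks are running. The sketch below assumes `tasks` looks like the `tasks` list from an ECS DescribeTasks response and reads only the `group`, `containerInstanceArn` and `lastStatus` fields; the function name and the sample ARNs are made up for illustration.

```python
from collections import defaultdict


def colocated_groups(tasks):
    """Return {group: instance ARN} for every task group whose running
    tasks (two or more) all sit on the same container instance."""
    by_group = defaultdict(list)
    for task in tasks:
        if task.get("lastStatus") == "RUNNING":
            by_group[task["group"]].append(task["containerInstanceArn"])

    return {
        group: arns[0]
        for group, arns in by_group.items()
        if len(arns) > 1 and len(set(arns)) == 1
    }


# Made-up sample data in the assumed shape.
sample = [
    {"group": "service:web", "containerInstanceArn": "arn:ci/1", "lastStatus": "RUNNING"},
    {"group": "service:web", "containerInstanceArn": "arn:ci/1", "lastStatus": "RUNNING"},
    {"group": "service:api", "containerInstanceArn": "arn:ci/1", "lastStatus": "RUNNING"},
    {"group": "service:api", "containerInstanceArn": "arn:ci/2", "lastStatus": "RUNNING"},
]
at_risk = colocated_groups(sample)
```

Here only `service:web` is flagged, since both of its running tasks share one instance. A check like this could feed an alert, but it does not fix the placement itself.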
+1. We frequently see many tasks stacked on one instance while others in the fleet lie completely unused. This leads to degraded performance and spurious alerts as the overloaded instance hits high CPU or memory usage thresholds.
Very big +1 here.
Werner announced at re:Invent that ECS will support customizable task placement policies. We will update this issue when it is delivered.
Customizable task placement policies are available now. @cbbarclay wrote a blog post describing how to use it, and more detailed information is available in the ECS developer guide.
I'm going to close this in favor of https://github.com/aws/amazon-ecs-agent/issues/225, which is still tracking the re-balance feature request.
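For readers landing here, a rough sketch of what the spread use case from this thread might look like with task placement: spread across Availability Zones first, then across instances, optionally with a `distinctInstance` constraint so no two tasks of the service share an instance. The helper name is made up; the dictionaries follow the `placementStrategy` and `placementConstraints` parameters of the ECS CreateService API as I understand them, so check the developer guide before relying on this.

```python
def spread_placement_kwargs(distinct_instance=False):
    """Build placement arguments for an ECS create_service call
    (hypothetical helper, not part of any AWS SDK)."""
    kwargs = {
        "placementStrategy": [
            # Balance tasks across Availability Zones first...
            {"type": "spread", "field": "attribute:ecs.availability-zone"},
            # ...then across container instances within each zone.
            {"type": "spread", "field": "instanceId"},
        ]
    }
    if distinct_instance:
        # Hard rule: each task of the service goes on a different instance.
        kwargs["placementConstraints"] = [{"type": "distinctInstance"}]
    return kwargs
```

With boto3 this would be passed along the lines of `ecs.create_service(cluster=..., serviceName=..., taskDefinition=..., desiredCount=2, **spread_placement_kwargs(distinct_instance=True))`. Note that this affects where new tasks are placed; it is not the re-balance feature tracked in #225.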