Containers-roadmap: ECS tasks re-balancing on autoscaling.

Created on 12 Dec 2018 · 14 comments · Source: aws/containers-roadmap

There isn't any automated or easy way to re-balance tasks inside an ECS cluster when scaling events happen (usually scaling down). It would be a very nice feature to have.
There's already a 3-year-old issue open for this feature:
https://github.com/aws/amazon-ecs-agent/issues/225

ECS Proposed

Most helpful comment

I'm confused/surprised this is not already an automatic feature. Why would I need to set up a Lambda if ECS already knows what's happening?

All 14 comments

Yep, we have to do the same thing. We listen to the scaling event triggers using Lambda, then tweak the scaling rules from there.

Orchestrating the orchestration.

Thank you for the feedback. The ECS team is aware of this issue, and it is under active consideration. +1's and additional details on use cases are always appreciated and will help inform our work moving forward.

@parabolic and @skyzyx what are the criteria you use to rebalance tasks? Are you aiming to binpack on as few instances as possible, or spread evenly, or something else?

@coultn: We try to spread evenly.

We treat our servers as cattle (not pets), and use Terraform for Infrastructure as Code. Occasionally, we will need to log into the Console, drain connections on a node, terminate it, and let auto-scaling kick-in to replace the node.

One thing that I've noticed (although the Lambda has been working so well that I haven't tried without it for about 6 months) is that after draining connections off a node, it doesn't actually move the containers over to the remaining hosts. Nor does it move them back when the replacement host comes back up.

So we have a Lambda function configured to listen for the scale events, then trigger the rebalancing that way. It's based on the premise in this blog post: https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/
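For illustration, here is a minimal Python sketch of that pattern, assuming the lifecycle hook notification reaches the Lambda via EventBridge and that the cluster name comes from a CLUSTER_NAME environment variable. This is not the blog post's exact code (that version uses SNS and re-publishes the notification until the instance has finished draining); names and the event shape are assumptions.

    # Hedged sketch: drain the ECS container instance behind an EC2 instance that
    # is entering the autoscaling:EC2_INSTANCE_TERMINATING lifecycle state.
    import os
    import boto3

    ecs = boto3.client("ecs")
    asg = boto3.client("autoscaling")

    def handler(event, context):
        detail = event["detail"]  # lifecycle hook notification (assumed shape)
        ec2_instance_id = detail["EC2InstanceId"]
        cluster = os.environ["CLUSTER_NAME"]

        # Map the EC2 instance ID to its ECS container instance.
        arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
        described = ecs.describe_container_instances(
            cluster=cluster, containerInstances=arns
        )["containerInstances"]
        target = next(ci for ci in described if ci["ec2InstanceId"] == ec2_instance_id)

        # DRAINING tells the service scheduler to move tasks to the remaining hosts.
        if target["status"] != "DRAINING":
            ecs.update_container_instances_state(
                cluster=cluster,
                containerInstances=[target["containerInstanceArn"]],
                status="DRAINING",
            )

        # Only let the ASG terminate the instance once its tasks are gone;
        # a real implementation keeps re-checking until this is true.
        if target["runningTasksCount"] == 0:
            asg.complete_lifecycle_action(
                LifecycleHookName=detail["LifecycleHookName"],
                AutoScalingGroupName=detail["AutoScalingGroupName"],
                LifecycleActionResult="CONTINUE",
                InstanceId=ec2_instance_id,
            )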

@coultn Thanks for the answer.
As for the details, I use the following placement strategies:

type  = "spread"
field = "instanceId"

type  = "binpack"
field = "cpu"

And ideally the tasks should be spread evenly across nodes, e.g. 1 task per node. This setup should be rebalanced or preserved after scaling events without any intermediate logic.
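For context, this is roughly what that configuration looks like when expressed through the ECS API with boto3; cluster, service, and task definition names below are placeholders, not the poster's actual setup.

    # Hedged sketch: a service using the placement strategies quoted above.
    import boto3

    ecs = boto3.client("ecs")

    ecs.create_service(
        cluster="my-cluster",        # placeholder
        serviceName="my-service",    # placeholder
        taskDefinition="my-task:1",  # placeholder
        desiredCount=4,
        launchType="EC2",
        placementStrategy=[
            {"type": "spread", "field": "instanceId"},  # one task per node where possible
            {"type": "binpack", "field": "cpu"},        # then pack by CPU
        ],
    )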

Our team has a similar issue. We have alerts set up for our support rotation to notify us when memory is too high on a particular instance. It occasionally happens that we are paged because the alert threshold is breached when in fact the tasks are not spread evenly, i.e. one instance has several ECS tasks running and another has few or none.

The policy on the ASG is to maintain MemoryReservation at ~70%.

The placement strategy is

"Field": "attribute:ecs.availability-zone",
"Type": "spread"
"Field": "instanceId",
"Type": "spread"

I'm confused/surprised this is not already an automatic feature. Why would I need to set up a Lambda if ECS already knows what's happening?

Do Capacity Providers help with this?

Do Capacity Providers help with this?

Yes, if you use managed scaling, capacity providers can help. Here's how: if you use a target capacity of 100 for your ASG, then ECS will only scale your ASG out in the event that there are tasks you want to run that can't be placed on the existing instances. So the instances that get added will usually have tasks placed on them immediately - unlike before, where new instances would sometimes sit empty for some time. This isn't the only rebalancing scenario discussed on this thread, so we still have more work to do.
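For anyone trying this, the setup described above corresponds roughly to a capacity provider created with managed scaling and a target capacity of 100. A hedged boto3 sketch follows; the provider name and ASG ARN are placeholders.

    # Hedged sketch: capacity provider with managed scaling at 100% target capacity,
    # so the ASG only scales out when pending tasks cannot be placed.
    import boto3

    ecs = boto3.client("ecs")

    ecs.create_capacity_provider(
        name="my-capacity-provider",  # placeholder
        autoScalingGroupProvider={
            "autoScalingGroupArn": "arn:aws:autoscaling:<region>:<account>:autoScalingGroup:...",  # placeholder
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": 100,
            },
            "managedTerminationProtection": "ENABLED",
        },
    )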

@peterpod-c1

It occasionally happens that we are paged because the alert threshold is breached when in fact the tasks are not spread evenly, i.e. one instance has several ECS tasks running and another has few or none.

We see the same thing regularly. The reason, we find, is that if many tasks are started concurrently (esp. during deployment of new code), ECS dumps a lot of them onto the instance with the lowest utilization. By the time the tasks are up and running, the instance that had the lowest utilization is now the one with the highest utilization, and is sometimes over-utilized, resulting in memory pressure that is too high. The workaround is to "slowly" start up new tasks, but that conflicts with the goal of having the services up and running as fast as possible.

I have to add here that we see this issue only in environments where no rolling upgrades of the services are performed, i.e., those where "min healthy percent" is 0. In the environments where "min healthy percent" is greater than 0 this issue does not occur.
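For readers unfamiliar with the setting, "min healthy percent" is part of a service's deployment configuration. A hedged sketch of where it is set; service and cluster names, and the chosen percentages, are placeholders.

    # Hedged sketch: raising minimumHealthyPercent so deployments roll gradually
    # instead of stopping all tasks at once.
    import boto3

    ecs = boto3.client("ecs")

    ecs.update_service(
        cluster="my-cluster",  # placeholder
        service="my-service",  # placeholder
        deploymentConfiguration={
            "minimumHealthyPercent": 50,  # 0 means all tasks may be stopped during a deploy
            "maximumPercent": 200,
        },
    )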

Do Capacity Providers help with this?

Not quite.

In my tests I started with 4 EC2 instances, each of which can take a max of 4 tasks. The distribution at the start was 4,4,4,1. After putting some load on it, it scaled to 9 EC2 instances running 22 tasks for 15+ minutes, i.e., on average it ran only 2.4 tasks per EC2 instance instead of the expected 3.5+. Running 6 EC2 instances should have been enough. After dropping the load and letting all the scale-in actions finish, we ended up with 6 EC2 instances running 4,4,2,1,1,1 tasks, instead of the expected 4,4,4,1. So in the end it is running at an additional cost of 2 EC2 instances.

What I expect is that scaling in works in reverse order to scaling out, LIFO style. That way the capacity provider would remove the 2 extra EC2 instances.

The capacity provider used:

            "managedScaling": {
              "status": "ENABLED",
              "targetCapacity": 100,
              "minimumScalingStepSize": 1,
              "maximumScalingStepSize": 100
            },
            "managedTerminationProtection": "ENABLED"

The placement strategy used:

          {
            "field": "attribute:ecs.availability-zone",
            "type": "spread"
          },
          {
            "field": "instanceId",
            "type": "spread"
          }

@lawrencepit Thanks for the detailed example. In this specific case, there are two implicit goals you have that are in conflict with each other. Goal 1, expressed through the instance spread placement strategy, is that you want your tasks to be spread across all available instances. Goal 2, not explicitly expressed in your configuration, is that you want to use as few instances as possible. As you point out, ECS cluster auto scaling with capacity providers cannot solve for both of these simultaneously. In fact, goal 2 is generally impossible to achieve optimally (it's known as the binpacking problem and there are no optimal solvers that can run in any reasonable amount of time). However, if you want to at least do a better job of meeting goal 2, you could remove goal 1 - don't use instance spread, but use binpack on CPU or memory instead. Binpack placement strategy isn't optimal (for reasons stated previously) but it will generally use fewer instances than instance spread.

My last comment is regarding LIFO. Placement strategies don't work that way - instead, they try to maintain the intent of the placement strategy as the service scales in and out. Using LIFO for scaling in would actually cause the tasks to be spread across fewer instances, which is not the intent of the instance spread strategy.

@coultn While achieving goal 2 "optimally" may not be possible, we've achieved reasonably good results by having our scaler Lambda drain least-utilized instances when it detects that there is more than one instance worth of excess capacity in the cluster. The argument could be made that this goes against goal 1, but I don't think that's the case. Your definition of that goal is that the tasks should be spread across _available_ instances. If an instance that's surplus to resource requirements is no longer considered "available", then goal 1 is still achieved. Considering that placement strategies are set at the service level and seem to be best-effort, while resource availability is more of a cluster-level concept, that strategy seems to be compatible with both goals.
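A rough Python sketch of that idea, under the assumption that "one instance worth of excess capacity" is measured in spare CPU units and that the least-utilized instance is the one with the most remaining CPU; the cluster name is a placeholder and this is not the poster's actual Lambda.

    # Hedged sketch: drain the least-utilized ACTIVE instance when the cluster has
    # more than one instance worth of spare CPU.
    import boto3

    ecs = boto3.client("ecs")
    CLUSTER = "my-cluster"  # placeholder

    def cpu(resources):
        return next(r["integerValue"] for r in resources if r["name"] == "CPU")

    arns = ecs.list_container_instances(cluster=CLUSTER, status="ACTIVE")["containerInstanceArns"]
    instances = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]

    spare_cpu = sum(cpu(ci["remainingResources"]) for ci in instances)
    per_instance_cpu = max(cpu(ci["registeredResources"]) for ci in instances)

    if len(instances) > 1 and spare_cpu > per_instance_cpu:
        # The instance with the most remaining CPU is the least utilized; draining it
        # lets the service scheduler reschedule its tasks onto the remaining hosts.
        target = max(instances, key=lambda ci: cpu(ci["remainingResources"]))
        ecs.update_container_instances_state(
            cluster=CLUSTER,
            containerInstances=[target["containerInstanceArn"]],
            status="DRAINING",
        )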

One doubt about binpack-on-memory task distribution with services using capacity provider autoscaling: does it stop and start tasks that are already in a running state? I have some tasks that run cron jobs, so does it restart them in the middle of an execution? I have enabled managed termination protection on the capacity provider (and my ASG).

This would be very much appreciated.

We have a setup that roughly matches what's described in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch_alarm_autoscaling.html, and in order to replace tasks before instance termination we too use the approach described in the blog post mentioned by @skyzyx: https://aws.amazon.com/blogs/compute/how-to-automate-container-instance-draining-in-amazon-ecs/. We use the default autoscaling termination policy for determining which instances to terminate.

Because we do not wish to maintain this intricate system, it was our intention to start using ECS cluster auto scaling. However, "Design goal #2: CAS should scale in (removing instances) only if it can be done without disrupting any tasks (other than daemon tasks)" is totally off point for our use case.

I just can't see the use case where potentially there are, e.g., 20 container instances running 20 tasks with a CPU and memory utilization of, say, 10%, as would almost certainly happen for services that have a placement strategy of type spread on the field instanceId.

What we would like is more of a best-effort spread (kind of what's described by @idubinskiy) and the capability to choose metrics for scaling other than CapacityProviderReservation, such as CPUReservation thresholds.
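As a hedged illustration of the last point, scaling on a CPUReservation threshold could look like a CloudWatch alarm on the cluster metric wired to an ASG scaling policy; the names, thresholds, and the policy ARN below are placeholders, not a confirmed setup.

    # Hedged sketch: alarm on the cluster's CPUReservation metric; the alarm action
    # would point at an ASG scaling policy (not shown).
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="ecs-cpu-reservation-high",  # placeholder
        Namespace="AWS/ECS",
        MetricName="CPUReservation",
        Dimensions=[{"Name": "ClusterName", "Value": "my-cluster"}],  # placeholder
        Statistic="Average",
        Period=60,
        EvaluationPeriods=3,
        Threshold=75.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:autoscaling:<region>:<account>:scalingPolicy:..."],  # placeholder
    )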
