Amazon-ecs-agent: Running a service once on every box in my ECS cluster

Created on 3 Apr 2017 · 10 Comments · Source: aws/amazon-ecs-agent

I want to be able to run a service on every box in my ECS cluster. Prior to "Placement Constraints", I achieved this by calling aws ecs start-task on boot (from a script specified in user_data). This worked, but it didn't handle the case where a container dies, for example because it runs out of memory.
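For reference, the start-task-on-boot approach described above could look roughly like the following in Python. This is a minimal sketch, not the script the author used: it assumes the ECS agent's introspection endpoint is reachable on localhost:51678, and the function names and the `ecs` parameter (a boto3 ECS client created elsewhere) are illustrative.

```python
import json
from urllib.request import urlopen

# ECS agent introspection endpoint; only reachable from the container instance itself.
METADATA_URL = "http://localhost:51678/v1/metadata"


def read_agent_metadata(url=METADATA_URL):
    """Return the agent metadata, which includes Cluster and ContainerInstanceArn."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def start_task_on_this_instance(ecs, metadata, task_definition):
    """Start one task pinned to the container instance described by `metadata`.

    `ecs` is assumed to be a boto3 ECS client, e.g. boto3.client("ecs").
    """
    return ecs.start_task(
        cluster=metadata["Cluster"],
        taskDefinition=task_definition,
        containerInstances=[metadata["ContainerInstanceArn"]],
    )
```

As the post notes, nothing in this approach restarts the task if its container dies later.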

My approach now is to use placement_constraints and give my ECS service a placement constraint of type distinctInstance. However, I noticed that I still need to specify a desired_count on my ECS service. If desired_count < cluster.size then I'll have boxes that don't have a running task. My problem is that I don't know how large my cluster will be.

My solution was to set desired_count to a large enough number (e.g. 1000). Is there a better way to achieve this? I'd just like the service to run on every instance, exactly once.
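To make the workaround concrete, here is a hedged sketch of the request parameters such a service might use with boto3's `create_service`. The helper names are hypothetical, and `expected_running_tasks` is an idealised upper bound that assumes every instance has room for the task.

```python
def distinct_instance_service(cluster, name, task_definition, desired_count=1000):
    """Build create_service parameters for the 'oversized desired count' workaround."""
    return {
        "cluster": cluster,
        "serviceName": name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
        # At most one task of this service per container instance.
        "placementConstraints": [{"type": "distinctInstance"}],
    }


def expected_running_tasks(desired_count, instance_count):
    """Upper bound on running tasks: distinctInstance caps it at one per instance."""
    return min(desired_count, instance_count)


# Usage (assumes an existing boto3 client): ecs.create_service(**params)
params = distinct_instance_service("my-cluster", "node-agent", "node-agent:1")
```

With a desired count of 1000 on a 12-instance cluster, at most 12 tasks can be placed; the service simply stays below its desired count.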

Labels: kind/feature request, scope/ECS Service

davidvuong

👍25

All 10 comments

You can also look at Blox as a daemon scheduler.
https://blox.github.io/

hridyeshpant on 29 Apr 2017

👍4

Where I work, we solved this with a Lambda function, triggered by ECS events, that updates the desiredCount of all distinctInstance services to match the number of ECS hosts in that ECS cluster. It would be better if ECS could handle this itself, though, and auto-scale the service based on the number of hosts.
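A Lambda along these lines could be sketched as follows. This is not the commenter's code: the `DAEMON_SERVICES` environment variable and the function names are assumptions made for illustration, and the handler assumes it is triggered by an ECS container instance state change event, which carries the cluster ARN in `detail.clusterArn`.

```python
def count_active_instances(ecs, cluster):
    """Count ACTIVE container instances, following nextToken pagination."""
    total = 0
    kwargs = {"cluster": cluster, "status": "ACTIVE"}
    while True:
        page = ecs.list_container_instances(**kwargs)
        total += len(page.get("containerInstanceArns", []))
        token = page.get("nextToken")
        if not token:
            return total
        kwargs["nextToken"] = token


def sync_desired_count(ecs, cluster, services):
    """Set the desiredCount of each service to the number of ACTIVE instances."""
    count = count_active_instances(ecs, cluster)
    for service in services:
        ecs.update_service(cluster=cluster, service=service, desiredCount=count)
    return count


def handler(event, context):
    """Lambda entry point (hypothetical wiring)."""
    import os
    import boto3  # imported lazily so the helpers above can be tested without AWS access

    cluster = event["detail"]["clusterArn"]
    # DAEMON_SERVICES is a made-up variable: a comma-separated list of service names.
    services = os.environ["DAEMON_SERVICES"].split(",")
    return sync_desired_count(boto3.client("ecs"), cluster, services)
```

Matching the count to the number of hosts avoids a permanently unmet desired count, but it does not reserve room on a host that other tasks have already filled.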

jamiegs on 25 Jul 2017

👍3

FWIW I created a Terraform module that you can use to build the Lambda function and all the underlying resources needed to handle this. Much easier than Blox, and less janky than setting the count to 2000.
https://github.com/kgirthofer/service_shuffler

Kgirthofer on 10 Aug 2017

👍3

Hi,

I think I followed the whole story about this feature from the beginning:
1. Start a task at ECS host launch (still in the AWS docs)
2. Use distinctInstance placement with a desired count of 1000
3. Use distinctInstance placement with a desired count managed by a Lambda
4. Have a real one-and-only-one-per-host-guaranteed feature in ECS

No 1: Doesn't work correctly because the task can die and so you need to manage all that yourself.
Also ECS doesn't know about it so it leads to a ton of other design/capacity issues, etc.

No 2: Seems to be widely used and is often cited as a fully functional workaround, but in reality you're just lucky if it works. Under certain circumstances a host can be filled with other tasks, and the scheduler will then fail to place your sidekicks on that host, leading to missing sidekicks. In short, your distinctInstance tasks are not guaranteed to be placed once on each host; it should happen _most_ of the time, which makes it a really bad workaround in practice.

No 3: Seems better than No 2, but it suffers from the same problem. It's entirely possible to get hosts with missing sidekicks if other tasks filled them up before your sidekicks were placed.

No 4: Doesn't exist, and seems like the only good solution.

We currently run them with Docker auto-restart at boot on each host, which suffers from a lack of ECS visibility and capacity management. (We can't reserve CPU units for the sidekicks in the ECS agent config either.)

It means if the scheduler fills my ECS host CPU wise, the sidekicks don't have any CPU.

We also had issues after a host rebooted following an ENA driver failure: the host came back, and the Docker logs showed that Docker tried really hard to restart the sidekicks but failed because the on-disk Docker state was corrupted. In short: tons of errors (many attempts and state cleanups), and they never started.
(To be fair, this host should have been recycled by the ASG anyway since it went bad, even after rebooting, but the ASG suffers from a bug with C5/M5 where failed health checks don't trigger replacement of the instance. This will be fixed by AWS but is unrelated to ECS and this issue.)

Is there something I'm missing and that I should consider while I wait for this feature?

Martin

hyksos on 23 Mar 2018

> No 3: Seems better than No 2, but it suffers from the same problem. It's entirely possible to get hosts with missing sidekicks if other tasks filled them up before your sidekicks were placed.

One way to avoid this is to have your Lambda also run its language's equivalent of the following, once it has deemed the "DaemonSet" containers launched and ready:

aws ecs put-attributes --attributes name=Ready,value=true,targetId=arn

And then in all other services/tasks have an attribute placement constraint needing Ready=true:

"placementConstraints": [
    {
        "expression": "attribute:Ready == true",
        "type": "memberOf"
    }
]

Then at least initially you will ensure the Base task(s) will properly fit before other tasks get thrown onto the instance. What happens after a task death is obviously another caveat one needs to solve. But even that should be doable from the same lambda. A container instance has suffered Base task death and there is not enough room to launch more? Set Ready:false and kill something (unless dynamically changing the attributes already makes ECS actively take things off - that I don't know).
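The attribute handling described above might be sketched like this with boto3. It is an illustration under assumptions rather than a tested scheduler: the function names and the `daemon_family` parameter are hypothetical, and "has a task with desired status RUNNING" is used as a rough stand-in for "launched and ready".

```python
def ready_attribute(instance_arn, ready):
    """Build the custom Ready attribute for one container instance."""
    return {
        "name": "Ready",
        "value": "true" if ready else "false",
        "targetType": "container-instance",
        "targetId": instance_arn,
    }


def update_ready_flag(ecs, cluster, instance_arn, daemon_family):
    """Set Ready=true only if a task of the daemon family is on the instance.

    `ecs` is assumed to be a boto3 ECS client; `daemon_family` is the task
    definition family of the per-host task (a name chosen for this sketch).
    """
    tasks = ecs.list_tasks(
        cluster=cluster,
        containerInstance=instance_arn,
        family=daemon_family,
        desiredStatus="RUNNING",
    )
    ready = bool(tasks.get("taskArns"))
    ecs.put_attributes(cluster=cluster, attributes=[ready_attribute(instance_arn, ready)])
    return ready
```

Running this on task state changes would flip Ready back to false after a base task dies. What to do about tasks already placed on that instance remains an open question, as the comment says.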

frimik on 23 Mar 2018

> No 2: Seems to be widely used and is often cited as a fully functional workaround, but in reality you're just lucky if it works. Under certain circumstances a host can be filled with other tasks, and the scheduler will then fail to place your sidekicks on that host, leading to missing sidekicks. In short, your distinctInstance tasks are not guaranteed to be placed once on each host; it should happen most of the time, which makes it a really bad workaround in practice.

Completely agreed. This is what we do and we suffer from the exact problem you describe.

cjbottaro on 23 Mar 2018

Hi all, I have created a Docker container that will update a service count to match a cluster instance count. So far it is working well. https://github.com/jamessoubry/ecs-service-count

jamessoubry on 11 May 2018

@jamessoubry that’s great! I was just looking at writing the exact same thing. All of the other approaches I’ve seen to this problem seem flawed (ECS docs) and/or way too heavy for my purposes (Blox). I’ll be giving this a whirl on my cluster tomorrow to run the DataDog container agent.

AaronTorgerson on 4 Jun 2018

Looks like this is now a thing! https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html#service_scheduler

DAEMON—The daemon scheduling strategy deploys exactly one task on each active container instance that meets all of the task placement constraints that you specify in your cluster. When using this strategy, there is no need to specify a desired number of tasks, a task placement strategy, or use Service Auto Scaling policies. For more information, see Daemon.
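As a rough illustration of the quoted description, a daemon service can be requested through boto3's `create_service` by setting `schedulingStrategy` to `DAEMON` and leaving out the desired count. The helper name and the example values are hypothetical.

```python
def daemon_service(cluster, name, task_definition):
    """Build create_service parameters for a DAEMON-scheduled service.

    No desiredCount or placement strategy is given: the scheduler places one
    task on each active container instance that meets the constraints.
    """
    return {
        "cluster": cluster,
        "serviceName": name,
        "taskDefinition": task_definition,
        "schedulingStrategy": "DAEMON",
    }


# Usage (assumes an existing boto3 client): ecs.create_service(**params)
params = daemon_service("my-cluster", "node-agent", "node-agent:1")
```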

jlongtine on 13 Jun 2018

Support for daemon tasks is now available as a new scheduling strategy. You can read the announcement and refer to the documentation for more details.

samuelkarp on 13 Jun 2018

🎉5
