Containers-roadmap: [ECS] [request]: Prioritize Daemon task scheduling above Replica tasks

Created on 1 Aug 2019 · 9 Comments · Source: aws/containers-roadmap

Tell us about your request
When our Replica ECS Services scale up and we launch new ECS Container Instances (from an ASG, scaling on *Reservation metrics), sometimes the Replica tasks are launched on the instance fast enough that our Daemon Services do not have a chance to launch on these new hosts. If these replicas use enough CPU/memory, there may not be room for the Daemon services to run.

For example, we have a few daemon services to collect host-level metrics and forward log files. We want these to run on every host. Periodically we see that hosts have been saturated with Replica tasks and there's no room for the Daemon tasks. This means we lack monitoring and visibility into these hosts.
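The race described above can be illustrated with a toy placement model. This is only a sketch: the `Host` class, the host size (2048 CPU units, 4096 MiB), and the task sizes are hypothetical numbers chosen for illustration, not values from our cluster or from the ECS scheduler itself.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A container instance with a fixed pool of CPU units and memory (MiB)."""
    cpu: int
    memory: int
    tasks: list = field(default_factory=list)

    def try_place(self, name: str, cpu: int, memory: int) -> bool:
        """Place a task only if both of its reservations still fit."""
        if cpu <= self.cpu and memory <= self.memory:
            self.cpu -= cpu
            self.memory -= memory
            self.tasks.append(name)
            return True
        return False


def simulate(order):
    """Place tasks on a fresh hypothetical host in the given order."""
    host = Host(cpu=2048, memory=4096)
    return {name: host.try_place(name, cpu, mem) for name, cpu, mem in order}


# Hypothetical task sizes: (name, CPU units, memory in MiB).
replica_a = ("replica-a", 1024, 2048)
replica_b = ("replica-b", 1024, 2048)
daemon = ("log-forwarder", 128, 256)

# Replicas win the race: the host is full before the daemon arrives.
replicas_first = simulate([replica_a, replica_b, daemon])
# Daemon placed first: it fits, and the second replica has to go elsewhere.
daemon_first = simulate([daemon, replica_a, replica_b])
```

In the first ordering the daemon is the task left without a host; in the second, a replica is left over instead, which is the outcome this request asks for, since a pending replica can still be placed on another instance.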

Which service(s) is this request for?
This could be ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Every host should have Daemon tasks provisioned on it, reliably.

Are you currently working around this issue?
We manually go into the ECS console, review all of the hosts (possibly hundreds) to identify which ones are missing Daemon tasks, then run "Stop Task" on one of the replica tasks on each such host so that the Daemon tasks have room to launch.
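The manual review step could in principle be scripted. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. The function names and the daemon service names in the usage note are hypothetical. It relies on the `group` field of tasks started by a service having the form `service:<service-name>`.

```python
def instances_missing_daemons(instance_arns, tasks, daemon_services):
    """Return {instance_arn: sorted list of missing daemon service names}.

    `tasks` is a list of dicts shaped like the entries returned by the ECS
    DescribeTasks API: each has `containerInstanceArn` and `group`, where
    tasks started by a service carry the group "service:<service-name>".
    """
    wanted = set(daemon_services)
    running = {arn: set() for arn in instance_arns}
    for task in tasks:
        arn = task.get("containerInstanceArn")
        group = task.get("group", "")
        if arn in running and group.startswith("service:"):
            running[arn].add(group[len("service:"):])
    return {
        arn: sorted(wanted - services)
        for arn, services in running.items()
        if wanted - services
    }


def fetch_cluster_state(cluster):
    """Collect instance ARNs and running tasks via boto3 (imported lazily)."""
    import boto3  # assumption: boto3 is installed and AWS credentials are set

    ecs = boto3.client("ecs")
    instance_arns = []
    for page in ecs.get_paginator("list_container_instances").paginate(cluster=cluster):
        instance_arns.extend(page["containerInstanceArns"])

    task_arns = []
    pages = ecs.get_paginator("list_tasks").paginate(cluster=cluster, desiredStatus="RUNNING")
    for page in pages:
        task_arns.extend(page["taskArns"])

    tasks = []
    for i in range(0, len(task_arns), 100):  # DescribeTasks accepts up to 100 ARNs
        resp = ecs.describe_tasks(cluster=cluster, tasks=task_arns[i:i + 100])
        tasks.extend(resp["tasks"])
    return instance_arns, tasks
```

Usage would look like `instances_missing_daemons(*fetch_cluster_state("my-cluster"), ["metrics", "filebeat"])`, where the cluster and service names are placeholders. This only finds the affected hosts; deciding which replica task to stop is still a manual call.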

ECS Proposed

All 9 comments

Just got bitten by this again today (it has happened more than once). Had to go find some tasks to kill off to make room. Then the issue was that the Daemon service wouldn't "retry" quickly enough to fill the void on that host, and other tasks would get binpacked in there...

@ericdahl @CpuID Thank you so much for your valuable feedback. We are currently in the middle of scoping out the solution to this known problem.

Just ran into this again on a deploy of a daemon service (a log ingest process - filebeat), had to stop a bunch of other tasks to make room...

Is there a projected timeline for this? Or is this:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/start_task_at_launch.html

a viable alternative?

The ECS agent is run as a Docker container and makes it onto every EC2 instance. A workaround therefore would be to start your platform-critical daemon processes without an ECS task definition and instead do it in your user-data.txt.

You'd need a wrapper to ensure the container is always running and is restarted if a rudimentary health check fails, among other things (like a way to inject secrets into the Docker containers).

I sure do wish the ecs-agent team would prioritise this as it seems pretty obvious that Daemons are important processes.

> The ECS agent is run as a Docker container and makes it onto every EC2 instance. A workaround therefore would be to start your platform-critical daemon processes without an ECS task definition and instead do it in your user-data.txt.

Yeah, it's possible to start things with priority outside of the ECS ecosystem, but you then need to reserve resources in the ECS agent, and deploys of new versions of daemon services are still a PITA (you need to replace the EC2 instances instead of just doing a daemon service deploy). A step backwards overall ;) (probably what we all did before daemon services were a thing)
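For reference, the user-data workaround discussed above might look roughly like the following sketch, which renders a boot script from Python. The function name and all arguments are hypothetical placeholders. It assumes an ECS-optimized AMI where Docker is already running when user data executes, and it uses the agent's `ECS_RESERVED_MEMORY` setting (in MiB) to keep the daemon's memory out of the pool offered to tasks.

```python
def render_user_data(cluster, image, name, memory_mib):
    """Render an EC2 user-data script that starts a daemon container outside ECS.

    All arguments are hypothetical placeholders supplied by the caller.
    """
    lines = [
        "#!/bin/bash",
        "# Join the cluster and hide the daemon's memory from the ECS scheduler.",
        f"echo 'ECS_CLUSTER={cluster}' >> /etc/ecs/ecs.config",
        f"echo 'ECS_RESERVED_MEMORY={memory_mib}' >> /etc/ecs/ecs.config",
        "# Docker restarts the container whenever it exits; this does not",
        "# act on failed health checks, which would need a separate wrapper.",
        f"docker run -d --name {name} --restart always --memory {memory_mib}m {image}",
    ]
    return "\n".join(lines) + "\n"


script = render_user_data("my-cluster", "example/filebeat:latest", "filebeat", 256)
```

This covers only memory and the restart-on-exit case. Secrets, health-check-driven restarts, and rolling out a new image version all remain unsolved, which is the "step backwards" mentioned above.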

> I sure do wish the ecs-agent team would prioritise this as it seems pretty obvious that Daemons are important processes.

+100 - the ideal goal here

@pavneeta any updates on where this is at?

+1 for this one

+1 We had to wrestle with this on a dense ECS cluster today. It would be great to see this baked in!
