Add the ability to limit the number of container starts, e.g. if a container won't start after 5 tries, stop trying to start it. Restarting containers forever when they fail on start puts a large load on dockerd, which has to deal with volume setup for every container that failed to start.
Somewhat related to https://github.com/aws/amazon-ecs-agent/issues/674.
Somewhat, but not completely. The reason for opening this is: if a container definition is wrong, or its arguments are bad, or something else is in error that causes the container to die on start, ECS will keep trying to restart it forever. This seems to make a mess of deferred volume cleanup in the devicemapper driver, and seems to be causing a problem when cleaning up from such a scenario: the Docker daemon burns through all the available IOPS cleaning up failed containers and eventually puts the box in a state where it can't keep up with container cleanup because of IOPS limitations. I don't know whether this has to be fixed in the client or in the server/scheduler of ECS, but it seems a likely issue with running tasks that needs to be addressed. Essentially, rolling bad containers out to a cluster can DDoS the nodes in the cluster in ECS's current form.
I think some kind of exponential backoff pattern for restarts would be more useful here. I'm not sure that a fixed number of retries is that helpful as there may be transient failure conditions (databases unavailable/dependent services down etc.) that will eventually recover.
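To make the suggestion concrete, here is a minimal sketch of a capped exponential backoff with jitter. It is not how the ECS agent or scheduler behaves; the function name `restart_delay` and the `base` and `cap` defaults are assumptions chosen for illustration.

```python
import random


def restart_delay(attempt, base=1.0, cap=300.0, jitter=True, rng=random):
    """Return the seconds to wait before restart number `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative defaults,
    not values taken from ECS.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Bound the exponent so a very large attempt count cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # "Full jitter": pick uniformly in [0, ceiling] so that many hosts
    # restarting the same bad task do not retry in lockstep.
    return rng.uniform(0.0, ceiling) if jitter else ceiling
```

Because the delay is capped rather than the attempt count, a task that fails for a transient reason (a database being down, for example) is still retried once the dependency recovers, just at a much lower rate.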
Logging the number of task stopped events directly to CloudWatch would probably help with generating alerts when bad service configs are introduced to a cluster. We had to build our own from CloudWatch Events + Lambda.
We've also observed that the services-stable command sometimes returns successfully in these situations even though the new task definition is essentially broken (it starts and fails immediately), so we'd get fewer of these problems if there were a way to explicitly fail and roll back a deployment that introduces constant restarts.
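As a rough idea of what the CloudWatch Events + Lambda workaround could look like, the sketch below turns an "ECS Task State Change" event into a metric datum shaped like an entry in the `MetricData` list accepted by boto3's `put_metric_data`. The function name `stopped_task_metric`, the metric name `TaskStopped`, and the choice of dimensions are assumptions, not what was actually deployed.

```python
def stopped_task_metric(event):
    """Return a CloudWatch-style metric datum for a stopped task, else None.

    Hypothetical helper: the metric name and dimensions are illustrative.
    """
    if event.get("detail-type") != "ECS Task State Change":
        return None
    detail = event.get("detail", {})
    # Only count tasks that have actually reached the STOPPED state.
    if detail.get("lastStatus") != "STOPPED":
        return None
    return {
        "MetricName": "TaskStopped",  # assumed name, pick your own
        "Dimensions": [
            {"Name": "ClusterArn", "Value": detail.get("clusterArn", "unknown")},
            # For service tasks the group looks like "service:<service-name>".
            {"Name": "Group", "Value": detail.get("group", "unknown")},
        ],
        "Value": 1,
        "Unit": "Count",
    }
```

A Lambda handler could pass each non-None result to `put_metric_data` under a namespace of its choosing, and an alarm on the sum of that metric per group would then flag services that stop tasks unusually often.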
A backoff algorithm would probably work as well: just something that makes the agent more intelligent when dealing with broken services/definitions, so that it doesn't bring down the whole node, and potentially the whole cluster, by cramming in restarts until the nodes run out of IOPS (if using EBS) cleaning up all the dead containers.
It would also be pretty helpful, in the case of "flapping services" as we've been calling them, to increase the cleanup of those failed containers. A very aggressive backoff could work too, as has been pointed out.
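One way to recognise a "flapping" service is to count start failures inside a sliding time window. The sketch below is only an illustration of that idea; the class name `FlapDetector` and its thresholds are hypothetical and do not correspond to anything in the ECS agent.

```python
from collections import deque


class FlapDetector:
    """Track recent start failures and report when a service is flapping.

    Hypothetical sketch: thresholds are illustrative, timestamps are
    plain seconds supplied by the caller.
    """

    def __init__(self, max_failures=5, window_seconds=60.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()

    def _expire(self, now):
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()

    def record_failure(self, now):
        self._failures.append(now)
        self._expire(now)

    def is_flapping(self, now):
        self._expire(now)
        return len(self._failures) >= self.max_failures
```

A caller could use `is_flapping` to switch to a more aggressive backoff or to prioritise cleanup of that service's dead containers, which is the behaviour asked for above.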
@rhyas
Somewhat, but not completely.
Yep, that's exactly why I didn't close this as a duplicate :smile:
We suffer from the same issue with bad images spawning repeatedly and consuming Docker resources. We end up with the message below (we have Ubuntu-based images that run a very slightly modified version of the ECS agent).
This renders the ECS instance unusable until we cleanup / terminate it.
devmapper: Thin Pool has 306 free metadata blocks which is less than minimum required 307 free metadata blocks. Create more free metadata space in thin pool or use dm.min_free_space option to change behavior
We are also having this problem.
I agree that an exponential backoff pattern would be best, except when the problem is a bad image update, in which case the service would keep trying forever.
What is the recommended solution for bad image service updates? We're deploying service updates automatically using CloudFormation templates, and a bad update will hang the CloudFormation stack for an extremely long time.
Hmms... Does this mean that this issue is solved?
Circuit Breaking Logic for the Amazon ECS Service Scheduler
Thanks,
Hi @melo,
Circuit Breaking Logic for the Amazon ECS Service Scheduler is indeed something that should help you with this. We apologize for not posting an update in this issue to go along with that announcement. The best part is that it's enabled by default for your applications managed by the service scheduler!
Resolving this issue for now. Please reach out to us if you have any follow up questions/comments.
cc @dovreshef @bwalding @rhyas @hwatts @panmanphil
Thanks,
Anirudh
@aaithal out of curiosity: if we have problems with circuit breaking logic, where should we report it?
Over the past two days we had situations where I saw the circuit breaker work, and others where it didn't catch the problem.
@melo I'd recommend reaching out to AWS Support to report such issues. You can open a new issue here, but strictly speaking that's not what this GitHub repo is intended for, and we'd ask you for a bunch of information to locate your tasks/deployments etc., which the AWS Support channel is better tooled to handle.
Thanks,
Anirudh