Amazon-ecs-agent: Feature Request: Limit Container Restarts.

Created on 21 May 2017 · 12Comments · Source: aws/amazon-ecs-agent

Add the ability to limit the number of container starts, e.g. if a container won't start after 5 tries, stop trying to start it. Restarting containers forever when they fail on start puts a large load on dockerd, which has to deal with volume setup for containers that failed to start.

Labels: kind/feature request, scope/ECS Service


rhyas

👍18

Most helpful comment

Somewhat, but not completely. The reason for opening this is: if a container definition is wrong, the arguments are bad, or something else is in error that causes the container to die on start, ECS will keep trying to restart it forever. This seems to make a mess of deferred volume cleanup in the devicemapper driver, and it may be causing an issue when cleaning up from such a scenario: the Docker daemon burns through all the available IOPS cleaning up failed containers and eventually puts the box in a state where it can't keep up with container cleanup because of IOPS limitations. I don't know whether this has to be fixed in the client or in the server/scheduler of ECS, but it seems a likely issue with running tasks that needs to be addressed. Essentially, rolling bad containers out to a cluster can DDoS the nodes in the cluster in ECS's current form.

rhyas on 27 May 2017

👍4

All 12 comments

samuelkarp on 22 May 2017

rhyas on 27 May 2017

👍4

I think some kind of exponential backoff pattern for restarts would be more useful here. I'm not sure that a fixed number of retries is that helpful as there may be transient failure conditions (databases unavailable/dependent services down etc.) that will eventually recover.

Logging the number of task-stopped events directly to CloudWatch would probably help in terms of generating alerts when bad service configs get introduced to a cluster. We had to create our own from CloudWatch Events + Lambda.

We've also observed that the services-stable command sometimes returns successfully in these situations even though the new task definition is essentially broken (it starts and fails immediately), so we'd get fewer of these problems if there were a way to explicitly fail and roll back a deployment that introduces constant restarts.

hwatts on 30 May 2017
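
The exponential backoff idea suggested above could look roughly like the following. This is only a minimal sketch to illustrate the pattern, not how the ECS agent or scheduler behaves; the function name `restart_delays` and its parameters (`base`, `factor`, `cap`, `max_attempts`) are hypothetical. Leaving `max_attempts` as `None` retries forever at a capped delay (which tolerates transient failures), while setting it gives the hard retry limit from the original request.

```python
import itertools


def restart_delays(base=1.0, factor=2.0, cap=300.0, max_attempts=None):
    """Yield the delay in seconds to wait before each restart attempt.

    Hypothetical helper: the delay grows geometrically from `base` and is
    never larger than `cap`. If `max_attempts` is None the generator never
    ends; otherwise it stops after that many attempts.
    """
    delay = base
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        yield min(delay, cap)
        # Clamp while growing so the value cannot overflow on long runs.
        delay = min(delay * factor, cap)
        attempt += 1


# First ten delays with the defaults: doubling until the 300 s cap is hit.
first_ten = list(itertools.islice(restart_delays(), 10))

# A hard limit of three attempts, as in the original request.
limited = list(restart_delays(max_attempts=3))
```

Capping the delay rather than stopping outright keeps a service recoverable once a dependency comes back, while still bounding how often dead containers are created and have to be cleaned up.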

A backoff algorithm would probably work as well; just something that makes the agent more intelligent when dealing with broken services/definitions, ensuring that it doesn't bring down the whole node, and potentially the whole cluster, by cramming in restarts until the nodes (if using EBS) run out of IOPS cleaning up all the dead containers.

rhyas on 5 Jun 2017

It would also be pretty helpful if, in the case of "flapping services" as we've been calling them, the cleanup of those failed containers were stepped up. A very aggressive backoff could work too, as has been pointed out.

panmanphil on 5 Jun 2017
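
The alerting approach mentioned earlier in the thread (counting task-stopped events to spot "flapping services") might be sketched as below. It assumes events shaped like ECS "Task State Change" events, with `detail-type`, `detail.lastStatus`, and `detail.group` fields; the function names and the threshold of 5 are hypothetical, and a real setup would feed this from an event rule and publish a metric or alarm instead of returning a list.

```python
from collections import Counter


def count_stopped_tasks(events):
    """Count STOPPED task events per task group (e.g. "service:web").

    Assumes each event is a dict shaped like an ECS Task State Change event.
    """
    counts = Counter()
    for event in events:
        detail = event.get("detail", {})
        if (event.get("detail-type") == "ECS Task State Change"
                and detail.get("lastStatus") == "STOPPED"):
            counts[detail.get("group", "unknown")] += 1
    return counts


def flapping_groups(events, threshold=5):
    """Return the groups whose stop count reaches the (hypothetical) threshold."""
    counts = count_stopped_tasks(events)
    return sorted(group for group, n in counts.items() if n >= threshold)


def _event(group, status):
    # Build a minimal sample event for demonstration purposes only.
    return {"detail-type": "ECS Task State Change",
            "detail": {"group": group, "lastStatus": status}}


sample = ([_event("service:broken", "STOPPED")] * 6
          + [_event("service:healthy", "RUNNING")] * 3
          + [_event("service:healthy", "STOPPED")])
flapping = flapping_groups(sample)
```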

@rhyas

Somewhat, but not completely.

Yep, that's exactly why I didn't close this as a duplicate :smile:

samuelkarp on 5 Jun 2017

We suffer from the same issue with bad images spawning repeatedly and consuming Docker resources. We end up with the message below (we have Ubuntu-based images that run a very slightly modified version of the ECS agent).

This renders the ECS instance unusable until we clean it up or terminate it.

devmapper: Thin Pool has 306 free metadata blocks which is less than minimum required 307 free metadata blocks. Create more free metadata space in thin pool or use dm.min_free_space option to change behavior

bwalding on 22 Jun 2017

We are also having this problem.
I agree that an exponential backoff pattern would be best, except when the problem is a bad image update, in which case the service would keep trying forever.
What is the recommended solution for bad image service updates? We're deploying service updates automatically using CloudFormation templates, and a bad update will hang the CloudFormation stack for an extremely long time.

dovreshef on 16 Nov 2017

Hmms... Does this mean that this issue is solved?

Circuit Breaking Logic for the Amazon ECS Service Scheduler

Thanks,

melo on 17 Jan 2018

Hi @melo,

Circuit Breaking Logic for the Amazon ECS Service Scheduler is indeed something that should help you with this. We apologize for not posting an update in this issue to go along with that announcement. The best part is that it's enabled by default for your applications managed by the service scheduler!

Resolving this issue for now. Please reach out to us if you have any follow up questions/comments.

cc @dovreshef @bwalding @rhyas @hwatts @panmanphil

Thanks,
Anirudh

aaithal on 17 Jan 2018

@aaithal out of curiosity: if we have problems with circuit breaking logic, where should we report it?

Over the past two days we had situations where I saw the circuit breaker work, and others where it didn't catch the problem.

melo on 18 Jan 2018

@melo I'd recommend reaching out to AWS Support to report such issues. You can open a new issue here, but strictly speaking that's not what this GitHub repo is intended for, and we'd ask you for a bunch of information to locate your tasks/deployments etc., which the AWS Support channel is better equipped to handle.

Thanks,
Anirudh

aaithal on 18 Jan 2018

👍1
