Containers-roadmap: [ECS] service should let the user specify retry timeout/attempts for deployments

Created on 20 Apr 2019 · 30 comments · Source: aws/containers-roadmap

Summary

ECS should let the user limit the time/number of attempts it makes to start new containers before giving up. The current process times out only after attempting for hours. This hinders an iterative development process, especially when making changes through CloudFormation templates: the CloudFormation stack sits in UPDATE_IN_PROGRESS or UPDATE_ROLLBACK_IN_PROGRESS for hours.

Description

When an ECS service is created or updated, it tries to start containers for the service for hours before failing. When this process is attempted through CloudFormation, this can mean that the stack is stuck in an unusable state for hours. Even cancelling the update is NOT an option, because the rollback still waits for the containers of the previous task definition to come up to the full desired count. Deleting and recreating the stack is not always an option.

Supporting Log Snippets

2019-04-19 Status Type Logical ID Status Reason

<<<<< Notice the end time >>>>
16:11:21 UTC-0700 UPDATE_ROLLBACK_COMPLETE AWS::CloudFormation::Stack {SERVICE}
16:11:20 UTC-0700 DELETE_COMPLETE AWS::ECS::TaskDefinition SomeService
16:11:19 UTC-0700 DELETE_IN_PROGRESS AWS::ECS::TaskDefinition SomeService
16:11:15 UTC-0700 UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS AWS::CloudFormation::Stack {SERVICE}
16:11:13 UTC-0700 UPDATE_COMPLETE AWS::ECS::Service SomeService
15:57:07 UTC-0700 UPDATE_IN_PROGRESS AWS::ECS::Service SomeService
15:57:06 UTC-0700 UPDATE_COMPLETE AWS::ElasticLoadBalancingV2::TargetGroup NlbTargetGroup
15:57:04 UTC-0700 UPDATE_IN_PROGRESS AWS::ElasticLoadBalancingV2::TargetGroup NlbTargetGroup
15:57:04 UTC-0700 UPDATE_COMPLETE AWS::ECS::TaskDefinition SomeService
15:56:46 UTC-0700 UPDATE_ROLLBACK_IN_PROGRESS AWS::CloudFormation::Stack {SERVICE} The following resource(s) failed to update: [SomeService].
15:56:43 UTC-0700 UPDATE_FAILED AWS::ECS::Service SomeService Service arn:aws:ecs:us-east-1:{ACCOUNTID}:service/{SERVICE} did not stabilize.

<<<<<< Notice the time when the update was requested >>>>>
12:55:23 UTC-0700 UPDATE_IN_PROGRESS AWS::ECS::Service SomeService
12:55:19 UTC-0700 UPDATE_COMPLETE AWS::ECS::TaskDefinition SomeService
12:55:19 UTC-0700 UPDATE_IN_PROGRESS AWS::ECS::TaskDefinition SomeService Resource creation Initiated
12:55:18 UTC-0700 UPDATE_IN_PROGRESS AWS::ECS::TaskDefinition SomeService Requested update requires the creation of a new physical resource; hence creating one.
12:55:16 UTC-0700 UPDATE_COMPLETE AWS::ElasticLoadBalancingV2::TargetGroup NlbTargetGroup
12:55:15 UTC-0700 UPDATE_IN_PROGRESS AWS::ElasticLoadBalancingV2::TargetGroup NlbTargetGroup
12:55:09 UTC-0700 UPDATE_IN_PROGRESS AWS::CloudFormation::Stack {SERVICE} User Initiated

Labels: ECS, Proposed


All 30 comments

We currently get around this by manually setting the desired count to 0, which causes ECS to mark the deployment as successful and allows CloudFormation to complete successfully.

A timeout/retry configuration would greatly improve this scenario: it could be automated, and the stack would fail appropriately in a far more timely manner.

Appears related - https://github.com/aws/containers-roadmap/issues/291

The linked issue #291 isn't related to this. That's requesting a timeout at the task level for batch jobs.

Hi @nhmaha and @samdammers, thank you for your feedback on this issue. We are currently working on a new feature which will effectively build a deployment circuit breaker and deployment states into ECS. We would like your feedback on the feature description below:

Feature:
"ECS will provide users with a way to define what they consider to be an unhealthy deployment, in terms of the maximum consecutive failures to launch healthy tasks. You can use taskLaunchFaultTolerance to define the maximum number of consecutive task launch failures after which ECS will consider the deployment unhealthy and roll back to the previous version. ECS will also introduce six service deployment states - RUNNING, COMPLETED, FAILED, ROLLBACK_STARTED, ROLLBACK_COMPLETE and ROLLBACK_FAILED - with corresponding ECS service deployment state change events in CloudWatch."

State definitions:

  1. RUNNING - the service is being deployed, with Running Count < Desired Count and new tasks being started with the 'new' service version (task set)
  2. COMPLETE - the service has been deployed successfully and satisfies these conditions:
     i. Running Count = Desired Count (steady state)
     ii. All running tasks are healthy, i.e. have passed the container and ELB health checks (at least once)
     iii. There are no pending/running tasks with the old service version (task set)
  3. FAILED - the service deployment has been stopped by ECS because it exceeded the maximum retries defined by the user in taskLaunchFaultTolerance. All pending tasks of the new service version (task set) will be stopped once the state has changed to FAILED
  4. ROLLBACK_RUNNING - ECS is moving back to the older version tasks and stopping the tasks of the new service version (task set) as per the deployment configuration defined by the customer, i.e. honoring minimumHealthyPercent and maximumPercent
  5. ROLLBACK_COMPLETE - the service has reached steady state with the older version tasks and all tasks are healthy. There are no pending or RUNNING tasks of the 'new' service version
  6. ROLLBACK_FAILED - the rollback triggered the circuit breaker by hitting the limit on the number of retries for task launch. ECS will stop all pending tasks and maintain the already RUNNING tasks to satisfy the minimumHealthyPercent criteria

@pavneeta is the new taskLaunchFaultTolerance feature already there or still in progress? I would be glad to use it, since it is quite annoying to wait for half an hour until Fargate gives up deploying a failing task.

We are still working on finalizing the design for this feature. We will update this issue on the roadmap as we make more progress.

ROLLBACK_RUNNING - ECS is moving back to the older version tasks

What does older version indicate here? Is it the older task definition? If yes, then how will this mechanism handle force-update type deployments?

These happen when the task definition defines the image as latest and a forced deployment is done to pull the latest image. The rollback will not work in these cases. Instead, a "stop current deployment" option would help, giving time to update the image and restart the deployment - but this isn't available as of now.

Maybe I'm wrong, but if you're overwriting the latest tag and forcing a new deployment, you've lost your safety net for rollback and there's next to nothing anyone can do?
In fact, if you overwrite latest and DON'T deploy, but instead have a scale event where new tasks start, they will download a latest image that doesn't match the one the other running tasks are using. To me it stands to reason that deploying latest is not a good idea if your project requires protection against broken deployments that can auto-recover without human intervention.

@deleugpn Yeah, I agree. But it is one of the suggested ECS deployment strategies in the AWS documentation (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service.html). The alternative is to keep updating the task definition with changing image ids, which is OK but not ideal, since task definition revisions quickly pile up. Looks like this wasn't thought through well.

Also, the ability to cancel an active deployment would still be useful in certain cases. Right now it's not under our control: the deployment will just keep going.

@pavneeta Having the taskLaunchFaultTolerance option will be great. Are you able to provide any update on this?

Any updates on this? We've had to make our own RPDK version of the taskLaunchFaultTolerance strategy (which seems to work relatively well) but we'd obviously prefer native support.

Is there any update on this? When will this feature be coming? @pavneeta

This feature is being implemented. We plan to release a preview of the solution soon to collect feedback. Please let us know whether you are interested in being included in the preview. Thanks.

@guoqings I am interested to participate in the preview and give feedback. I need this feature as well.

@guoqings Same, I'd love to participate in the preview. Thanks.

@guoqings My client is really interested in participating in the preview. Thanks.

@guoqings I am also interested in participating in this preview

I'm interested in participating in this preview

Hi everyone, in light of the interest in this feature, we will be doing a public preview so that all customers can test the new deployment circuit breaker functionality before it becomes generally available.

Furthermore, this is now a fully managed feature: you do not need to figure out and define a taskLaunchFaultTolerance for the deployment. ECS will automatically monitor the deployment for recurring task launch failures of a persistent nature, i.e. cases where no new tasks are reaching the Running or Healthy state. ECS will fail such deployments and trigger a rollback to the previous version of the service.

You can opt into the deploymentCircuitBreaker feature and, separately, into the automaticRollBack feature. You can also use the service events emitted to EventBridge to configure an SNS topic for manual intervention, or custom logic via a Lambda function.

Hope this helps!

@pavneeta will this also work with / feed back into CloudFormation, so that failed deployments mark the stack update as failed?

@pavneeta looks like this feature is available everywhere but CloudFormation; when can we expect CFN integration in the preview?

Hi Everyone,
This feature is now in preview; you can use it via the CLI, CDK and SDK. We are currently working on CloudFormation support and will be releasing it soon. You can read more on the AWS Containers Blog here - https://aws.amazon.com/blogs/containers/announcing-amazon-ecs-deployment-circuit-breaker/.

Please share your feedback here so that we can continuously improve the managed experience for this feature.

Thanks.

As per the blog, "This feature is available in public preview today, and can be enabled via the AWS Management Console, AWS CLI, and AWS SDK" - but the feature is not in the console. Any update on when it will be rolled out, and can we get the blog post updated? Thanks.

@Rana-Salama we fixed the blog post, thanks for calling it out. We are currently working on CloudFormation support (coming soon) and console support (post-GA launch). Thanks.

Question regarding the behavior of the automated rollback: if there are tasks associated with the old task definition still running when ECS decides to do an automated rollback, will those tasks be used as part of the rollback? Or will they be terminated and brand-new tasks launched with the old task definition? The blog post makes it seem like the automated rollback basically "promotes" the old deployment back to PRIMARY, so I'd assume tasks from that deployment stay around?

Yes, you are right: based on the deployment configuration you used (that is, minimumHealthyPercent and maximumPercent), the tasks corresponding to the older task definition will stay around and will be used as part of the rollback. ECS basically switches the old deployment back to primary.

We're having issues with the circuit breaker when we create a new service with two target groups attached. One of the target groups gets marked unhealthy and causes the task to fail, yet failedTasks does not go up and no circuit breaker gets triggered.

{
    "services": [
        {
            "serviceArn": "xxx",
            "serviceName": "xxx",
            "clusterArn": xxx",
            "loadBalancers": [
                {
                    "targetGroupArn": "xxx",
                    "containerName": "xxx",
                    "containerPort": 8080
                },
                {
                    "targetGroupArn": xxx",
                    "containerName": "xxx",
                    "containerPort": 8080
                }
            ],
            "serviceRegistries": [],
            "status": "ACTIVE",
            "desiredCount": 2,
            "runningCount": 2,
            "pendingCount": 0,
            "launchType": "FARGATE",
            "platformVersion": "1.3.0",
            "taskDefinition": "xxx",
            "deploymentConfiguration": {
                "deploymentCircuitBreaker": {
                    "enable": true,
                    "rollback": false
                },
                "maximumPercent": 200,
                "minimumHealthyPercent": 100
            },
            "deployments": [
                {
                    "id": "ecs-svc/9430161651966094514",
                    "status": "PRIMARY",
                    "taskDefinition": "xxx",
                    "desiredCount": 2,
                    "pendingCount": 0,
                    "runningCount": 2,
                    "failedTasks": 0,
                    "createdAt": "2020-12-09T15:13:23.728000+13:00",
                    "updatedAt": "2020-12-09T15:13:23.728000+13:00",
                    "launchType": "FARGATE",
                    "platformVersion": "1.3.0",
                    "networkConfiguration": {
                        "awsvpcConfiguration": {
                            "subnets": [
                                "xxx"
                            ],
                            "securityGroups": [
                                "xxx"
                            ],
                            "assignPublicIp": "DISABLED"
                        }
                    },
                    "rolloutState": "IN_PROGRESS",
                    "rolloutStateReason": "ECS deployment ecs-svc/9430161651966094514 in progress."
                }
            ],
            "roleArn": "xxx",
            "events": [
                {
                    "id": "f955a6b8-282d-42b5-964c-2d01cdb12c2e",
                    "createdAt": "2020-12-09T16:13:14.185000+13:00",
                    "message": "(service xxx) registered 2 targets in (target-group xxx)"
                },
                {
                    "id": "ede0f741-6bd0-4bbd-86b6-b45c0547d997",
                    "createdAt": "2020-12-09T16:13:14.026000+13:00",
                    "message": "(service xxx) registered 2 targets in (target-group xxx)"
                },
                {
                    "id": "f26ce090-a273-48d4-84e4-9946e0d3da5f",
                    "createdAt": "2020-12-09T16:12:33.320000+13:00",
                    "message": "(service xxx) has started 2 tasks: (task b9f706d3cfd4438fa12da4e2886183f5) (task 488bb9786c3c491794b2bcb488ac510e)."
                },
                {
                    "id": "a70957d8-b5a3-468e-a88a-0a56e4d8688a",
                    "createdAt": "2020-12-09T16:12:32.564000+13:00",
                    "message": "(service xxx, taskSet ecs-svc/9430161651966094514) has begun draining connections on 4 tasks."
                },
                {
                    "id": "5491272a-0466-4954-a26b-9b41590d2e34",
                    "createdAt": "2020-12-09T16:12:32.553000+13:00",
                    "message": "(service xxx) deregistered 2 targets in (target-group xxx)"
                },
                {
                    "id": "4ff94502-f5a6-4981-8090-49978ce8edf0",
                    "createdAt": "2020-12-09T16:12:32.503000+13:00",
                    "message": "(service xxx) deregistered 2 targets in (target-group xxx)"
                },
                {
                    "id": "8522272d-8cde-4245-93eb-bf633f981495",
                    "createdAt": "2020-12-09T16:12:22.727000+13:00",
                    "message": "(service xxx) has stopped 2 running tasks: (task dce891f65fc04e7280a4ddc07d51f3ef) (task 4f9856554d17427a9cdfb330e3ab9569)."
                },
                {
                    "id": "e4f869f3-a433-4190-be53-d15976a8b777",
                    "createdAt": "2020-12-09T16:12:22.647000+13:00",
                    "message": "(service xxx) (port 8080) is unhealthy in (target-group xxx) due to (reason Health checks failed)."
                },

@samgurtman thanks for sharing the feedback and the issue. Just to confirm: are the tasks rightly being marked unhealthy, or is that part of the issue you are facing? My understanding is that the undesired behavior - the failed task count not going up and the circuit breaker not being triggered - is caused by the multiple target groups. We are looking into the issue and will update this thread with a mitigation/fix.

Yes, the tasks do get marked unhealthy and killed. The undesired behaviour is the circuit breaker not triggering. Thanks for looking into this.

Thanks for that clarification. We have identified a fix and will be shipping it in early Q1'20.
