The ECS service reaches steady state before the container health checks pass. As a result, an ECS update deployed via CloudFormation is reported as successful, but the tasks then go into a restart loop once the container health check fails, leaving the whole ECS service unstable.
When we perform an update to ECS via CloudFormation, CloudFormation waits for the "steady state" event ("Service xxxxxxxx has reached a steady state.") to mark the update as successful. I can see the ECS service reach the steady state and then the health checks fail, which results in a successful CloudFormation deployment but an unstable ECS service.
The ECS service should reach steady state only after the container health checks pass.
The ECS service reaches steady state before the container health checks pass.
@mailjunze
Could you please mail me the following information at shugy at amazon dot com to help debug this issue --
1) Which version of Agent and Docker are you running?
2) How often do you observe this error?
3) Date/time of when the issue occurred, and ECS agent's configuration and debug level log files following instructions here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html
Haven't got back the requested information. Will close the issue for now. Feel free to reopen it when you have them ready.
Is there an easier way to check if the ECS service has reached the "steady" (not "stable") state other than constantly querying the last events and looking for the word "steady" in there?
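For reference, the event-scanning approach described in this question can at least be wrapped in a small helper. The following is a minimal sketch, not an official API: it assumes the service is a dict shaped like one entry of the `services` list returned by boto3's `describe_services` (with an `events` list whose items carry `createdAt` and `message`), and it matches on the "has reached a steady state" wording quoted earlier in this issue. The function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Wording taken from the event message quoted in this issue.
STEADY_MARKER = "has reached a steady state"


def latest_steady_state_event(service: dict) -> Optional[dict]:
    """Return the most recent steady-state event, or None if there is none."""
    steady = [
        event
        for event in service.get("events", [])
        if STEADY_MARKER in event.get("message", "")
    ]
    # Do not rely on the order of the events list; compare timestamps instead.
    return max(steady, key=lambda event: event["createdAt"], default=None)


def reached_steady_state_since(service: dict, since: datetime) -> bool:
    """True if a steady-state event was recorded at or after `since`."""
    event = latest_steady_state_event(service)
    return event is not None and event["createdAt"] >= since
```

Passing the time at which the update was started as `since` avoids mistaking a steady-state event from a previous deployment for the current one. This still depends on the event text, so it is only as reliable as that wording.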
We are seeing this issue and it's affecting our services. Is there a way to escalate?
We will be releasing a fix for this soon. With this change, all container health checks must be in a HEALTHY status before a Service or TaskSet can reach STEADY_STATE. This fix will apply to services using ECS/Rolling deployments controllers (Steady State event), external deployment controllers (TaskSet stabilityStatus) and CodeDeploy Blue/Green (deployments will now wait for Container Health Checks to pass before flipping traffic). Services not using container health checks will see no change in behavior. We will update this ticket once the change has been released.
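The condition described above can be approximated on the client side. The sketch below is an illustration under stated assumptions, not the logic ECS runs internally: it assumes dicts shaped like boto3's `describe_services`, `describe_tasks`, and `describe_task_definition` responses, and all function names are hypothetical. Containers without a configured health check are ignored, matching the statement that such services see no change in behavior.

```python
def containers_with_health_checks(task_definition: dict) -> set:
    """Names of containers whose definition includes a healthCheck block."""
    return {
        container["name"]
        for container in task_definition.get("containerDefinitions", [])
        if container.get("healthCheck")
    }


def task_is_healthy(task: dict, checked: set) -> bool:
    """True if every health-checked container in the task reports HEALTHY."""
    return all(
        container.get("healthStatus") == "HEALTHY"
        for container in task.get("containers", [])
        if container.get("name") in checked
    )


def service_looks_settled(service: dict, tasks: list, task_definition: dict) -> bool:
    """Approximate 'steady and healthy': one deployment, counts match, checks pass."""
    checked = containers_with_health_checks(task_definition)
    single_deployment = len(service.get("deployments", [])) == 1
    at_desired_count = service.get("runningCount") == service.get("desiredCount")
    return (
        single_deployment
        and at_desired_count
        and all(task_is_healthy(task, checked) for task in tasks)
    )
```

The task definition is consulted because a container with no health check and a container whose check has not yet run both report `UNKNOWN` in the task description, so the status alone cannot tell them apart.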
The bugfix has been deployed in all regions. Feel free to reply or reopen this issue if the problem persists.
@DireCorgi does this fix only apply to stack updates? I ask because I am able to create an ECS service with a failing container health check via CloudFormation. The stack and service are CREATE_COMPLETE, but the service just keeps restarting the container with the failing health check.