Amazon-ecs-agent: Respect container/task health status when scaling the service

Created on 15 Mar 2018 · 22 Comments · Source: aws/amazon-ecs-agent

Summary

Issue from a customer in #534:

I've had a look at this today and it doesn't look like ECS observes the health status during deployments. Is this by design?

For example, create a service with one healthy container and perform a deployment (min 100%, max 200%) that's broken and goes to UNHEALTHY: the healthy container (old version) is stopped, the deployment completes, and the unhealthy container (new version) remains and is then terminated over and over because it's unhealthy.
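
To make the "min 100% max 200%" setting concrete, here is a minimal sketch of how those percentages translate into task counts. The function name `deployment_bounds` is hypothetical, and the rounding (minimum rounded up, maximum rounded down) is an assumption based on how the ECS documentation describes `minimumHealthyPercent` and `maximumPercent`.

```python
def deployment_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_healthy_tasks, max_total_tasks) for a rolling update.

    Assumption: the minimum is rounded up and the maximum is rounded down.
    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    # Ceiling division for the lower bound.
    min_healthy = -(-desired_count * minimum_healthy_percent // 100)
    # Floor division for the upper bound.
    max_total = desired_count * maximum_percent // 100
    return min_healthy, max_total


# The scenario from this report: one task, min 100%, max 200%.
print(deployment_bounds(1, 100, 200))    # (1, 2)
# The 10-task service with min 100%, max 120% used later in the thread.
print(deployment_bounds(10, 100, 120))   # (10, 12)
```

With a single task, min 100% means one healthy task should exist at all times, so the old task should only be stopped once the new one counts as healthy. The report above is that the new task counted as soon as it was RUNNING.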


kind/enhancement scope/ECS Service

Source

richardpen

👍25

Most helpful comment

I'm experiencing a similar issue:

I set up a health check command that returns true when the task is ready.
My task takes more than 5 minutes to become healthy because it does a lot of work during initialization.
After it's initialized, the healthcheck command returns 'true'.
But during a rolling update, I've observed that the next task was terminated even though the first task was not ready.

My point is: ECS should wait until the new task's health status is 'HEALTHY', but it just terminates and runs the second task on my service.

However, ECS just kills the old task even when the new task is not ready to serve.
I'd like to know how I can sync my task's 'READY_TO_SERVE' status with its 'RUNNING' state.

Can we confirm the issue?

eugenebut12 on 28 Jan 2020

👍8

All 22 comments

Hi!
Can you give me an estimate of when this feature will be released, or should I use ALB/ELB to avoid this issue?

gbotka on 20 Mar 2018

lnkisi on 21 Mar 2018

👎4 😕1

coltaa on 23 Apr 2018

👎2

Perhaps related: we have an ECS service using ELB and a health-check grace period. When we deploy a new version of the task definition for this service, it appears that existing tasks are stopped immediately rather than waiting for the health check to succeed (which it eventually does, at which point the service becomes healthy again).

alexhall on 24 Apr 2018

👍2

Yeah, this is what this issue is tracking I believe. Health checks are not observed at all during deployments.

hwatts on 25 Apr 2018

Yeah, I wasn't sure if this issue was targeted specifically at container/task health checks or all health checks.

Specifically for the case of ELB health checks, the docs seem to imply that they should already be respected:

tasks for services that do use a load balancer are considered healthy if they are in the RUNNING state and the container instance on which it is hosted is reported as healthy by the load balancer.

... which leaves me wondering if there's a problem in my configuration, or a problem in ELB.

alexhall on 25 Apr 2018

Sorry, I misunderstood. Assuming you're using ELB health checks rather than the new docker health checks, then this should work. The most obvious thing to check would be that minimumHealthyPercent is set to 100. Probably going a bit off topic for people who are watching for the original bug to be fixed, so perhaps start a post on the ECS forum or raise a support ticket if it's not that.

hwatts on 25 Apr 2018

I can also confirm that with the latest ECS agent version the healthcheck is applied, but it is not respected except when the task is unhealthy or doesn't become healthy before the end of the start period. I have tried deployments and instance draining with minimumHealthyPercent set to 100 percent, and the healthy task is killed before the new one becomes healthy. ELB healthchecks are respected, however.
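
The behavior this thread is asking for can be sketched as a small calculation: old tasks may only be stopped while the number of tasks reporting HEALTHY stays at or above the minimum. This is an illustration of the expected rule, not the ECS scheduler's actual implementation; `stoppable_old_tasks` and its parameters are hypothetical names.

```python
def stoppable_old_tasks(old_healthy, new_healthy, desired_count,
                        minimum_healthy_percent=100):
    """How many old tasks could be stopped without dropping below the minimum.

    Only tasks whose health status is HEALTHY are counted; a new task that
    is merely RUNNING (still starting up) contributes nothing.
    """
    # Minimum healthy task count, rounded up (an assumption about rounding).
    min_healthy = -(-desired_count * minimum_healthy_percent // 100)
    surplus = old_healthy + new_healthy - min_healthy
    return max(0, min(old_healthy, surplus))


# One old healthy task, new task still starting: nothing may be stopped.
print(stoppable_old_tasks(old_healthy=1, new_healthy=0, desired_count=1))   # 0
# Once the new task is HEALTHY, the old one can go.
print(stoppable_old_tasks(old_healthy=1, new_healthy=1, desired_count=1))   # 1
```

The reported bug corresponds to counting the new task as healthy as soon as it reaches RUNNING, which makes the first case return 1 instead of 0.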

gunzy83 on 11 May 2018

Sounds like a big issue. How can we help?

kwent on 11 May 2018

@richardpen any movement on this one...? would be great to know Docker health checks are honoured on deployments (only using them to mark containers unhealthy feels like the original feature requested in https://github.com/aws/amazon-ecs-agent/issues/534 is half baked :( )

CpuID on 15 May 2018

👍2

I'm using ECS with Consul as service discovery and this problem is causing downtime when updating my production services 😞

Any solution for this issue?

rodolphocouto on 29 May 2018

👍2

EmmN on 5 Jun 2018

👎2

ghost on 5 Jun 2018

👎1

@rodolphocouto we are doing the same thing, but there are two things we do to get around this:

1. Deploy a new copy of the service using CloudFormation, then remove the old one (we create a hash of the app version and deployment code version to make these unique).
2. Attach all of our services to a single dummy ALB for healthchecks on /health. This allows us to do cluster AMI updates and scale EC2 in and out safely (with lifecycle hooks to drain terminating instances).

We have almost 50 microservices running like this.
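
For the second workaround, the service needs a /health endpoint that only reports success once the task is actually ready. Below is a minimal sketch using Python's standard library; the `ready` flag, the `health_status` helper, and the port are assumptions made for illustration, not details from the comment above.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: the application sets it after initialization.
ready = threading.Event()


def health_status(path, is_ready):
    """Map a request path and readiness flag to (status_code, body)."""
    if path != "/health":
        return 404, b"not found"
    if is_ready:
        return 200, b"ok"
    # A non-2xx response keeps the load balancer from treating the task as healthy.
    return 503, b"starting"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = health_status(self.path, ready.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent health-check polling out of the logs


def make_server(host="0.0.0.0", port=8080):
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), HealthHandler)
```

The application would call `ready.set()` when startup completes, so a load balancer health check on /health fails during the long initialization and passes afterwards.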

gunzy83 on 8 Jun 2018

😄1

@gunzy83 thanks for your tips. We're thinking about solving this with blue-green deployment, but we'll consider your workaround too.

rodolphocouto on 8 Jun 2018

Thanks for your patience. The service team has fixed this issue, so please try it again. I'm closing this issue now; feel free to reopen it if you have further questions.

Thanks,
Peng

richardpen on 26 Jul 2018

🎉4 👍2 ❤1 😕1

We are also running into an issue where ECS does not respect health checks during rolling deploys. If a new deploy has tasks that fail health checks, the healthy tasks are killed.

We tested this by:

1. Changing the health check command to exit 1 to fail the health check.
2. Deploy.
3. See that previous healthy tasks are terminated, and the new tasks will keep flapping since they'll start, but then get killed after failing the health check.

dounan on 4 Feb 2020

Running into the same issue as @eugenebut12.

I saw this ticket is marked as fixed, but I am still running into the issue it describes.

Reproduce:

Running ECS Fargate with Service Discovery
Deployment strategy is Rolling Update

When updating a service, the healthy task gets killed (and the healthy instance gets deregistered from Cloud Map) before the new task passes its health check.

jchenseated on 5 Apr 2020

👍3

This is definitely NOT working as one expects. Services with tasks whose containers have a long startup time have their service discovery records created in R53 well before they are HEALTHY. At the same time, the HEALTHY services they are replacing are getting killed off, resulting in a period of time where the only R53 records in the service discovery zone are those pointing to services that are not ready.

AWS's solution to this is to create an ALB and hook the service into it. While this results in a properly functioning Cluster during a Service Update, it leaves much to be desired, and is not technically a fix for the seemingly broken Service Discovery feature of ECS.

Short of fixing the Service Discovery feature in ECS to actually respect the health status of Tasks (which is the only real way to address this issue), one potential way to ameliorate this is to introduce the ability to delay the killing off of old HEALTHY tasks (like the Deregistration Delay in ALB Target Groups).

rezaetezal on 4 Jun 2020

We are also running into an issue where ECS does not respect health checks during rolling deploys. If a new deploy has tasks that fail health checks, the healthy tasks are killed.
We tested this by:
    Changing the health check command to exit 1 to fail the health check
    Deploy
    See that previous healthy tasks are terminated, and the new tasks will keep flapping since they'll start, but then get killed after failing health check.

I'm attempting to repro this simple case (I tested this on both Fargate and ECS EC2).
My steps:

1. Create a simple task definition (busybox:latest sleep 3000) with no healthcheck defined.
2. Start a basic service to run 10 tasks of the above (minimum healthy percent 100, maximum percent 120).
3. Update the task definition to add a healthcheck which always fails (exit 1).
4. Update the service to use the new task definition.

I see the new tasks launch and ultimately fail while the original set of healthy tasks remains in place.
All this is the expected behavior.
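
The repro steps above can be sketched as request payloads for the ECS API. This is an illustration, not the exact configuration used in the test: the family, container, service, and cluster names, the memory value, and the health check timing values are all assumptions, and the payload targets the EC2 launch type (Fargate would need additional fields such as CPU, memory, and network configuration).

```python
def task_definition(with_failing_healthcheck=False):
    """Build a RegisterTaskDefinition payload for the busybox repro task."""
    container = {
        "name": "sleeper",               # hypothetical container name
        "image": "busybox:latest",
        "command": ["sleep", "3000"],
        "memory": 64,                    # assumed value, in MiB
        "essential": True,
    }
    if with_failing_healthcheck:
        # Step 3: a container health check that always fails.
        container["healthCheck"] = {
            "command": ["CMD-SHELL", "exit 1"],
            "interval": 30,              # assumed timing values, in seconds
            "timeout": 5,
            "retries": 3,
        }
    return {"family": "healthcheck-repro", "containerDefinitions": [container]}


def service_request(task_definition_arn):
    """Build a CreateService payload: 10 tasks, min healthy 100%, max 120%."""
    return {
        "cluster": "default",            # hypothetical cluster name
        "serviceName": "healthcheck-repro",
        "taskDefinition": task_definition_arn,
        "desiredCount": 10,
        "deploymentConfiguration": {
            "minimumHealthyPercent": 100,
            "maximumPercent": 120,
        },
    }
```

With boto3 these could be sent as `ecs.register_task_definition(**task_definition())` and `ecs.create_service(**service_request(arn))`, followed by registering the failing revision and calling `ecs.update_service(cluster=..., service=..., taskDefinition=new_arn)`.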

If you have steps to repro the issue which contradict the basic case I've outlined here, please share them with us. It would be good to rule out the basic case so we can focus on the startup time.

I see multiple comments related to a long startup time for a task. We'll continue working on new repro steps under the assumption that the core issue here is related to tasks with a long startup time. If you have clear/specific steps to repro the long startup time issues, please share as it will fast-forward our process.

fierlion on 4 Feb 2021

@fierlion Can you try this repro with an ALB in the mix?

The task is added to the target group before the Health-Checks pass, and the task is Healthy. The task should only be added to the Target Group after waiting for the health checks to pass.