Amazon-ecs-agent: ECS Service Discovery registering to Route53 before service is healthy

Created on 24 Jun 2018 · 16 Comments · Source: aws/amazon-ecs-agent

Summary

ECS Service Discovery registering to Route53 before service is healthy

Description

ECS Service Discovery registers tasks in Route53 before the service is healthy. Route53 returns SRV records in dig as soon as the task starts, without waiting for the first successful health check.

Below is my setup:
I am using ECS service discovery in bridge mode with the SRV record type. I configured the HealthCheckCustomConfig FailureThreshold as 1 and the health check grace period as 900.
My service takes around 180 secs to become healthy (it is a Java-based service). While the target group correctly waits for my service to become healthy before routing traffic, Route53 is returning SRV records before the service turns healthy. This behaviour renders ECS service discovery useless.
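
For readers who want to reproduce the setup, the configuration described above might look roughly like the following sketch. It only builds the parameter dictionaries in the shape that the Cloud Map CreateService and ECS CreateService APIs accept (for example through boto3); it makes no AWS calls. The service name, namespace ID, registry ARN, container name, port, and TTL are placeholders and not values from this issue, and the ECS parameters are partial (cluster, task definition, and load balancer settings are omitted).

```python
import json

# Cloud Map service: SRV records plus a custom health check config.
# Name, NamespaceId, and TTL are hypothetical placeholders.
cloud_map_service = {
    "Name": "my-java-service",
    "DnsConfig": {
        "NamespaceId": "ns-EXAMPLE",
        "DnsRecords": [{"Type": "SRV", "TTL": 60}],
    },
    # FailureThreshold of 1, as described in the issue.
    "HealthCheckCustomConfig": {"FailureThreshold": 1},
}

# ECS service: grace period of 900 seconds, registered with Cloud Map.
# In bridge mode with SRV records, the registry entry names the
# container and port to publish. All identifiers are placeholders.
ecs_service = {
    "serviceName": "my-java-service",
    "healthCheckGracePeriodSeconds": 900,
    "serviceRegistries": [
        {
            "registryArn": "arn:aws:servicediscovery:REGION:ACCOUNT:service/srv-EXAMPLE",
            "containerName": "app",
            "containerPort": 8080,
        }
    ],
}

print(json.dumps({"cloud_map": cloud_map_service, "ecs": ecs_service}, indent=2))
```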

Expected Behavior

Route53 should return SRV records only after service turns healthy.

Observed Behavior

Route53 is returning service SRV records before the service turns healthy.

Environment Details

Supporting Log Snippets

scopECS Service

SaikiranDaripelli

👍7

Most helpful comment

bumping this as well. This is a non-starter for a lot of teams looking to use ECS with service discovery. A huge sell of rolling updates + health checks is to validate the service before bringing it online to avoid downtime.

schmohlio on 4 Feb 2019

👍8

All 16 comments

Are you using container health checks? https://aws.amazon.com/about-aws/whats-new/2018/03/amazon-ecs-supports-container-health-checks-and-task-health-mana/

hridyeshpant on 28 Jun 2018

@hridyeshpant I am using ALB/target group health checks and did not configure Docker health checks. Does service discovery depend on Docker health checks and not on the target group health check?

SaikiranDaripelli on 28 Jun 2018

@SaikiranDaripelli

The ALB probably does not know what to check until ECS puts the SRV record into Route 53.

bzcnsh on 20 Jul 2018

According to the official documentation:

You can configure service discovery for an ECS service that is behind a load balancer, but service discovery traffic is always routed to the task and not the load balancer.

Which implies to me that the ALB does not integrate in any way with Route 53/Service Discovery. Make sure you configure your health check within the task definition.

You can provide a startPeriod within your task definition (disabled by default). This probably is what you are looking for.

bitbrain on 2 Oct 2018
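
As an illustration of the startPeriod suggestion above, here is a minimal sketch of a container health check block for a task definition, built as a Python dictionary. The field names (command, interval, timeout, retries, startPeriod) are the documented ECS health check parameters; the curl command, port, and /health path are assumptions, and the 200-second start period is simply chosen to cover the roughly 180 seconds the service in this issue needs. The 0 to 300 second bound reflects the documented limit as far as I know.

```python
import json


def container_health_check(command, start_period=0, interval=30, timeout=5, retries=3):
    """Build a healthCheck block for an ECS container definition."""
    # startPeriod is a grace window during which failed checks do not
    # count toward retries; 0 leaves it disabled.
    if not 0 <= start_period <= 300:
        raise ValueError("startPeriod must be between 0 and 300 seconds")
    return {
        "command": ["CMD-SHELL", command],
        "interval": interval,
        "timeout": timeout,
        "retries": retries,
        "startPeriod": start_period,
    }


# Hypothetical endpoint; replace with whatever the service exposes.
health_check = container_health_check(
    "curl -f http://localhost:8080/health || exit 1",
    start_period=200,
)
print(json.dumps(health_check, indent=2))
```

The resulting dictionary would go under the healthCheck key of the container definition. Whether service discovery registration honours it is exactly what the rest of this thread is about.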

@SaikiranDaripelli I can confirm this; I discussed it with the AWS team. They are working on a fix but there is no ETA. @bitbrain ECS Service Discovery registration even ignores the task health check: it adds A records to the pool while the task health check is failing and the ECS scheduler keeps trying to run the task on other EC2 instances in the cluster. So A records keep getting added and removed.

hridyeshpant on 3 Oct 2018

👍7 🎉1

@hridyeshpant Okay, this definitely sounds like a bug to me. Good to hear AWS is aware of this.

bitbrain on 4 Oct 2018

schmohlio on 4 Feb 2019

👍8

Thank you all for providing feedback on this issue. We've deployed a fix for this where we register instances as unhealthy in ECS Service Discovery during the SD provisioning state.

ellenthsu on 1 Mar 2019

🎉2 👍1

@ellenthsu As random as it might sound, is there a way to force the agent into registering unhealthy instances?

tomaszdudek7 on 12 Jul 2019

@spaszek why not just set a fake health check on the instance? there are examples on AWS, but something like echo "hello world" should suffice.

schmohlio on 12 Jul 2019
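
A "fake" health check of the kind suggested above could be sketched as follows. This is only an illustration, not an example taken from AWS documentation: the command always exits with status 0, so the container is reported healthy regardless of the application's real state. The interval, timeout, and retries values are the usual defaults and are assumptions here.

```python
import json

# A health check whose command always succeeds (echo exits with 0),
# so the container is marked healthy no matter what the app is doing.
fake_health_check = {
    "command": ["CMD-SHELL", "echo 'hello world'"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
}

print(json.dumps(fake_health_check, indent=2))
```

Because this check never fails, it trades away the protection a real health check gives during rolling updates.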

We tried to fake a health check using a setup container that reports itself as healthy, expecting a dependent container that starts later to find the DNS record.
But what we found is that the DNS record is only created after all the dependent containers in a task have started. This makes it impossible for a container to run while expecting the DNS record to exist.

avivek on 31 Dec 2019

It seems that this bug has resurfaced. We are seeing DNS records show up for hosts in RUNNING state before they reach HEALTHY.

@ellenthsu we've filed support tickets for this but haven't made much traction. Seems like a serious bug. Did something change in how the underlying Cloud Map APIs interact with ECS?

schmohlio on 31 Jan 2020

More specifically, we see that previous container versions are stopped before the new container version is HEALTHY (they are stopped when new containers are RUNNING).

schmohlio on 31 Jan 2020

@schmohlio we have identified a situation where some slow-starting tasks could be treated as healthy prematurely. We have responded in the open ticket and will deploy an improvement this month to mitigate that issue.

guoqings on 5 Feb 2020

thank you @guoqings for the update! Will look out for the deployment.

schmohlio on 5 Feb 2020

@spaszek why not just set a fake health check on the instance? there are examples on AWS, but something like echo "hello world" should suffice.

I'm stuck with this issue. My service tries to create a listener at 'serviceName:portNumber' but fails because the record doesn't get added to route53 because the health check fails. Could you point me to one of these examples of a fake health check?

pocockn on 30 Apr 2020
