Amazon-ecs-agent: ECS Service Discovery registering to Route53 before service is healthy

Created on 24 Jun 2018 · 16 Comments · Source: aws/amazon-ecs-agent

Summary

ECS Service Discovery registering to Route53 before service is healthy

Description

ECS Service Discovery registers the task in Route53 before the service is healthy. Route53 returns SRV records in dig as soon as the task starts, without waiting for the first successful health check.
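One way to check this from a script is to parse the output of `dig +short SRV <name>`, which prints one `priority weight port target` line per record. The sketch below is only an illustration: the record name and sample output are made up, and the helper `parse_dig_short_srv` is not part of any AWS tooling.

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_dig_short_srv(output: str) -> List[SrvRecord]:
    """Parse `dig +short SRV <name>` output into SrvRecord tuples."""
    records = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank lines and anything that is not an SRV answer
        try:
            priority, weight, port = (int(p) for p in parts[:3])
        except ValueError:
            continue  # first three fields must be integers
        records.append(SrvRecord(priority, weight, port, parts[3]))
    return records


# Hypothetical output captured while the task is RUNNING but not yet healthy.
sample = "1 1 32768 abcd1234.myservice.local.\n"
records = parse_dig_short_srv(sample)
print(records)
```

A non-empty result while the task is still starting up is the behaviour reported here.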

Below is my setup:
I am using ECS service discovery in bridge mode with the SRV record type. I configured the HealthCheckCustomConfig FailureThreshold as 1 and the health check grace period as 900.
My service (Java based) takes around 180 seconds to become healthy. While the target group correctly waits for my service to become healthy before routing traffic, Route53 returns SRV records before the service turns healthy. This behaviour renders ECS service discovery useless.
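For readers trying to reproduce this, the setup described above would look roughly like the following request parameters. This is a sketch, not the reporter's actual configuration: the field names follow the Cloud Map CreateService and ECS CreateService request shapes, while the service name, namespace ID, ARN, container name, port and TTL are placeholders chosen for illustration.

```python
import json

# Shaped like a Cloud Map CreateService request (IDs and TTL are placeholders).
cloud_map_service = {
    "Name": "myservice",
    "NamespaceId": "ns-EXAMPLE",
    "DnsConfig": {
        "DnsRecords": [{"Type": "SRV", "TTL": 60}],
    },
    # Custom health check with the threshold mentioned in the report.
    "HealthCheckCustomConfig": {"FailureThreshold": 1},
}

# Shaped like an ECS CreateService request (only the relevant fields).
ecs_service = {
    "serviceName": "myservice",
    "healthCheckGracePeriodSeconds": 900,
    "serviceRegistries": [
        {
            "registryArn": "arn:aws:servicediscovery:REGION:ACCOUNT:service/srv-EXAMPLE",
            # Bridge mode with SRV records needs the container name and port.
            "containerName": "app",
            "containerPort": 8080,
        }
    ],
}

print(json.dumps({"cloudMap": cloud_map_service, "ecs": ecs_service}, indent=2))
```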

Expected Behavior

Route53 should return SRV records only after service turns healthy.

Observed Behavior

Route53 is returning service SRV records before the service turns healthy.

Environment Details

Supporting Log Snippets


All 16 comments

@hridyeshpant I am using ALB/target group health checks, so I did not configure Docker health checks. Does service discovery depend on Docker health checks and not on the target group health check?

@SaikiranDaripelli

ALB probably does not know what to check until ECS puts SRV record into route 53.

According to the official documentation:

You can configure service discovery for an ECS service that is behind a load balancer, but service discovery traffic is always routed to the task and not the load balancer.

Which implies to me that ALB does not integrate in any way with Route 53/Service Discovery. Make sure you configure your health check within the task definition.

You can provide a startPeriod within your task definition (disabled by default). This probably is what you are looking for.
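As a rough sketch of what that could look like, here is a container `healthCheck` block with a `startPeriod` long enough to cover the ~180 second startup mentioned in the issue. The field names are the ones used in ECS task definitions; the curl command, port, path, container name and the specific timing values are assumptions for illustration only.

```python
import json


def container_health_check(command: str, start_period: int = 200) -> dict:
    """Build a task-definition healthCheck block with a startup grace period."""
    if not 0 <= start_period <= 300:
        # ECS accepts a startPeriod between 0 and 300 seconds.
        raise ValueError("startPeriod must be between 0 and 300 seconds")
    return {
        "command": ["CMD-SHELL", command],
        "interval": 30,   # seconds between checks
        "timeout": 5,     # seconds before a single check counts as failed
        "retries": 3,     # consecutive failures before the container is unhealthy
        "startPeriod": start_period,  # failures in this window do not count
    }


# Hypothetical health endpoint for a Java service listening on port 8080.
health_check = container_health_check("curl -f http://localhost:8080/health || exit 1")
print(json.dumps({"name": "app", "healthCheck": health_check}, indent=2))
```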

@SaikiranDaripelli I can confirm this, and I have discussed it with the AWS team. They are working on a fix, but there is no ETA. @bitbrain ECS Service Discovery registration even ignores the task health check: it adds A records to the pool while the task health check is failing and the ECS scheduler keeps trying to run the task on other EC2 instances in the cluster. So A records keep getting added and removed.

@hridyeshpant Okay, this definitely sounds like a bug to me. Good to hear AWS is aware of this.

bumping this as well. This is a non-starter for a lot of teams looking to use ECS with service discovery. A huge sell of rolling updates + health checks is to validate the service before bringing it online to avoid downtime.

Thank you all for providing feedback on this issue. We've deployed a fix for this where we register instances as unhealthy in ECS Service Discovery during the SD provisioning state.
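For context, Cloud Map services that use a custom health check expose an API to set an instance's status (UpdateInstanceCustomHealthStatus). Whether the fix described above uses this call internally is not stated in this thread; the sketch below only shows what setting the status looks like. It takes a client object so it can run without AWS access: with real credentials the client would be `boto3.client("servicediscovery")`, and the service and instance IDs here are placeholders.

```python
def set_custom_health(client, service_id: str, instance_id: str, healthy: bool) -> str:
    """Set the Cloud Map custom health status of one registered instance."""
    status = "HEALTHY" if healthy else "UNHEALTHY"
    client.update_instance_custom_health_status(
        ServiceId=service_id,
        InstanceId=instance_id,
        Status=status,
    )
    return status


class RecordingClient:
    """Stand-in for the real client that just records the calls it receives."""

    def __init__(self):
        self.calls = []

    def update_instance_custom_health_status(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
result = set_custom_health(client, "srv-EXAMPLE", "task-EXAMPLE", healthy=False)
print(result, client.calls)
```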

@ellenthsu As random as it might sound, is there a way to force the agent into registering unhealthy instances?

@spaszek why not just set a fake health check on the instance? there are examples on AWS, but something like echo "hello world" should suffice.
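A minimal sketch of such a fake check, expressed as the `healthCheck` block of a container definition. The command is the one suggested above; the interval and retries values are assumptions. Note that a check like this always passes, so it says nothing about whether the application is actually ready.

```python
import json

# Always-passing container health check: echo exits with status 0.
fake_health_check = {
    "command": ["CMD-SHELL", 'echo "hello world"'],
    "interval": 30,  # assumed value, in seconds
    "retries": 1,    # assumed value
}

print(json.dumps(fake_health_check, indent=2))
```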

We tried to fake a health check using a setup container that reports itself as healthy, expecting that a dependent container starting later would then find the DNS record.
But what we found is that the DNS record is only created after all the dependent containers in a task have started. This makes it impossible for a container to start up expecting the DNS record to exist.

It seems that this bug has resurfaced. We are seeing DNS records show up for hosts in RUNNING state before they reach HEALTHY.

@ellenthsu we've filed support tickets for this but haven't made much traction. Seems like a serious bug. Is there something that changed in how the underlying Cloud Map APIs interact with ECS?

More specifically, we see that previous container versions are stopped before the new container version is HEALTHY (they are stopped when new containers are RUNNING).

@schmohlio we have identified a situation where some slow-starting tasks could be treated as healthy prematurely. We have responded in the open ticket and will deploy an improvement this month to mitigate that issue.

thank you @guoqings for the update! Will look out for the deployment.

@spaszek why not just set a fake health check on the instance? there are examples on AWS, but something like echo "hello world" should suffice.

I'm stuck with this issue. My service tries to create a listener at 'serviceName:portNumber' but fails because the record doesn't get added to route53 because the health check fails. Could you point me to one of these examples of a fake health check?
