When updating a service or otherwise scaling ECS tasks for a service that uses Service Discovery, tasks are being stopped before the TTL of the service discovery record(s) has elapsed.
To reproduce: create an ECS service from a simple "hello world" type task definition that runs forever and does nothing. Set minimum healthy percent to 100, maximum percent to 200, and desired count to 1. Set up service discovery and create a DNS record with a long TTL, say 300s. Update the service to use a new revision of the task definition (no changes to the task definition are needed), and note that the old tasks are stopped before the TTL has elapsed.
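For anyone who wants to script the reproduction, something along these lines should do it (cluster name, namespace ID, service names, account ID, and registry ARN are placeholders; launch type and network configuration are omitted):

```
# Cloud Map service with a long-TTL A record (300s), as described above.
aws servicediscovery create-service \
  --name hello-world \
  --namespace-id ns-xxxxxxxxxxxxxxxx \
  --dns-config 'RoutingPolicy=MULTIVALUE,DnsRecords=[{Type=A,TTL=300}]'

# ECS service wired to that registry: min healthy 100%, max 200%, desired count 1.
aws ecs create-service \
  --cluster my-cluster \
  --service-name hello-world \
  --task-definition hello-world:1 \
  --desired-count 1 \
  --deployment-configuration 'minimumHealthyPercent=100,maximumPercent=200' \
  --service-registries 'registryArn=arn:aws:servicediscovery:us-east-1:123456789012:service/srv-xxxxxxxxxxxxxxxx'

# Roll the service onto a new task definition revision, then compare when the
# old task stops versus when its DNS record would have expired.
aws ecs update-service --cluster my-cluster --service hello-world --task-definition hello-world:2
```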
The ECS agent should remove the Route 53 record(s) and then wait for the TTL duration to elapse before stopping the tasks.
ECS Agent does not wait any additional time when stopping tasks for services that use service discovery.
The current behavior is outrageous; it's a serious flaw for an AWS service to have. It forces us to use an ELB/ALB, adding unnecessary cost on top of the performance impact.
There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, not just wait out the TTL, since requests that come in at the end of the TTL window may not leave enough time to be processed.
Right, when a task starts draining, ECS should immediately remove the Route 53 record. After the TTL has elapsed, the normal SIGTERM should be sent to the container, followed by SIGKILL 30 seconds later if the task is still up, just like tasks that don't use service discovery behave.
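For clarity, the requested ordering would look roughly like this if scripted by hand (service ID, instance ID, and task ARN are placeholders; the real fix obviously has to live inside ECS itself):

```
SERVICE_ID=srv-xxxxxxxxxxxxxxxx
INSTANCE_ID=<cloud-map-instance-id>
TASK_ARN=<task-arn>
TTL=300

# 1. Pull the record out of DNS the moment draining starts.
aws servicediscovery deregister-instance --service-id "$SERVICE_ID" --instance-id "$INSTANCE_ID"

# 2. Let the TTL elapse so cached answers age out.
sleep "$TTL"

# 3. Only then stop the task; ECS sends SIGTERM, then SIGKILL after the stop timeout.
aws ecs stop-task --cluster my-cluster --task "$TASK_ARN"
```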
Was hoping to avoid using an lb and this came to mind, sad to see it's an unresolved issue :(
I'm hitting the same issue.
I am building infrastructure for gRPC services using ECS Fargate and its service discovery feature, without an ELB. Communication between services goes through Envoy proxies, and each Envoy listener is reached via ECS service discovery.
I got gRPC UNAVAILABLE errors while updating the service. An Envoy forwarding a request to another Envoy can lose all of its upstream connections, because there is a window where it only knows the old containers' IP addresses, which have already been killed by SIGTERM. As a workaround, I configured the DNS TTL to a very short value such as 3s, but I still got errors for a short period of time (about 10 seconds).
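In case it helps anyone else, the TTL can be lowered on an existing Cloud Map service with something like this (the service ID is a placeholder):

```
aws servicediscovery update-service \
  --id srv-xxxxxxxxxxxxxxxx \
  --service '{"DnsConfig":{"DnsRecords":[{"Type":"A","TTL":3}]}}'
```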
I hope that the issue will be resolved.
I hope that the issue will be resolved too :(
+1.
FYI:
If you set minimum healthy to 0% and maximum to 100%, i.e. stop everything before starting new instances, your service is unreachable for several minutes due to negative DNS caching:
I've been experimenting with a service that I only ever want one instance of running; this is what a restart looks like:
Component | Event | Time in seconds since stop
-- | -- | --
task | stop | 0
task | start | 22
dns | gone | 26
service | listening | 35
ecs | ready | 82
dns | back | 270
For roughly 4 minutes the service is ready to accept connections, but DNS returns NXDOMAIN. So don't try to use Service Discovery for this purpose.
Also note that the VPC DNS resolver does not adhere to the 24h TTL set in the SOA record for the service discovery DNS zone. But you cannot change that TTL anyway, so I guess we should be happy the resolver ignores it and the service is not unreachable for 24h.
Thought I'd mention this caveat here since this is where I ended up while researching SD TTLs.
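For anyone who wants to reproduce these timings, a simple poll from inside the VPC is enough (the discovery name below is a placeholder):

```
NAME=myservice.my-namespace.local
while true; do
  # dig returns an empty answer while the record is gone / negatively cached.
  answer=$(dig +short "$NAME")
  echo "$(date +%s) ${answer:-no-answer}"
  sleep 1
done
```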
We are facing the same issue, any updates on this?
thanks 😄
The issue still exists.
> There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, not just wait out the TTL, since requests that come in at the end of the TTL window may not leave enough time to be processed.
In addition to this, the instance health should change to "unhealthy" so that API-based discovery calls do not see the instance as healthy, similar to the "deregistration delay" in target groups. Also discussed here: https://github.com/aws/containers-roadmap/issues/473
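If the Cloud Map service is set up with a custom health check, the instance can at least be flagged unhealthy the moment draining starts so that DiscoverInstances stops returning it. A minimal sketch with placeholder IDs follows; note that ECS manages the registration, so it may overwrite this status:

```
aws servicediscovery update-instance-custom-health-status \
  --service-id srv-xxxxxxxxxxxxxxxx \
  --instance-id <cloud-map-instance-id> \
  --status UNHEALTHY
```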
Related:
Having the same issue with our gRPC server in an ECS service with service discovery. We are also using Spot instances for the service. The gRPC clients cannot reach the gRPC server during a Spot instance interruption, even though ECS has spawned a new task before the current task stopped.
Hope this issue will be fixed soon.
We are facing the same issue :(
We are facing the same issue :)
Reading through the linked issue, that bug is related to not respecting TTLs. The bug we fixed in ECS was an ordering issue where some tasks may be stopped before new tasks are actively visible in DNS.
We are facing the same issue. I created a tool that lets us graph the behavior. Basically, I've seen the HTTP 503 errors show up AFTER ECS has finished deploying the new tasks and after the old tasks are shut down. The Y-axis in the graph below is the HTTP status code. Ignore the fact that my service was returning a 403; I wasn't providing a token, but that is unrelated to this point.

I noticed that during an ECS Fargate deployment, ServiceDiscovery returns an empty array for a short period of time, e.g.:
```
aws servicediscovery discover-instances --namespace-name my-namespace --service-name MyService
{
    "Instances": []
}
```
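A quick way to catch that empty window during a deployment is to poll the same call and log the instance count (namespace and service names as above):

```
while true; do
  # length(Instances) counts how many instances DiscoverInstances returns.
  count=$(aws servicediscovery discover-instances \
            --namespace-name my-namespace --service-name MyService \
            --query 'length(Instances)' --output text)
  echo "$(date +%T) instances=$count"
  sleep 2
done
```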