When updating a service or otherwise scaling ECS tasks for a service that uses Service Discovery, tasks are being stopped before the TTL of the service discovery record(s) has elapsed.
To reproduce: create an ECS service from a simple "hello world" type task definition that runs forever and does nothing. Set minimum healthy percent to 100, maximum percent to 200, and desired count to 1. Set up service discovery and create a DNS record with a long TTL, say 300s. Update the service to use a new revision of the task definition (no changes to the task definition are needed), and note that the old tasks are stopped before the TTL has elapsed.
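For anyone who wants to script the reproduction, something along these lines should do it (cluster name, namespace ID, service names, account ID, and registry ARN are placeholders; launch type and network configuration are omitted):

```
# Cloud Map service with a long-TTL A record (300s), as described above.
aws servicediscovery create-service \
  --name hello-world \
  --namespace-id ns-xxxxxxxxxxxxxxxx \
  --dns-config 'RoutingPolicy=MULTIVALUE,DnsRecords=[{Type=A,TTL=300}]'

# ECS service wired to that registry: min healthy 100%, max 200%, desired count 1.
aws ecs create-service \
  --cluster my-cluster \
  --service-name hello-world \
  --task-definition hello-world:1 \
  --desired-count 1 \
  --deployment-configuration 'minimumHealthyPercent=100,maximumPercent=200' \
  --service-registries 'registryArn=arn:aws:servicediscovery:us-east-1:123456789012:service/srv-xxxxxxxxxxxxxxxx'

# Roll the service onto a new task definition revision, then compare when the
# old task stops versus when its DNS record would have expired.
aws ecs update-service --cluster my-cluster --service hello-world --task-definition hello-world:2
```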
The ECS agent should remove the Route 53 record(s) and then wait for the TTL duration to elapse before stopping the tasks.
ECS Agent does not wait any additional time when stopping tasks for services that use service discovery.
The current behavior is outrageous; it's a serious flaw for an AWS service to have. It forces us to use an ELB/ALB, adding unnecessary cost on top of the performance impact.
There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, not just wait out the TTL, since requests that come in at the end of the TTL window may not leave enough time to be processed.
Right, when a task starts draining, ECS should immediately remove the Route 53 record. After the TTL has elapsed, the normal SIGTERM should be sent to the container, followed by SIGKILL 30 seconds later if the task is still up, just like tasks that don't use service discovery behave.
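For clarity, the requested ordering would look roughly like this if scripted by hand (service ID, instance ID, and task ARN are placeholders; the real fix obviously has to live inside ECS itself):

```
SERVICE_ID=srv-xxxxxxxxxxxxxxxx
INSTANCE_ID=<cloud-map-instance-id>
TASK_ARN=<task-arn>
TTL=300

# 1. Pull the record out of DNS the moment draining starts.
aws servicediscovery deregister-instance --service-id "$SERVICE_ID" --instance-id "$INSTANCE_ID"

# 2. Let the TTL elapse so cached answers age out.
sleep "$TTL"

# 3. Only then stop the task; ECS sends SIGTERM, then SIGKILL after the stop timeout.
aws ecs stop-task --cluster my-cluster --task "$TASK_ARN"
```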
Was hoping to avoid using an lb and this came to mind, sad to see it's an unresolved issue :(
I'm hitting the same issue.
I am building infrastructure for gRPC services using ECS Fargate and its service discovery feature, without an ELB. Communication between services goes through Envoy proxies, and each Envoy listener is reached via ECS service discovery.
I got gRPC UNAVAILABLE errors while updating the service. An Envoy forwarding a request to another Envoy can lose all of its upstream connections, because there is a window where it only knows the old containers' IP addresses, which have already been killed by SIGTERM. As a workaround, I configured the DNS TTL to a very short value such as 3s, but I still got errors for a short period of time (about 10 seconds).
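In case it helps anyone else, the TTL can be lowered on an existing Cloud Map service with something like this (the service ID is a placeholder):

```
aws servicediscovery update-service \
  --id srv-xxxxxxxxxxxxxxxx \
  --service '{"DnsConfig":{"DnsRecords":[{"Type":"A","TTL":3}]}}'
```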
I hope that the issue will be resolved.
I hope that the issue will be resolved too :(
+1.
FYI:
If you set minimum healthy to 0% and maximum to 100%, i.e. stop everything before starting new instances, your service is unreachable for several minutes due to negative DNS caching:
I've been experimenting with a service that I only ever want one instance of running; this is what a restart looks like:
Component | Event | Time in seconds since stop
-- | -- | --
task | stop | 0
task | start | 22
dns | gone | 26
service | listening | 35
ecs | ready | 82
dns | back | 270
For roughly 4 minutes the service is ready to accept connections, but DNS returns NXDOMAIN. So don't try to use Service Discovery for this purpose.
Also note that the VPC DNS resolver does not adhere to the 24h TTL set in the SOA record for the service discovery DNS zone. But you cannot change that TTL anyway, so I guess we should be happy the resolver ignores it and the service is not unreachable for 24h.
Thought I'd mention this caveat here since this is where I ended up while researching SD TTLs.
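For anyone who wants to reproduce these timings, a simple poll from inside the VPC is enough (the discovery name below is a placeholder):

```
NAME=myservice.my-namespace.local
while true; do
  # dig returns an empty answer while the record is gone / negatively cached.
  answer=$(dig +short "$NAME")
  echo "$(date +%s) ${answer:-no-answer}"
  sleep 1
done
```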
We are facing the same issue, any updates on this?
thanks 😄
The issue still exists.
> There should also be an option for the record to disappear from Route 53 as soon as the task starts draining, not just wait out the TTL, since requests that come in at the end of the TTL window may not leave enough time to be processed.
In addition to this, the instance health should change to "unhealthy" so that API-based discovery calls do not see the instance as healthy, similar to the "deregistration delay" in target groups. Also discussed here: https://github.com/aws/containers-roadmap/issues/473
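If the Cloud Map service is set up with a custom health check, the instance can at least be flagged unhealthy the moment draining starts so that DiscoverInstances stops returning it. A minimal sketch with placeholder IDs follows; note that ECS manages the registration, so it may overwrite this status:

```
aws servicediscovery update-instance-custom-health-status \
  --service-id srv-xxxxxxxxxxxxxxxx \
  --instance-id <cloud-map-instance-id> \
  --status UNHEALTHY
```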
Related:
Having the same issue with our gRPC server in an ECS service with service discovery. We are also using Spot instances for the service. The gRPC clients cannot reach the gRPC server during a Spot instance interruption, even though ECS has spawned a new task before the current task stopped.
Hope this issue will be fixed soon.
We are facing the same issue :(
We are facing the same issue :)
Reading through the linked issue, that bug is related to not respecting TTLs. The bug we fixed in ECS was an ordering issue where some tasks may be stopped before new tasks are actively visible in DNS.
We are facing the same issue. I created a tool that lets us graph the behavior. Basically, I've seen the HTTP 503 errors show up AFTER ECS has finished deploying the new tasks and after the old tasks are shut down. The Y-axis in the graph below is the HTTP status code. Ignore the fact that my service was returning a 403; I wasn't providing a token, but that is unrelated to this point.

I noticed that during an ECS Fargate deployment, ServiceDiscovery returns an empty array for a short period of time, e.g.:
```
aws servicediscovery discover-instances --namespace-name my-namespace --service-name MyService
{
    "Instances": []
}
```
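A quick way to catch that empty window during a deployment is to poll the same call and log the instance count (namespace and service names as above):

```
while true; do
  # length(Instances) counts how many instances DiscoverInstances returns.
  count=$(aws servicediscovery discover-instances \
            --namespace-name my-namespace --service-name MyService \
            --query 'length(Instances)' --output text)
  echo "$(date +%T) instances=$count"
  sleep 2
done
```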