Tell us about your request
We're looking into Fargate Spot and were wondering if there were improvements to connection draining for Load Balancers coming, such as the ECS Spot Automatic Connection draining. If not, what is the current recommendation for handling this when the termination notice goes out?
Which service(s) is this request for?
Fargate
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
We want to make sure terminating Fargate tasks are seamlessly replaced without just being killed.
Are you currently working around this issue?
We're looking into Fargate Spot now; we use normal Fargate and ECS Spot with automatic connection draining.
This is critical to any sane usage of Fargate Spot in production. Could we at least get a workaround?
@Sytten @hlarsen
The workaround should be something like:
1) Detect that a Spot task is pending termination
It seems the only way to detect this is via CW Events: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-capacity-providers.html#fargate-capacity-providers-termination - For EC2 there is an API (https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/) that exposes this information, but it does not seem to be available for Fargate Spot.
2) De-register the target in the Target Group
Could be done with CW Events Rule => Lambda.
We do the above currently for EC2 Spot termination, but instead of removing the task target from the target group, we issue an ECS Container Instance DRAIN command to remove all tasks from the EC2 host gracefully. But the principle is the same.
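The two steps above could be sketched as a Lambda behind a CW Events rule on ECS task state changes. This is only a rough sketch under several assumptions: the stop codes in `SPOT_STOP_CODES` should be checked against a real interruption event, the target group uses IP targets, and the `TARGET_GROUP_ARN` and `TARGET_PORT` environment variables are hypothetical names for configuration you would supply yourself.

```python
import os

# Assumption: stop codes that mark a Spot interruption. Verify against a real event.
SPOT_STOP_CODES = {"SpotInterruption", "TerminationNotice"}


def private_ips(detail):
    """Collect the task's private IPv4 addresses from the event's attachments."""
    ips = []
    for attachment in detail.get("attachments", []):
        for item in attachment.get("details", []):
            if item.get("name") == "privateIPv4Address":
                ips.append(item["value"])
    return ips


def deregister_task(detail, elbv2, target_group_arn, port):
    """Deregister the task's IP targets if the event is a Spot interruption.

    Returns the list of targets that were deregistered (empty if none).
    """
    if detail.get("stopCode") not in SPOT_STOP_CODES:
        return []
    targets = [{"Id": ip, "Port": port} for ip in private_ips(detail)]
    if targets:
        elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=targets)
    return targets


def handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the helpers stay testable."""
    import boto3

    return deregister_task(
        event["detail"],
        boto3.client("elbv2"),
        os.environ["TARGET_GROUP_ARN"],      # hypothetical configuration
        int(os.environ.get("TARGET_PORT", "80")),  # hypothetical configuration
    )
```

Deregistering puts the target into the draining state, so the target group's deregistration delay decides how long in-flight requests get to finish.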
I would check the task metadata API on a Fargate Spot task further to see whether that information is present (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v4.html), but from the looks of it, it isn't.
I just saw that you could also trap SIGTERM in your container to do a graceful shutdown (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-capacity-providers.html#fargate-capacity-providers-termination). The workflow would then look like this:
1) Trap SIGTERM in your container
2) Once SIGTERM is detected, start failing the health check performed by the Target Group (not to be confused with the Docker health check)
3) Step 2 should mark the target as unhealthy in your Target Group, so no more requests are sent to it. I haven't checked, but I assume in-flight requests would be allowed to finish (e.g. WebSocket connections)
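A minimal sketch of that workflow, using only the Python standard library. The `/health` path, port `8080`, and the `drain_seconds` value are assumptions for illustration; the drain time has to fit within whatever window the task gets between SIGTERM and being killed.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once SIGTERM has been received.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only flip the flag, keep serving in-flight requests."""
    shutting_down.set()


def health_status():
    """HTTP status for the Target Group health check: 503 while draining."""
    return 503 if shutting_down.is_set() else 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the health check path (assumed: /health) reports the drain state.
        status = health_status() if self.path == "/health" else 200
        body = b"ok\n" if status == 200 else b"draining\n"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080, drain_seconds=60):
    """Serve until SIGTERM, then fail health checks for drain_seconds and exit."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    server = ThreadingHTTPServer(("", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    while not shutting_down.wait(timeout=1):
        pass  # wake up periodically so the signal handler can run
    time.sleep(drain_seconds)  # let the Target Group mark the target unhealthy
    server.shutdown()
```

Calling `serve()` as the container's main process would wire this up. How quickly the target is taken out of rotation depends on the Target Group's health check interval and unhealthy threshold, so those need to be short enough for the drain window.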
Hope this gives some input on what could be done
@toredash thanks for the reply. fyi unless i'm missing something you no longer have to do that for EC2 Spot Terminations, it should be taken care of automatically by _Automated Draining for Spot Instances_, correct? it's working well for us.
this type of thing is more what i'm asking about - the cw event > manual draining solution was used by many people for EC2 before it was automated by AWS so users didn't have to deal with the extra automation, make sure it works, monitor the additional automation, etc etc.
if AWS is planning on doing the same Automated Connection Draining for Fargate Spot i don't want to waste time with the extra automation if we can just wait a few months as moving to Fargate Spot isn't a huge priority for us.
> @toredash thanks for the reply. fyi unless i'm missing something you no longer have to do that for EC2 Spot Terminations, it should be taken care of automatically by _Automated Draining for Spot Instances_, correct? it's working well for us.
That is correct, but we don't use that feature as we have built our own logic to reduce the chances of downtime. Our solution is based on CloudWatch Events, where we listen for the Spot termination signal and trigger a Lambda. The Lambda checks whether the EC2 node is part of an ECS Cluster _and_ an ASG. If so, it immediately marks the EC2 node as unhealthy in the ASG, so that the ASG requests a new Spot instance right away. Then it drains the node as normal. With this we have new capacity ready before the terminating Spot instance is reclaimed by AWS.
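Roughly, the flow described above could look like the sketch below. This is not our actual Lambda, only an illustration of the steps; the `ECS_CLUSTER` environment variable is a hypothetical name, and the clients are passed in so the logic can be exercised without AWS.

```python
import os


def handle_interruption(instance_id, cluster, autoscaling, ecs):
    """Mark the instance unhealthy in its ASG, then drain it in ECS.

    Returns True if the instance was in both an ASG and the ECS cluster.
    """
    # Is the instance managed by an Auto Scaling group?
    asg_info = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    in_asg = bool(asg_info["AutoScalingInstances"])

    # Is the instance registered as a container instance in the cluster?
    arns = ecs.list_container_instances(
        cluster=cluster, filter=f"ec2InstanceId == {instance_id}"
    )["containerInstanceArns"]

    if not (in_asg and arns):
        return False

    # Ask the ASG for replacement capacity right away.
    autoscaling.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=False,
    )
    # Then move the tasks off the node gracefully.
    ecs.update_container_instances_state(
        cluster=cluster, containerInstances=arns, status="DRAINING"
    )
    return True


def handler(event, context):
    """Lambda entry point for the EC2 Spot Instance Interruption Warning event."""
    import boto3

    return handle_interruption(
        event["detail"]["instance-id"],
        os.environ["ECS_CLUSTER"],  # hypothetical configuration
        boto3.client("autoscaling"),
        boto3.client("ecs"),
    )
```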
> this type of thing is more what i'm asking about - the cw event > manual draining solution was used by many people for EC2 before it was automated by AWS so users didn't have to deal with the extra automation, make sure it works, monitor the additional automation, etc etc.
True, but we continue with our own approach because of what I've written above.
> if AWS is planning on doing the same Automated Connection Draining for Fargate Spot i don't want to waste time with the extra automation if we can just wait a few months as moving to Fargate Spot isn't a huge priority for us.
I have no idea if they intend to, but building this into the container should not be that difficult. That of course depends on your application.