We're using alb.ingress.kubernetes.io/target-type: ip, which makes the ALB target Pods directly. We've had a few container restarts, sometimes because of OOM kills in applications that leak memory, sometimes because of an application panicking, and every time the restarts cause big 5xx spikes.
The spikes make sense. AIC reacts in 3 seconds to any event, in our experience, and that's enough for 2.5k requests to hit a dead container in our application.
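As a rough back-of-envelope check on those numbers, the sketch below derives the request rate they imply. The function names are illustrative, and the per-target rate is inferred from the two figures above rather than measured.

```python
def implied_rps(failed_count: float, reaction_seconds: float) -> float:
    """Request rate to one target implied by a failure count and a reaction time."""
    return failed_count / reaction_seconds


def failed_requests(rps_to_target: float, reaction_seconds: float) -> float:
    """Requests that reach a dead target before it is taken out of rotation."""
    return rps_to_target * reaction_seconds


# 2.5k failed requests over a 3 second reaction time implies a rate of
# roughly 833 requests per second going to the dead container.
rate = implied_rps(2500, 3)

# Assumption for illustration only: the same rate with a 1 second reaction time.
print(round(rate), round(failed_requests(rate, 1)))
```

At that rate, every extra second of reaction time costs several hundred failed requests, which is why even a short lag shows up as a visible spike.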
I'm asking this question trying to learn how folks deal with this, or whether folks don't use target-type: ip at all when reliability is important.
I'm going to share the options I came up with to avoid this situation, with their downsides:
kubectl deletes Pods before they hit 100% memory usage
Specifically when running Traefik, we've been using the requestAcceptGraceTimeout setting (which makes the Traefik app wait for a specified number of seconds before starting to terminate connections) to get around this. The issue seems to be that when an IP shows as "draining" in the ALB, this isn't actually true for 5-15s. Draining pods still receive traffic for that period.
This is a pretty gross thing to have to do though. I'm not sure how to get around it while still maintaining ip mode.
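For applications that are not Traefik, the same "keep accepting requests for a while after SIGTERM" behaviour can be approximated in the application itself. The following is a minimal sketch using only the Python standard library. The handler, the function names, the port, and the 20 second default are assumptions for illustration, not values taken from this thread.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Trivial handler standing in for the real application."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def drain_then_stop(server, term_event, accept_grace_seconds, sleep=time.sleep):
    """Wait for the termination signal, keep serving for a grace period, then stop."""
    term_event.wait()
    # The load balancer may still send traffic for a while after the target
    # shows as draining, so keep accepting requests during this window.
    sleep(accept_grace_seconds)
    server.shutdown()       # stop the serve_forever() loop
    server.server_close()   # release the listening socket


def main(port=8080, accept_grace_seconds=20.0):
    server = ThreadingHTTPServer(("", port), OkHandler)
    term_event = threading.Event()
    # SIGTERM is what the kubelet sends when the Pod is being terminated.
    signal.signal(signal.SIGTERM, lambda signum, frame: term_event.set())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    drain_then_stop(server, term_event, accept_grace_seconds)
```

Calling `main()` starts the server. The grace period has to fit inside the Pod's terminationGracePeriodSeconds, otherwise the container is killed before the delay finishes.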
Given that the pod is suddenly killed, I don't think there is a good way to work around it with ALB before they introduce server-side retries (I heard they are working on that feature).
Ideally, to solve this, we would need passive health checks that mark targets as unhealthy based on normal requests (which is not supported yet for ALB).
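To make the idea of passive health checks concrete, here is a minimal sketch of how a proxy could eject a target based on the outcome of real requests rather than periodic probes. This is not something ALB offers, as noted above. The class name, the threshold, and the cooldown are all made up for illustration.

```python
class PassiveHealthTracker:
    """Tracks target health from real request outcomes (hypothetical helper)."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 10.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._consecutive_failures = {}  # target -> failures in a row
        self._ejected_until = {}         # target -> time when it may be retried

    def record(self, target: str, ok: bool, now: float) -> None:
        """Record the outcome of one proxied request to `target`."""
        if ok:
            self._consecutive_failures[target] = 0
            return
        failures = self._consecutive_failures.get(target, 0) + 1
        if failures >= self.failure_threshold:
            # Too many failures in a row: stop routing to this target for a while.
            self._ejected_until[target] = now + self.cooldown_seconds
            failures = 0
        self._consecutive_failures[target] = failures

    def is_healthy(self, target: str, now: float) -> bool:
        """A target is routable unless it is inside its ejection window."""
        return now >= self._ejected_until.get(target, float("-inf"))
```

With `failure_threshold=2`, a container that dies would be taken out of rotation after two failed requests instead of after the next scheduled health check, so far fewer requests would be lost.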
The issue seems to be that when an IP shows as "draining" in the ALB, this isn't actually true for 5-15s. Draining pods still receive traffic for that period.
😱 I would never have guessed this. I'm being ridiculously cautious and sleeping for 90s on a preStop and having an even larger terminationGracePeriodSeconds, so I'm not seeing this, but it's darn good to know.
Ideally, to solve this, we would need passive health checks that mark targets as unhealthy based on normal requests (which is not supported yet for ALB).
This would be ideal, since not everything is idempotent and can leverage server-side retries.
Thanks for the input folks.
😱 I would never have guessed this. I'm being ridiculously cautious and sleeping for 90s on a preStop and having an even larger terminationGracePeriodSeconds, so I'm not seeing this, but it's darn good to know.
We have an open issue with AWS asking what's going on here, if I hear anything interesting I'll report back.
What AWS says:
It can take some time (a few seconds) for the configuration to be propagated to all the ALB nodes. Until the configuration of the ALB nodes is updated, they will continue to send traffic to the target, because from their point of view it is still healthy and not in deregistration delay.
The recommendation is to wait for the deregistration delay to elapse before beginning to shut the target down.
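One way to read that recommendation is as a timing budget across three settings mentioned in this thread: the target group's deregistration delay, the preStop sleep, and terminationGracePeriodSeconds. The sketch below checks such a budget. The class, its field names, and the sample numbers are assumptions for illustration, and it relies on terminationGracePeriodSeconds covering both the preStop hook and the application's own shutdown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShutdownBudget:
    deregistration_delay_s: int      # ALB target group deregistration delay
    pre_stop_sleep_s: int            # sleep in the Pod's preStop hook
    app_shutdown_s: int              # time the app needs after SIGTERM
    termination_grace_period_s: int  # Pod terminationGracePeriodSeconds


def check_budget(budget: ShutdownBudget) -> List[str]:
    """Return a list of problems; an empty list means the timings are consistent."""
    problems = []
    if budget.pre_stop_sleep_s < budget.deregistration_delay_s:
        # The app would start shutting down while ALB nodes may still send traffic.
        problems.append("preStop sleep is shorter than the deregistration delay")
    if budget.termination_grace_period_s < budget.pre_stop_sleep_s + budget.app_shutdown_s:
        # The container would be killed before preStop plus shutdown can finish.
        problems.append(
            "terminationGracePeriodSeconds is shorter than preStop sleep plus app shutdown time"
        )
    return problems


# Hypothetical values: 30s deregistration delay, 90s preStop sleep,
# 10s for the app to exit, 120s grace period.
print(check_budget(ShutdownBudget(30, 90, 10, 120)))
```

A shorter deregistration delay makes the whole budget easier to satisfy, since the preStop sleep only needs to be at least that long.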
Thanks a lot for sharing.
It does make sense to not have a deregistration delay longer than the time we take to time out requests and terminate our processes.
It's still surprising AWS would register "draining" on the API before there's consensus among ALB nodes that they are indeed draining.
The issue seems to be that when an IP shows as "draining" in the ALB, this isn't actually true for 5-15s. Draining pods still receive traffic for that period.
Although I've not seen the aforementioned happen, there's a different scenario in which the AIC itself is slow to update the ALB about ingress changes. This can happen if the queue that AIC uses internally for reconciliation is too deep. We run a cluster with about 400 ALBs, and all of them use target-type: ip. The delay gets worse when a genuine ingress update happens during an AIC background sync, which by default happens every hour. The ingress update is then delayed because it sits behind a deep queue (as deep as the number of ALBs in the worst case).
AIC also does a reconcile of all ingresses on any node event. So if a node is added/removed from the cluster, it triggers a reconcile for all ingresses. IIUC this is not required for ingresses using target-type: ip as a node update event is irrelevant.
The solution for this would be either to set a higher max-concurrent-reconciles value or to run namespace-specific AICs. Either solution, however, can cause API throttling at AWS if there are a lot of ingress resources.
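A very rough model of the queueing effect described above, assuming every reconcile takes about the same time and that workers drain the queue in equal rounds. This is not how the controller actually schedules work, and the 2 seconds per reconcile is an invented figure; only the 400 ALBs and the max-concurrent-reconciles setting come from this thread.

```python
import math


def worst_case_wait_seconds(
    queue_depth: int,
    seconds_per_reconcile: float,
    max_concurrent_reconciles: int = 1,
) -> float:
    """Estimate how long it takes to work through the items queued ahead of an update."""
    if max_concurrent_reconciles < 1:
        raise ValueError("max_concurrent_reconciles must be at least 1")
    # Each round processes up to `max_concurrent_reconciles` items in parallel.
    rounds = math.ceil(queue_depth / max_concurrent_reconciles)
    return rounds * seconds_per_reconcile


# 400 ingresses queued by a background sync, an assumed 2s per reconcile.
for workers in (1, 4):
    print(workers, worst_case_wait_seconds(400, 2.0, workers))
```

The model ignores AWS API throttling, which is the limit mentioned above on how far concurrency can be raised in practice.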
@jijotj
You should try out our v2.0.0 release, which greatly improved the reconcile time for Targets. (Also, node changes only impact instance TargetGroups :D.) TIP: If you have a large number of TargetGroups, you can increase the --targetgroupbinding-max-concurrent-reconciles flag.
@omnibs I think this can only be solved with server-side retries if the backend turns off suddenly. I will close this issue and track it in the centralized ALB feature request: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1571
Feel free to reopen if we have more questions.