Aws-load-balancer-controller: Are container restarts with 0 downtime possible with target-type: ip?

Created on 25 Sep 2020 · 8 Comments · Source: kubernetes-sigs/aws-load-balancer-controller

We're using alb.ingress.kubernetes.io/target-type: ip, which makes the ALB target Pods directly. We've had a few container restarts, sometimes because of OOM kills in applications that leak memory, sometimes because of an application panicking, and every time the restarts cause big 5xx spikes.

The spikes make sense. In our experience, AIC (aws-alb-ingress-controller) reacts to any event in 3 seconds, and that's enough for 2.5k requests to hit a dead container in our application.
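As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. The 3 seconds and the 2.5k requests come from the post; the function names are illustrative, and the calculation assumes traffic reaches the dead target at a steady rate.

```python
def implied_rps(failures: float, reaction_seconds: float) -> float:
    """Per-target request rate implied by a failure count and a reaction window."""
    return failures / reaction_seconds


def failed_requests(per_target_rps: float, reaction_seconds: float) -> float:
    """Requests that land on a dead target before it is deregistered."""
    return per_target_rps * reaction_seconds


# Figures from the post: 2.5k failed requests over a 3 second reaction time.
rate = implied_rps(2500, 3)
print(f"implied rate: {rate:.0f} req/s")

# A faster controller shrinks the spike proportionally but does not remove it.
print(f"failures with a 1.5 s reaction: {failed_requests(rate, 1.5):.0f}")
```

Under this assumption the failure count scales linearly with the reaction time, so any non-zero reaction delay still produces some failed requests.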

I'm asking this question trying to learn how folks deal with this, or whether folks don't use target-type: ip at all when reliability is important.

I'm going to share the options I came up with to avoid this situation, with their downsides:

Eliminate container restarts
- Write a CronJob that monitors memory usage and kubectl deletes Pods before they hit 100% memory usage
  - 👎 I'm reimplementing functionality I'd expect k8s to deliver (enforcing memory limits) because k8s is doing a poor job
- Put panicky applications behind something like Phusion Passenger, which handles not routing traffic to processes that are gone, restarting them, etc.
  - 👎 I'm reimplementing functionality I'd expect k8s to deliver (process monitoring) because k8s is doing a poor job
  - 👎 I'm forced to run more than one process per Pod, to load balance between them, and so use larger Pods than I'd have liked

Don't drive traffic to application Pods directly
- Use NodePort and live with iptables load balancing
  - 👎 I've heard iptables load balancing is pretty bad
  - 👎 I won't be able to use the Least Outstanding Requests algorithm
  - 👋 I'm not even sure Endpoints get updated quickly enough during container restarts, so I'm not sure this really solves the problem
- Replace aws-alb-ingress-controller with Zalando's and use their skipper ingress
  - 👎 The skipper process is an additional point of failure
  - 👎 Increases node-to-node roundtrips because skipper on one node can route traffic to applications on another node
  - 👎 Removes our ability to use Least Outstanding Requests because skipper only features round robin
- Use Istio, which can also retry requests
  - 👎 Istio doesn’t specify whether its retries go to a different endpoint than the previously failed one, like Skipper’s do
  - 👎 Istio doesn’t care about idempotency in retries
  - 👎 Introduces a few more points of failure (istio proxy, istio’s envoy sidecar)
  - 👎 Increases node-to-node roundtrips
  - 👎 We’ve seen inconsistent state scenarios with Istio in 2019, where one of 5 proxies will keep stale values and won’t self-heal, and will keep forwarding requests to dead endpoints until we manually identify and kill the inconsistent proxy pod. Don't wanna live through that again.

omnibs

Most helpful comment

@jijotj
You should try out our v2.0.0 release, which greatly improved the reconcile time for Targets. (Also, node changes only impact instance TargetGroups :D) TIP: If you have a large number of TargetGroups, you can increase the --targetgroupbinding-max-concurrent-reconciles flag.
@omnibs I think this can only be solved with server-side retries if the backend shuts down suddenly. I'll close this issue and track it in the centralized ALB feature request: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1571

Feel free to reopen if you have more questions.

M00nF1sh on 27 Oct 2020

👍2 🎉1

All 8 comments

Specifically when running Traefik, we've been using the requestAcceptGraceTimeout setting (which makes the Traefik app wait for a specified number of seconds before starting to terminate connections) to get around this. The issue seems to be that when an IP shows as "draining" in the ALB, this isn't actually true for 5-15s. Draining pods still receive traffic for that period.

This is a pretty gross thing to have to do though. I'm not sure how to get around it while still maintaining ip mode.

AirbornePorcine on 30 Sep 2020

Given that the pods are killed suddenly, I don't think there is a good way to work around it with ALB before they introduce server-side retries (I heard they are working on that feature).
Ideally, to solve this, we'd need passive health checks that mark targets as unhealthy during normal requests (which ALB doesn't support yet).
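To make "passive health checks" concrete, here is a minimal, generic sketch of the idea: a proxy counts failures on live requests and stops routing to a target after too many in a row. This is not ALB behavior (the comment notes ALB does not support it); the class name, threshold, and pod addresses are all hypothetical.

```python
from collections import defaultdict


class PassiveHealthTracker:
    """Ejects a target after too many consecutive failed live requests."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self._failures = defaultdict(int)  # target -> consecutive failure count
        self._ejected = set()

    def record(self, target: str, ok: bool) -> None:
        """Record the outcome of one real request sent to `target`."""
        if ok:
            self._failures[target] = 0
            return
        self._failures[target] += 1
        if self._failures[target] >= self.max_consecutive_failures:
            self._ejected.add(target)

    def healthy(self, targets):
        """Return the targets that are still eligible for routing."""
        return [t for t in targets if t not in self._ejected]


tracker = PassiveHealthTracker(max_consecutive_failures=2)
targets = ["10.0.1.5:8080", "10.0.2.7:8080"]  # hypothetical pod IPs

# Two failed requests in a row take the first target out of rotation.
tracker.record("10.0.1.5:8080", ok=False)
tracker.record("10.0.1.5:8080", ok=False)
print(tracker.healthy(targets))
```

The sketch never re-admits an ejected target; a real implementation would typically do so after a cool-down or a successful active probe.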

M00nF1sh on 30 Sep 2020

👍1

> The issue seems to be that when an IP shows as "draining" in the ALB, this isn't actually true for 5-15s. Draining pods still receive traffic for that period.

😱 I would never have guessed this. I'm being ridiculously cautious and sleeping for 90s on a preStop and having an even larger terminationGracePeriodSeconds, so I'm not seeing this, but it's darn good to know.
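For reference, the setup described here can be sketched as the relevant fragment of a Pod spec, written below as a Python dict with a small consistency check. The field names (terminationGracePeriodSeconds, lifecycle.preStop.exec.command) are standard Kubernetes fields and the 90 seconds comes from the comment; the 120 second grace period, the container name, and the image are assumptions for illustration, and the hook assumes a sleep binary exists in the image.

```python
PRE_STOP_SLEEP_SECONDS = 90       # from the comment above
TERMINATION_GRACE_SECONDS = 120   # assumed value; it only needs to exceed the sleep

pod_spec = {
    "terminationGracePeriodSeconds": TERMINATION_GRACE_SECONDS,
    "containers": [
        {
            "name": "app",                  # hypothetical container name
            "image": "example/app:latest",  # hypothetical image
            "lifecycle": {
                "preStop": {
                    # Keep the container serving while the ALB stops sending traffic.
                    "exec": {"command": ["sleep", str(PRE_STOP_SLEEP_SECONDS)]}
                }
            },
        }
    ],
}


def grace_period_covers_pre_stop(spec: dict, sleep_seconds: int) -> bool:
    """True if the kubelet will not force-kill the Pod before the preStop sleep ends."""
    return spec["terminationGracePeriodSeconds"] > sleep_seconds


print(grace_period_covers_pre_stop(pod_spec, PRE_STOP_SLEEP_SECONDS))
```

This only helps with planned terminations such as rollouts or scale-downs; it does nothing for a container that is OOM-killed or panics, which is the case the original question is about.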

> Ideally, to solve this, we'd need passive health checks that mark targets as unhealthy during normal requests (which ALB doesn't support yet).

This would be ideal, since not everything is idempotent and can leverage server-side retries.

Thanks for the input folks.

omnibs on 30 Sep 2020

> 😱 I would never have guessed this. I'm being ridiculously cautious and sleeping for 90s on a preStop and having an even larger terminationGracePeriodSeconds, so I'm not seeing this, but it's darn good to know.

We have an open issue with AWS asking what's going on here. If I hear anything interesting, I'll report back.

AirbornePorcine on 30 Sep 2020

❤1

What AWS says:

It can take some time (few seconds) for the configuration to be propagated on all the ALB nodes and until the configuration of the ALB nodes is updated,
the ALB nodes will continue to send the traffic to the target as it is healthy and not in deregistration delay as per it.
Recommendation is to wait for deregistration delay amount of time to begin shutting the target down.
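For applications that handle SIGTERM themselves, that recommendation can be sketched as follows: on SIGTERM, keep serving for the deregistration delay and only then stop. This is a minimal illustration, not an official pattern; the class name, the injected stop_server callback, and the 30 second value are hypothetical, and the drain time should match whatever deregistration delay the target group actually uses.

```python
import signal
import threading
import time


class DrainThenStop:
    """On SIGTERM, keep serving for `drain_seconds`, then stop the server."""

    def __init__(self, drain_seconds, stop_server, sleep=time.sleep):
        self.drain_seconds = drain_seconds
        self._stop_server = stop_server  # callback that shuts the server down
        self._sleep = sleep              # injectable so the logic can be tested
        self.draining = threading.Event()

    def drain_and_stop(self):
        # Flag the drain (e.g. for logging), but keep accepting requests while
        # the load balancer nodes catch up with the deregistration.
        self.draining.set()
        self._sleep(self.drain_seconds)
        self._stop_server()

    def on_sigterm(self, signum, frame):
        # Wait in a separate thread so the main thread can keep serving.
        threading.Thread(target=self.drain_and_stop).start()

    def install(self):
        signal.signal(signal.SIGTERM, self.on_sigterm)


# Dry run with a fake sleep and a fake server, to show the ordering.
events = []
drainer = DrainThenStop(
    drain_seconds=30,  # hypothetical; use the target group's deregistration delay
    stop_server=lambda: events.append("stopped"),
    sleep=lambda s: events.append(f"slept {s}s"),
)
drainer.drain_and_stop()
print(events)
```

In a real service, install() would be called at startup and stop_server would close the listener; the Pod's terminationGracePeriodSeconds has to be longer than the drain time, or the kubelet kills the process mid-drain.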

AirbornePorcine on 1 Oct 2020

Thanks a lot for sharing.

It does make sense to not have a deregistration delay longer than the time we take to time out requests and terminate our processes.

It's still surprising AWS would register "draining" on the API before there's consensus among ALB nodes that they are indeed draining.

omnibs on 1 Oct 2020

> The issue seems to be that when an IP shows as "draining" in the ALB, this isn't actually true for 5-15s. Draining pods still receive traffic for that period.

Although I've not seen the aforementioned happen, there's a different scenario wherein the AIC itself is slow to update the ALB about ingress changes. This can happen if the queue that AIC uses internally for reconciliation is too deep. We run a cluster with about 400 ALBs and all of them use target-type: ip. The delay gets worse when a genuine ingress update happens during an AIC background sync, which by default happens every hour. The update is delayed because the queue can be very deep (one entry per ALB in the worst case).

AIC also does a reconcile of all ingresses on any node event. So if a node is added/removed from the cluster, it triggers a reconcile for all ingresses. IIUC this is not required for ingresses using target-type: ip as a node update event is irrelevant.

The solution for this would be either to set a higher max-concurrent-reconciles value or to run namespace-specific AICs. Either solution, however, can cause API throttling at AWS if there are a lot of ingress resources.
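A back-of-the-envelope sketch of why concurrency matters here: if an update lands behind a full queue, the wait is roughly the queue depth divided by the number of concurrent reconciles, times the cost of one reconcile. The 400 ALBs come from the comment above; the 2 seconds per reconcile is a made-up figure, and the model ignores the AWS API throttling mentioned above, which would flatten the gains at higher concurrency.

```python
import math


def queue_drain_seconds(queue_depth: int,
                        concurrent_reconciles: int,
                        seconds_per_reconcile: float) -> float:
    """Rough time to work through a reconcile queue with a fixed worker count."""
    if concurrent_reconciles < 1:
        raise ValueError("concurrent_reconciles must be at least 1")
    rounds = math.ceil(queue_depth / concurrent_reconciles)
    return rounds * seconds_per_reconcile


ALBS = 400                   # from the comment: one queue entry per ALB, worst case
SECONDS_PER_RECONCILE = 2.0  # hypothetical cost of a single reconcile

for workers in (1, 4, 16):
    wait = queue_drain_seconds(ALBS, workers, SECONDS_PER_RECONCILE)
    print(f"{workers:>2} concurrent reconciles: ~{wait:.0f}s to drain the queue")
```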