I have the following ingress controller defined with the relevant bits being that it is target-type ip with a 30 second deregistration_delay.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: staging-ingress
namespace: "default"
annotations:
kubernetes.io/ingress.class: alb
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/certificate-arn: REDACTED
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-2017-01
alb.ingress.kubernetes.io/healthcheck-path: /up
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=30,access_logs.s3.enabled=true,access_logs.s3.bucket=REDACTED,access_logs.s3.prefix=REDACTED
alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
spec:
rules:
- host: REDACTED
http:
paths:
- path: /*
backend:
serviceName: ssl-redirect
servicePort: use-annotation
- path: /
backend:
serviceName: graphql-api
servicePort: 80
When I delete a pod, either manually or as part of a rolling deploy, I see 504s returned from ALB. 504s are returned when ALB cannot form a connection to its target within 10s. Here is one such message from the ALB logs:
https 2019-11-06T01:35:39.438256Z app/bd528925-default-stagingin-1b90/1a99d4560435cd18 54.248.220.14:24888 172.18.158.197:8080 -1 -1 -1 504 - 229 303 "GET REDACTED:443/up HTTP/1.1" "Amazon-Route53-Health-Check-Service (ref af4d4b0b-e40c-4f18-a18f-12f472889080; report http://amzn.to/1vsZADi)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-1:REDACTED:targetgroup/bd528925-7628515b83cd9ca7cdb/50afd948cb0a6587 "Root=1-5dc22361-b0d66bf23f845202c2c390ec" "REDACTED" "arn:aws:acm:us-east-1:REDACTED:certificate/REDACTED" 3 2019-11-06T01:35:29.436000Z "forward" "-" "-" "172.18.158.197:8080" "-"
There's a lot going on there, but the important part is that the request is received at 2019-11-06T01:35:29.436000Z and the error is emitted 10s later at 2019-11-06T01:35:39.438256Z.
I investigated the ingress controller logs and I can see that the pod in question, 172.18.158.197:8080, is deregistered at 2019-11-06T01:35:25.891546Z 4 seconds prior to when the above request is received.
I1106 01:35:25.891546 1 targets.go:95] default/staging-ingress: Removing targets from arn:aws:elasticloadbalancing:us-east-1:REDACTED:targetgroup/bd528925-7628515b83cd9ca7cdb/50afd948cb0a6587: 172.18.158.197:8080
My understanding is that once a target is set to "deregistering" ALB will not forward it anymore requests. It's unclear to me how this requests seems to be breaking that rule - any thoughts?
I should add that I've implemented a preStop delay on my pod to allow for deregistration to occur prior to pod shut down. This seems to be functioning correctly. I can see the deregistration in the controller logs, and several seconds later I see a SIGTERM on my pod. The issue seems to be between ALB and the ingress controller.
I'm definitely experiencing the same thing. Powershell preStop commands don't seem to like sleeping first either so had to really rig that up.
Is there any insight as to why terminated pods still get sent requests? Happens if the type is IP or instance.
@jorihardman is this happening every single time you update a deployment? Or only sometimes.
@nicholasgcoles To be honest when I was investigating this issue, my sample size was only about 5 deployments, each rolling 4 pods. I saw at least 1 504 for each of these. I didn't have constant traffic for each deploy, so honestly, it wasn't a perfect experimental setup. Still, the logs seemed to me to be a smoking gun for what I would expect to be an impossible situation, so I figured it was worth raising the question. Happy to investigate further with some guidance.
This looks similar to #905
We are able to reproduce this 100% of time with target-type as ip. We added 30 seconds sleep delay for preStop which resolved 502s. However, 504s happen even after target has been removed from ALB. We have verified that 504s are caused by ALB forwarding requests to target that was removed.
Checking back in here. I think I was able to resolve this issue with the preStop sleep workaround to achieve 100% availability though deployments.
Couple things I changed from my initial config:
# Extend the pods shutdown grace period from the default of 30s to 60s.
# This goes in the pod template spec.
terminationGracePeriodSeconds: 60
# Increase the sleep before SIGTERM to 25s. I had this as 5s previously and it wasn't enough.
lifecycle:
preStop:
exec:
command: ["sleep", "25"]
Extending the sleep time allows ALB to send a few requests even after the dereregistration time without the pod rejecting them. My pod needs max 30s to gracefully answer all requests and terminate, so the 25s sleep + 30s = 55s. The extended terminationGracePeriodSeconds of 60s allows that whole shutdown process to happen without killing the pod.
I used https://github.com/JoeDog/siege to send constant concurrent requests to the load balancer while deploying and achieved 100% availability.
I'm going to close this issue as it seems to be an AWS problem and not related to the controller. The workaround is to add sufficient preStop sleep to allow ALB to complete deregistration. I did 25s which might be overkill but seemed to mitigate the problem.
Previous testing was with our staging environment. When I shipped the changes to production, I still see 504s from ALB sending requests to terminating pods.
Agreed. I'm running windows containers but all of the theory is the same. Actually using a 60s sleep at this point and still getting 504s from the ALB during rollout.
preStop:
exec:
command:
- powershell.exe
- -c
- echo "Stopping container"; Start-Sleep 60; Stop-Website "Main"
Strangely, I haven't gotten a 504 during deploys for the last 48 hours and I've done 3 or 4 deploys at peak traffic times. I did tweak the readiness checks on my pod to make them ready quicker (lowered initialDelaySeconds). Perhaps that has allowed them to become healthy on the LB faster.
Naturally, as soon as I signal success, I see another spurt of 502s. There's definitely still a deregistration race condition here. This time I got 502s due to a pod terminating prior to being deregistered. This seems to be pretty non-deterministic at this point. Sometimes alb-ingress-controller keeps up with changing targets, and sometimes it doesn't.
Any ETA on this?
For those who use the setup:
...here's an example of how to avoid the problem:
https://github.com/linkerd/linkerd2/issues/3747#issuecomment-557593979
I am not an expert in ALB internals but I feel it happens because of cross-region nature (and, maybe, multi-tenant). It takes time to propagate changes to all instances, so ALB will send some traffic for a while even after a target is requested to be deregister. So the overall "sleep" configuration for preStop hook should be calculated based on:
Have you guys tried --feature-gates=waf=false? This fixed the delay issue for us. You can see in following graph, most of retires are for waf[1] which seems to backup the ratelimit queue. You can collect queue related metrics by adding MetricsBindAddress: ":8080" here. One can see longest running process is reaching upwards of 300 sec.

After disabling waf, longest running process was down to max of 5 sec (still needs sleep of 10 sec in prestop).
I have verified that the issue exists in v1.1.5. Let me know if you need any further info or help fixing this.
[1] ingress aws api metrics collected with prometheus.io/port: '10254'
@aditya-pr , yes, and it is not enough to get the smooth rolling update.
@aditya-pr Amazing dashboard you've got there. Wish I had that when I was digging into this issue. To clarify, are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?
are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?
Yes, there is lag even after disabling waf. There are other aws api calls which are still being retried. Adding all endpoints of an ingress to queue whenever any endpoint updates must be making this worse. Deregistration delay of even 5 seconds means elb is sending to a terminated pod, ie 504 Gateway Timeout error.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
My first community post! Wanted to see if I could help by playing with a number of settings in my lab over the weekend. The below config seemed to address all packet loss with my Apache Benchmark setup, I tried to annotate the config with some thoughts on why this might be happening. I would appreciate any feedback if I missed anything. Hope you find it helpful!
# This was made for lab testing, tried to point out settings that should be changed.
# Adding full config for ease of testing using NGINX.
apiVersion: v1
kind: Service
metadata:
name: nginx-service
spec:
ports:
- port: 80
protocol: TCP
targetPort: 80
type: NodePort
selector:
app: nginx
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: nginx-ingress
annotations:
# Running IP mode as it makes the cluster autoscaler taint vs. no schedule issue easier for me to address
# I beleive CA plays a role in the 502's and 504's
kubernetes.io/ingress.class: alb
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/scheme: internet-facing
spec:
rules:
- http:
paths:
- backend:
serviceName: nginx-service
servicePort: 80
path: /*
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: nginx-service
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: nginx-deployment
# I found it important to have a good base of pods, when I had 1,000 connection a sec using the ab testing tool
# HPA used vs. setting replicas. HPA was set at 50% to stay head of traffic scale up with CA since that takes time.
minReplicas: 5
maxReplicas: 50
targetCPUUtilizationPercentage: 50
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
# Replicas taken out here
# WARNING - The deployment must be new and not patched for this to work, or replicas will go to one by default
strategy:
# The readiness gate seems to take a bit longer to come on line (15-30sec)
# I bumped the max surge to adress however this will have a resource and IP usage impact, use wisely
rollingUpdate:
maxSurge: 50%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
labels:
app: nginx
# Readiness gate feature is a important part of this, I think the ReplicaSets were coming up to fast
# My understanding is this feature works for new pods coming up only! Still need sleeps preStops for spin down
spec:
readinessGates:
- conditionType: target-health.alb.ingress.k8s.aws/nginx-ingress_nginx-service_80
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
# 250m for testing to eat up the m5.large boxes. 200m ran the nodes hot at 98% util.
# Limits set to requests so resource eviction doesn't happen at worst time when the box is overloaded
resources:
limits:
cpu: 250m
memory: 128Mi
requests:
cpu: 250m
memory: 128Mi
# Notice sleep timer on before soft shutdown of nginx, timeout of 60 for ALB,and -s quit for soft shutdown of NGINX
# This seems to work for basic scale-in of pods. Keep in mind preStop is per container
# If testing with Envoy sidecars, you'll need something there.
lifecycle:
preStop:
exec:
command: ["/bin/bash","-c","/bin/sleep 60; /usr/sbin/nginx -s quit"]
# I think service controller loop is 100 secs for LB drain, wanted a bit of a buffer.
terminationGracePeriodSeconds: 120
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with/reopen.
Mark the issue as fresh with/remove-lifecycle rotten.Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Most helpful comment
Checking back in here. I think I was able to resolve this issue with the
preStopsleep workaround to achieve 100% availability though deployments.Couple things I changed from my initial config:
Extending the sleep time allows ALB to send a few requests even after the dereregistration time without the pod rejecting them. My pod needs max 30s to gracefully answer all requests and terminate, so the 25s sleep + 30s = 55s. The extended
terminationGracePeriodSecondsof 60s allows that whole shutdown process to happen without killing the pod.I used https://github.com/JoeDog/siege to send constant concurrent requests to the load balancer while deploying and achieved 100% availability.