Aws-load-balancer-controller: ALB sending requests to pods after ingress controller deregisters them leading to 504s

Created on 6 Nov 2019  路  25Comments  路  Source: kubernetes-sigs/aws-load-balancer-controller

I have the following ingress controller defined with the relevant bits being that it is target-type ip with a 30 second deregistration_delay.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: staging-ingress
  namespace: "default"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: REDACTED
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-2017-01
    alb.ingress.kubernetes.io/healthcheck-path: /up
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=30,access_logs.s3.enabled=true,access_logs.s3.bucket=REDACTED,access_logs.s3.prefix=REDACTED
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
spec:
  rules:
    - host: REDACTED
      http:
        paths:
          - path: /*
            backend:
              serviceName: ssl-redirect
              servicePort: use-annotation
          - path: /
            backend:
              serviceName: graphql-api
              servicePort: 80

When I delete a pod, either manually or as part of a rolling deploy, I see 504s returned from ALB. 504s are returned when ALB cannot form a connection to its target within 10s. Here is one such message from the ALB logs:

https 2019-11-06T01:35:39.438256Z app/bd528925-default-stagingin-1b90/1a99d4560435cd18 54.248.220.14:24888 172.18.158.197:8080 -1 -1 -1 504 - 229 303 "GET REDACTED:443/up HTTP/1.1" "Amazon-Route53-Health-Check-Service (ref af4d4b0b-e40c-4f18-a18f-12f472889080; report http://amzn.to/1vsZADi)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-1:REDACTED:targetgroup/bd528925-7628515b83cd9ca7cdb/50afd948cb0a6587 "Root=1-5dc22361-b0d66bf23f845202c2c390ec" "REDACTED" "arn:aws:acm:us-east-1:REDACTED:certificate/REDACTED" 3 2019-11-06T01:35:29.436000Z "forward" "-" "-" "172.18.158.197:8080" "-"

There's a lot going on there, but the important part is that the request is received at 2019-11-06T01:35:29.436000Z and the error is emitted 10s later at 2019-11-06T01:35:39.438256Z.

I investigated the ingress controller logs and I can see that the pod in question, 172.18.158.197:8080, is deregistered at 2019-11-06T01:35:25.891546Z 4 seconds prior to when the above request is received.

I1106 01:35:25.891546       1 targets.go:95] default/staging-ingress: Removing targets from arn:aws:elasticloadbalancing:us-east-1:REDACTED:targetgroup/bd528925-7628515b83cd9ca7cdb/50afd948cb0a6587: 172.18.158.197:8080

My understanding is that once a target is set to "deregistering" ALB will not forward it anymore requests. It's unclear to me how this requests seems to be breaking that rule - any thoughts?

lifecyclrotten

Most helpful comment

Checking back in here. I think I was able to resolve this issue with the preStop sleep workaround to achieve 100% availability though deployments.

Couple things I changed from my initial config:

# Extend the pods shutdown grace period from the default of 30s to 60s.
# This goes in the pod template spec.
terminationGracePeriodSeconds: 60

# Increase the sleep before SIGTERM to 25s. I had this as 5s previously and it wasn't enough.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "25"]

Extending the sleep time allows ALB to send a few requests even after the dereregistration time without the pod rejecting them. My pod needs max 30s to gracefully answer all requests and terminate, so the 25s sleep + 30s = 55s. The extended terminationGracePeriodSeconds of 60s allows that whole shutdown process to happen without killing the pod.

I used https://github.com/JoeDog/siege to send constant concurrent requests to the load balancer while deploying and achieved 100% availability.

All 25 comments

I should add that I've implemented a preStop delay on my pod to allow for deregistration to occur prior to pod shut down. This seems to be functioning correctly. I can see the deregistration in the controller logs, and several seconds later I see a SIGTERM on my pod. The issue seems to be between ALB and the ingress controller.

I'm definitely experiencing the same thing. Powershell preStop commands don't seem to like sleeping first either so had to really rig that up.

Is there any insight as to why terminated pods still get sent requests? Happens if the type is IP or instance.

@jorihardman is this happening every single time you update a deployment? Or only sometimes.

@nicholasgcoles To be honest when I was investigating this issue, my sample size was only about 5 deployments, each rolling 4 pods. I saw at least 1 504 for each of these. I didn't have constant traffic for each deploy, so honestly, it wasn't a perfect experimental setup. Still, the logs seemed to me to be a smoking gun for what I would expect to be an impossible situation, so I figured it was worth raising the question. Happy to investigate further with some guidance.

This looks similar to #905

We are able to reproduce this 100% of time with target-type as ip. We added 30 seconds sleep delay for preStop which resolved 502s. However, 504s happen even after target has been removed from ALB. We have verified that 504s are caused by ALB forwarding requests to target that was removed.

Checking back in here. I think I was able to resolve this issue with the preStop sleep workaround to achieve 100% availability though deployments.

Couple things I changed from my initial config:

# Extend the pods shutdown grace period from the default of 30s to 60s.
# This goes in the pod template spec.
terminationGracePeriodSeconds: 60

# Increase the sleep before SIGTERM to 25s. I had this as 5s previously and it wasn't enough.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "25"]

Extending the sleep time allows ALB to send a few requests even after the dereregistration time without the pod rejecting them. My pod needs max 30s to gracefully answer all requests and terminate, so the 25s sleep + 30s = 55s. The extended terminationGracePeriodSeconds of 60s allows that whole shutdown process to happen without killing the pod.

I used https://github.com/JoeDog/siege to send constant concurrent requests to the load balancer while deploying and achieved 100% availability.

I'm going to close this issue as it seems to be an AWS problem and not related to the controller. The workaround is to add sufficient preStop sleep to allow ALB to complete deregistration. I did 25s which might be overkill but seemed to mitigate the problem.

Previous testing was with our staging environment. When I shipped the changes to production, I still see 504s from ALB sending requests to terminating pods.

Agreed. I'm running windows containers but all of the theory is the same. Actually using a 60s sleep at this point and still getting 504s from the ALB during rollout.

            preStop:
              exec:
                command:
                - powershell.exe
                - -c
                - echo "Stopping container"; Start-Sleep 60; Stop-Website "Main"

Strangely, I haven't gotten a 504 during deploys for the last 48 hours and I've done 3 or 4 deploys at peak traffic times. I did tweak the readiness checks on my pod to make them ready quicker (lowered initialDelaySeconds). Perhaps that has allowed them to become healthy on the LB faster.

Naturally, as soon as I signal success, I see another spurt of 502s. There's definitely still a deregistration race condition here. This time I got 502s due to a pod terminating prior to being deregistered. This seems to be pretty non-deterministic at this point. Sometimes alb-ingress-controller keeps up with changing targets, and sometimes it doesn't.

Any ETA on this?

For those who use the setup:

  • ALB
  • AWS CNI
  • ALB Ingress Controller
  • Linkerd v2

...here's an example of how to avoid the problem:

https://github.com/linkerd/linkerd2/issues/3747#issuecomment-557593979

I am not an expert in ALB internals but I feel it happens because of cross-region nature (and, maybe, multi-tenant). It takes time to propagate changes to all instances, so ALB will send some traffic for a while even after a target is requested to be deregister. So the overall "sleep" configuration for preStop hook should be calculated based on:

  • time for ALB Ingress Controller to send the deregistration API call, especially if AWS WAF is in play
  • Deregistration Delay configured for the ALB
  • some additional time for ALB to propagate all the changes (from my tests it takes up to 5 seconds, but it can be really different in your setup)
  • and after the main container exited, some additional time (yet another preStop hook) specifically for sidecar containers such as Istio, Linkerd or any other to keep internal network up till the very end.

Have you guys tried --feature-gates=waf=false? This fixed the delay issue for us. You can see in following graph, most of retires are for waf[1] which seems to backup the ratelimit queue. You can collect queue related metrics by adding MetricsBindAddress: ":8080" here. One can see longest running process is reaching upwards of 300 sec.

Screenshot from 2020-01-21 16-01-26

After disabling waf, longest running process was down to max of 5 sec (still needs sleep of 10 sec in prestop).

I have verified that the issue exists in v1.1.5. Let me know if you need any further info or help fixing this.
[1] ingress aws api metrics collected with prometheus.io/port: '10254'

@aditya-pr , yes, and it is not enough to get the smooth rolling update.

@aditya-pr Amazing dashboard you've got there. Wish I had that when I was digging into this issue. To clarify, are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?

are you saying that v1.1.5 still experiences WAF queue lag even with no WAF rules?

Yes, there is lag even after disabling waf. There are other aws api calls which are still being retried. Adding all endpoints of an ingress to queue whenever any endpoint updates must be making this worse. Deregistration delay of even 5 seconds means elb is sending to a terminated pod, ie 504 Gateway Timeout error.
Screenshot from 2020-02-05 14-13-14

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

My first community post! Wanted to see if I could help by playing with a number of settings in my lab over the weekend. The below config seemed to address all packet loss with my Apache Benchmark setup, I tried to annotate the config with some thoughts on why this might be happening. I would appreciate any feedback if I missed anything. Hope you find it helpful!

# This was made for lab testing, tried to point out settings that should be changed.
# Adding full config for ease of testing using NGINX.

apiVersion: v1   
kind: Service
metadata:
  name: nginx-service
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  type: NodePort
  selector:
    app: nginx
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nginx-ingress
  annotations:
  # Running IP mode as it makes the cluster autoscaler taint vs. no schedule issue easier for me to address
  # I beleive CA plays a role in the 502's and 504's 
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  rules:
    - http:
        paths:
          - backend:
              serviceName: nginx-service
              servicePort: 80
            path: /*
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  # I found it important to have a good base of pods, when I had 1,000 connection a sec using the ab testing tool
  # HPA used vs. setting replicas. HPA was set at 50% to stay head of traffic scale up with CA since that takes time.
  minReplicas: 5
  maxReplicas: 50
  targetCPUUtilizationPercentage: 50
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
# Replicas taken out here
# WARNING - The deployment must be new and not patched for this to work, or replicas will go to one by default
  strategy:
  # The readiness gate seems to take a bit longer to come on line (15-30sec)
  # I bumped the max surge to adress however this will have a resource and IP usage impact, use wisely
    rollingUpdate:                                                                                                                                        
      maxSurge: 50%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: nginx
  # Readiness gate feature is a important part of this, I think the ReplicaSets were coming up to fast
  # My understanding is this feature works for new pods coming up only! Still need sleeps preStops for spin down
    spec:
      readinessGates:
      - conditionType: target-health.alb.ingress.k8s.aws/nginx-ingress_nginx-service_80
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  # 250m for testing to eat up the m5.large boxes. 200m ran the nodes hot at 98% util. 
  # Limits set to requests so resource eviction doesn't happen at worst time when the box is overloaded
        resources:
          limits:
            cpu: 250m
            memory: 128Mi
          requests:
            cpu: 250m
            memory: 128Mi
  # Notice sleep timer on before soft shutdown of nginx, timeout of 60 for ALB,and -s quit for soft shutdown of NGINX
  # This seems to work for basic scale-in of pods. Keep in mind preStop is per container
  # If testing with Envoy sidecars, you'll need something there. 
        lifecycle:
          preStop:
            exec:
              command: ["/bin/bash","-c","/bin/sleep 60; /usr/sbin/nginx -s quit"]
  # I think service controller loop is 100 secs for LB drain, wanted a bit of a buffer. 
      terminationGracePeriodSeconds: 120

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Was this page helpful?
0 / 5 - 0 ratings