Aws-load-balancer-controller: ELB returns 502s for requests sporadically after configuring through ALB ingress controller

Created on 8 Aug 2019 · 4 Comments · Source: kubernetes-sigs/aws-load-balancer-controller

Hello,

Thanks for developing this controller. I have been using it to expose a number of services from an EKS cluster. I'm running into an issue and am hopeful that someone can point me in the right direction or suggest a debugging approach.

Configuration

I have a service, foo-service, of type NodePort. I have installed the ALB ingress controller and set it up to use my AWS keys (rather than the IAM RBAC configuration option).

Here's my service:

apiVersion: v1
kind: Service
metadata:
  name: foo-service
spec:
  type: NodePort
  ports:
  - name: service
    port: 80
    protocol: TCP
    targetPort: 3000
  selector:
    app: foo

Here is my ingress:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/subnets: "subnet-foo,subnet-bar"
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'
    alb.ingress.kubernetes.io/tags: Component=foo,Creator=Robert Quinlivan
    alb.ingress.kubernetes.io/scheme: internet-facing
  name: foo-ingress
spec:
  rules:
    - http:
        paths:
          - backend:
              serviceName: foo-service
              servicePort: 80
            path: /*
      host: foo.mycompany.com

Result

After I apply this, I see an ELB hostname appear in the ingress. After I apply the Route53 config to point foo.mycompany.com to that hostname, it works. However, I get a large number of 502 "Bad Gateway" responses, seemingly at random, that render the service unusable.

Requests to foo.mycompany.com succeed about half the time; the other half return a 502. I am reasonably sure the 502 isn't bubbling up from the service itself, because if I port-forward to it (e.g. kubectl port-forward service/foo-service 9000:80) it works fine. In addition, the 502 response has the following headers, which suggest that it is indeed the ELB generating the 502:

< HTTP/1.1 502 Bad Gateway
HTTP/1.1 502 Bad Gateway
< Server: awselb/2.0
Server: awselb/2.0
< Date: Thu, 08 Aug 2019 15:48:54 GMT
Date: Thu, 08 Aug 2019 15:48:54 GMT
< Content-Type: text/html
Content-Type: text/html
< Content-Length: 138
Content-Length: 138
< Connection: keep-alive
Connection: keep-alive

Conclusion

It would appear that the ALB controller did not configure the ELB correctly, or that there is some configuration issue between the ELB and Kubernetes that needs to be resolved. I don't see anything very useful in the CloudWatch metrics, just confirmation that the load balancer is indeed sending a lot of 502s.

Any idea where to go from here?

Thanks

All 4 comments

A 502 half the time might indicate that one of the nodes (assuming a 2-node cluster) is unhealthy for some reason. I'm not sure how this works in EKS, but with kops on EC2 you can get this behaviour. A request that hits the node the Pod actually resides on will succeed, whereas a request that hits a node where the Pod isn't hosted will fail. In my case this was because kube-proxy could not forward to the Pod on the other node, which came down to some missing security group rules. I also ran into an edge case where I needed to disable the SRC/DST check on ENIs, but I don't think that's the case here.

@allanyung Can you post the missing rules here?

I ran across the same issue running on EKS. Loading a site will split requests between 200 and 502.

Solved it by adding the annotation alb.ingress.kubernetes.io/target-type: ip to the Ingress resource.
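
For reference, here is a minimal sketch of where that annotation would sit on the ingress from the original post. Only the target-type line is new; the rest is copied from the manifest above. It assumes pod IPs are routable inside the VPC (as with the default AWS VPC CNI on EKS), since in ip mode the load balancer registers pod IPs as targets directly instead of going through node ports.

```yaml
# Fragment of the foo-ingress metadata from the original post; spec is unchanged.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: foo-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Register pod IPs as targets rather than instance node ports.
    alb.ingress.kubernetes.io/target-type: ip
```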

In summary, to achieve zero-downtime deployments, you need the following.

Under instance mode:

  1. a preStop hook that sleeps, on the pods
  2. HTTP keep-alive disabled

Under IP mode:

  1. a preStop hook that sleeps, on the pods
  2. a readiness gate (readinessGate)
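
The preStop hook common to both modes might look roughly like the sketch below. The Deployment name, image, replica count, and the 30-second sleep are hypothetical values chosen for illustration, not taken from this thread; the app label and container port match foo-service above. The idea is that the sleep keeps the pod serving while the load balancer deregisters the target.

```yaml
# Hypothetical Deployment backing foo-service; name, image, and timings are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
    spec:
      # Should exceed the preStop sleep so the container is not killed mid-hook.
      terminationGracePeriodSeconds: 60
      containers:
        - name: foo
          image: example.com/foo:latest  # placeholder image
          ports:
            - containerPort: 3000
          lifecycle:
            preStop:
              exec:
                # Assumes the image ships a `sleep` binary; 30s is illustrative.
                command: ["sleep", "30"]
```

The sleep duration would normally be tuned against the target group's deregistration delay, which this thread does not specify.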

😄

closing this issue.
