aws-load-balancer-controller: Getting 502 Bad Gateway on EKS aws-alb-ingress

Created on 22 Jul 2019 · 28 Comments · Source: kubernetes-sigs/aws-load-balancer-controller

lifecycle/rotten

nandhyala

👍32

All 28 comments

kind: Service
apiVersion: v1
metadata:
  labels:
    app: xxxx-cam-sdk
    env: dev
  name: xxxx-cam-sdk-service
  namespace: xxxx-cam-sdk
spec:
  type: NodePort
  ports:
    - port: 443
      targetPort: 8000
  selector:
    app: xxxx-cam-sdk
    env: dev

---

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "xxxx-cam-sdk-ingress"
  namespace: "xxxx-cam-sdk"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/subnets: subnet-0ffd85d9d967xxxx, subnet-0fcb01b837xxxxx
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/backend-protocol: HTTPS
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-1:67xxxxxx:certificate/xxxxx82-fc71-40ff-b625-cc3d6585aad2
  labels:
    app: xxxx-cam-sdk
    env: dev
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: "xxxx-cam-sdk-service"
              servicePort: 443

nandhyala on 22 Jul 2019

Can you please review the Service and Ingress YAML above and let me know what is causing the 502 Bad Gateway error?

nandhyala on 22 Jul 2019

same issue

pdeva on 22 Jul 2019

I'm having the same issue here, below is my ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "2048-ingress"
  namespace: "2048-game"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/backend-protocol: HTTPS
    # ACM certificate ARN for your SSL domain
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:xxxx:certificate/xxxx
  labels:
    app: 2048-ingress
spec:
  rules:
    - host: 2048.cpaface.biz
      http:
        paths:
          - path: /*
            backend:
              serviceName: "service-2048"
              servicePort: 80

The host "2048.cpaface.biz" resolves to its correct ELB DNS name:

nslookup 2048.cpaface.biz
Server:     10.1.1.1
Address:    10.1.1.1#53

Non-authoritative answer:
2048.cpaface.biz    canonical name = ffb06d37-2048game-2048ingr-6fa0-879326269.us-east-1.elb.amazonaws.com.
Name:   ffb06d37-2048game-2048ingr-6fa0-879326269.us-east-1.elb.amazonaws.com
Address: 54.227.159.166
Name:   ffb06d37-2048game-2048ingr-6fa0-879326269.us-east-1.elb.amazonaws.com
Address: 3.209.166.62

The ALB rules look good to me as well, but I still get 502 Bad Gateway:

curl -k -v  https://2048.cpaface.biz
* Rebuilt URL to: https://2048.cpaface.biz/
*   Trying 54.227.159.166...
* TCP_NODELAY set
* Connected to 2048.cpaface.biz (54.227.159.166) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=cpaface.biz
*  start date: Aug 15 00:00:00 2019 GMT
*  expire date: Sep 15 12:00:00 2020 GMT
*  issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fa49600c600)
> GET / HTTP/2
> Host: 2048.cpaface.biz
> User-Agent: curl/7.54.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 502
< server: awselb/2.0
< date: Wed, 11 Sep 2019 12:37:10 GMT
< content-type: text/html
< content-length: 138
<
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>
* Connection #0 to host 2048.cpaface.biz left intact

email2liyang on 11 Sep 2019

Same here. Is there any progress on this issue? Thanks

gmolaire on 3 Oct 2019

👍5

Is there a workaround for this problem?

robsonpeixoto on 18 Oct 2019

Same issue, even with other types of load balancers (Classic, NLB, ALB).
I will update if I find a resolution.

abhilash-thumma on 19 Oct 2019

502 means that what sits behind the load balancer is returning an abnormal response. Are you able to call your back-end from its container with a simple curl?

In my case, my container was not set up properly to receive requests.

gmolaire on 19 Oct 2019

Yes, I am able to call it from the container's IP.

abhilash-thumma on 20 Oct 2019

Are you able to call it through the service definition being used by the ingress?

gmolaire on 20 Oct 2019
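
The two questions above amount to testing each hop separately: the pod IP, then the Service, then the ALB hostname. A minimal Python sketch of that comparison follows, assuming all URLs are reachable from wherever the script runs (for example, a debug pod inside the cluster). The helper names `fetch_status` and `first_failing_hop` are hypothetical and not part of the controller.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def fetch_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, TLS failure, ...
        return None


def first_failing_hop(
    hops: Dict[str, str],
    fetch: Callable[[str], Optional[int]] = fetch_status,
) -> Optional[str]:
    """Probe hops in insertion order; return the name of the first hop that
    gives no response or a 5xx status, or None if every hop looks healthy."""
    for name, url in hops.items():
        status = fetch(url)
        if status is None or status >= 500:
            return name
    return None
```

Pass the hops in order from the pod outward, for example `{"pod": ..., "service": ..., "alb": ...}`, with your own URLs. If the pod and Service answer but the ALB hop fails, the problem most likely sits between the load balancer and its targets rather than in the application. When the Ingress sets `backend-protocol: HTTPS`, probe the pod and Service with `https://` URLs as well, since the ALB will speak TLS to them. Note that `urllib` verifies certificates, so a pod serving a self-signed certificate is reported as failing here even though the ALB does not validate target certificates.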

Same issue. A small number of 502 errors occur.

khacminh on 24 Oct 2019

👍2

I am facing a similar issue. My Kubernetes services scale up when the number of requests per second reaches a certain value, but I sometimes get random 502 errors during peak times.

apiVersion: extensions/v1beta1
kind: Deployment
spec:
  replicas: 2
  minReadySeconds: 50
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 50%
  template:
    spec:
      containers:
        # container name and image were omitted in the original post
        - resources:
            requests:
              cpu: 1900m
              memory: 2500Mi
            limits:
              cpu: 1900m
              memory: 2500Mi
          envFrom:
            - secretRef:
                name: kube-auth-api
          readinessProbe:
            httpGet:
              path: /status
              port: 3001
            initialDelaySeconds: 60
            periodSeconds: 15
            timeoutSeconds: 15
          livenessProbe:
            httpGet:
              path: /status
              port: 3001
            initialDelaySeconds: 60
            periodSeconds: 15
            timeoutSeconds: 15
      imagePullSecrets:
        - name: awsecr-cred

I get random 502 errors even when all the containers are healthy and are not even restarting.

prcongithub on 24 Oct 2019

👍7

Can you show the logs of the container during the 502 period?

gmolaire on 25 Oct 2019

Will try to reproduce this and get back with the logs soon.

prcongithub on 31 Oct 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 29 Jan 2020

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 28 Feb 2020

/remove-lifecycle rotten

vicky997 on 1 Mar 2020

Any update? I'm facing the same issue. On DigitalOcean, a simple readiness probe implementation worked fine, but on AWS it doesn't.

Serrvosky on 16 Mar 2020

@Serrvosky the readiness probe is merged into master; we'll do a release this week.

M00nF1sh on 16 Mar 2020

@M00nF1sh I checked today, and a new release came out today. Am I right? How can I check whether this problem has been fixed? Do I have to update my AWS k8s cluster?

Thanks

Serrvosky on 23 Mar 2020

@Serrvosky
Yes, you need to upgrade to v1.1.6:

First, update the IAM permissions: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/blob/v1.1.6/docs/examples/iam-policy.json
Then update the controller image to docker.io/amazon/aws-alb-ingress-controller:v1.1.6

BTW, with the podReadiness probe on, you still need https://github.com/M00nF1sh/ReInvent2019CON310R/commit/cc4016c5e32cb221d1637abc0e4e45c49b245b7d, and you must set a podReadiness probe on your deployment manually (we will use a webhook to do this automatically in the future).

M00nF1sh on 23 Mar 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 21 Jun 2020

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 21 Jul 2020

+1, we need a fix for this.

deskera-ci on 21 Jul 2020

Hi, we are facing the same issue, and it happens randomly. Any help on how to fix this would be really great. Thanks

aksharj on 2 Aug 2020

👍2

Same here. We ensured that the application keep-alive timeout is longer than the ALB connection (idle) timeout.
However, it still happens randomly every 1 or 2 weeks.

rayway30419 on 27 Aug 2020
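
The keep-alive check mentioned above can be written down explicitly. If the backend closes an idle connection before the ALB does, the ALB may reuse a connection that the backend has already dropped, and the client sees a 502. The sketch below is only a sanity check of that one condition, assuming the ALB uses its default idle timeout of 60 seconds. The function name and the 5-second safety margin are hypothetical choices, not values taken from this thread.

```python
# AWS default idle timeout for an Application Load Balancer, in seconds.
ALB_DEFAULT_IDLE_TIMEOUT_S = 60.0


def keepalive_outlasts_alb(
    app_keepalive_s: float,
    alb_idle_timeout_s: float = ALB_DEFAULT_IDLE_TIMEOUT_S,
    margin_s: float = 5.0,  # assumed safety margin, tune to taste
) -> bool:
    """Return True if the application keeps idle connections open longer
    than the ALB does, with margin_s seconds to spare."""
    return app_keepalive_s >= alb_idle_timeout_s + margin_s
```

For example, `keepalive_outlasts_alb(5)` is False (5 seconds is the default keep-alive timeout of the Node.js HTTP server), while `keepalive_outlasts_alb(65)` is True. As the comment above shows, passing this check rules out only one cause; 502s can still come from other sources, such as targets being deregistered while requests are in flight.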

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 26 Sep 2020

@fejta-bot: Closing this issue.
