Aws-load-balancer-controller: getting 502 Bad Gateway on eks aws-alb-ingress

Created on 22 Jul 2019 · 28 Comments · Source: kubernetes-sigs/aws-load-balancer-controller

getting 502 Bad Gateway on eks aws-alb-ingress

lifecycle/rotten

All 28 comments

kind: Service
apiVersion: v1
metadata:
  labels:
    app: xxxx-cam-sdk
    env: dev
  name: xxxx-cam-sdk-service
  namespace: xxxx-cam-sdk
spec:
  type: NodePort
  ports:
    - port: 443
      targetPort: 8000
  selector:
    app: xxxx-cam-sdk
    env: dev

---

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "xxxx-cam-sdk-ingress"
  namespace: "xxxx-cam-sdk"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/subnets: subnet-0ffd85d9d967xxxx, subnet-0fcb01b837xxxxx
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/backend-protocol: HTTPS
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-1:67xxxxxx:certificate/xxxxx82-fc71-40ff-b625-cc3d6585aad2
  labels:
    app: xxxx-cam-sdk
    env: dev
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: "xxxx-cam-sdk-service"
              servicePort: 443

Can you please review the Service and Ingress YAML above and let me know what is causing the 502 Bad Gateway error?

same issue

I'm having the same issue here. Below is my ingress.yaml:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "2048-ingress"
  namespace: "2048-game"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/backend-protocol: HTTPS
    # ACM certificate ARN for your SSL domain
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:xxxx:certificate/xxxx
  labels:
    app: 2048-ingress
spec:
  rules:
    - host: 2048.cpaface.biz
      http:
        paths:
          - path: /*
            backend:
              serviceName: "service-2048"
              servicePort: 80

The host "2048.cpaface.biz" resolves to its correct ELB DNS name:

nslookup 2048.cpaface.biz
Server:     10.1.1.1
Address:    10.1.1.1#53

Non-authoritative answer:
2048.cpaface.biz    canonical name = ffb06d37-2048game-2048ingr-6fa0-879326269.us-east-1.elb.amazonaws.com.
Name:   ffb06d37-2048game-2048ingr-6fa0-879326269.us-east-1.elb.amazonaws.com
Address: 54.227.159.166
Name:   ffb06d37-2048game-2048ingr-6fa0-879326269.us-east-1.elb.amazonaws.com
Address: 3.209.166.62

The ALB rules look good to me as well.

But I still get 502 Bad Gateway:

curl -k -v  https://2048.cpaface.biz
* Rebuilt URL to: https://2048.cpaface.biz/
*   Trying 54.227.159.166...
* TCP_NODELAY set
* Connected to 2048.cpaface.biz (54.227.159.166) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=cpaface.biz
*  start date: Aug 15 00:00:00 2019 GMT
*  expire date: Sep 15 12:00:00 2020 GMT
*  issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fa49600c600)
> GET / HTTP/2
> Host: 2048.cpaface.biz
> User-Agent: curl/7.54.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 502
< server: awselb/2.0
< date: Wed, 11 Sep 2019 12:37:10 GMT
< content-type: text/html
< content-length: 138
<
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>
* Connection #0 to host 2048.cpaface.biz left intact
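
One thing worth checking in the Ingress above: it sets backend-protocol to HTTPS while routing to servicePort 80. That annotation makes the ALB open a TLS connection to the targets, and an ALB answers 502 when the TLS handshake with a target fails. If the container behind service-2048 serves plain HTTP on port 80 (an assumption, the thread does not confirm it), a sketch of the adjusted Ingress would keep TLS at the listener and use HTTP towards the pods:

```yaml
# Sketch only: assumes the container behind service-2048 serves plain HTTP on port 80.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "2048-ingress"
  namespace: "2048-game"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    # TLS still terminates at the ALB listener with the ACM certificate.
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:xxxx:certificate/xxxx
    # ALB-to-target traffic is plain HTTP, matching what the pod is assumed to serve.
    alb.ingress.kubernetes.io/backend-protocol: HTTP
  labels:
    app: 2048-ingress
spec:
  rules:
    - host: 2048.cpaface.biz
      http:
        paths:
          - path: /*
            backend:
              serviceName: "service-2048"
              servicePort: 80
```

If the container really does serve HTTPS, this change would not help, and the target group health checks in the AWS console are the next place to look.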

Same here. Is there any progress on this issue? Thanks

Is there a workaround for this problem?

Same issue, even with other types of load balancers (Classic, NLB, ALB).
I will update if I find a resolution.

A 502 means that whatever sits behind the load balancer is returning an abnormal response. Are you able to call your back end from inside its container with a simple curl?

In my case, my container was not set up properly to receive requests.

Yes, I am able to call it using the container's IP.

Are you able to call it through the service definition being used by the ingress?
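
One way to test the path through the Service, rather than the pod IP, is a one-off pod that curls the Service's cluster DNS name. The sketch below reuses the service-2048 / 2048-game names from the Ingress posted earlier in the thread. The pod name, the curlimages/curl image, and the default cluster.local domain are assumptions.

```yaml
# Sketch only: a one-off pod that requests the Service from inside the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: service-curl-check        # hypothetical name
  namespace: "2048-game"
spec:
  restartPolicy: Never            # run once, keep the result in the pod logs
  containers:
    - name: curl
      image: curlimages/curl      # assumption: any image that ships curl would do
      command:
        - curl
        - "-sv"
        # <service>.<namespace>.svc.<cluster domain>:<service port>
        - "http://service-2048.2048-game.svc.cluster.local:80/"
```

After the pod completes, `kubectl logs service-curl-check -n 2048-game` shows the response. A failure here points at the Service selector or port mapping rather than at the ALB.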

Same issue. A small number of 502s occur.

I am facing a similar issue. My Kubernetes services scale up when the number of requests per second reaches a certain value, but I sometimes get random 502 errors during peak times.

apiVersion: extensions/v1beta1
kind: Deployment
spec:
  replicas: 2
  minReadySeconds: 50
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 50%
  template:
    spec:
      containers:
        # container name and image were omitted in the original post
        - resources:
            requests:
              cpu: 1900m
              memory: 2500Mi
            limits:
              cpu: 1900m
              memory: 2500Mi
          envFrom:
            - secretRef:
                name: kube-auth-api
          readinessProbe:
            httpGet:
              path: /status
              port: 3001
            initialDelaySeconds: 60
            periodSeconds: 15
            timeoutSeconds: 15
          livenessProbe:
            httpGet:
              path: /status
              port: 3001
            initialDelaySeconds: 60
            periodSeconds: 15
            timeoutSeconds: 15
      imagePullSecrets:
        - name: awsecr-cred

I get random 502 errors even when all the containers are healthy and are not restarting.

Can you show the logs of the container during the 502 period?

Will try to reproduce this and get back with the logs soon.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

/remove-lifecycle rotten

Any update? I'm facing the same issue. On DigitalOcean, a simple readiness probe implementation worked fine, but on AWS it doesn't.

@Serrvosky readiness probe is merged into master, we'll do a release this week

@M00nF1sh I checked today and saw that a new release came out. Am I right? How can I check whether this problem was fixed? Do I have to update my AWS k8s cluster?

Thanks

@Serrvosky
Yes, you need to upgrade to v1.1.6:

  1. First, update the IAM permissions: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/blob/v1.1.6/docs/examples/iam-policy.json
  2. Then update the controller image to docker.io/amazon/aws-alb-ingress-controller:v1.1.6

BTW, with the pod readiness probe on, you still need https://github.com/M00nF1sh/ReInvent2019CON310R/commit/cc4016c5e32cb221d1637abc0e4e45c49b245b7d, and you must set a pod readiness probe on your deployment manually (we will use a webhook to do this automatically in the future).
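
For readers wondering what setting it manually looks like: the controller documentation calls this a pod readiness gate, declared under readinessGates in the pod spec. The sketch below is an illustration, not the content of the linked commit. It assumes the v1.1.x naming pattern target-health.alb.ingress.k8s.aws/<ingress name>_<service name>_<service port> (check the docs for your controller version), reuses the 2048-ingress / service-2048 names from earlier in the thread, and uses a hypothetical deployment name and placeholder image.

```yaml
# Sketch only: deployment name, labels, and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-2048
  namespace: "2048-game"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: "2048"
  template:
    metadata:
      labels:
        app: "2048"
    spec:
      readinessGates:
        # Assumed v1.1.x pattern: <ingress name>_<service name>_<service port>.
        # The pod only becomes Ready once the controller sets this condition.
        - conditionType: target-health.alb.ingress.k8s.aws/2048-ingress_service-2048_80
      containers:
        - name: app
          image: example.invalid/2048:latest   # placeholder image
          ports:
            - containerPort: 80
```

The gate is only meaningful when pods themselves are registered as ALB targets (target-type: ip). In instance mode the targets are nodes, so there is no per-pod target health to report.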

+1, we need a fix for this.

Hi, we are facing the same issue, and it happens randomly. Any help on how to fix this would be really great. Thanks.

Same here. We ensured that the application keep-alive timeout is longer than the ALB connection idle timeout.
However, it still happens randomly every one or two weeks.
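
For context on the keep-alive point: if the application closes an idle keep-alive connection before the ALB does, the ALB can send a request on a connection the target has already closed and answer 502. A minimal sketch of pinning the ALB side through the Ingress, assuming the controller's load-balancer-attributes annotation is available in your version:

```yaml
# Sketch only: makes the ALB idle timeout explicit so the application's
# keep-alive timeout can be configured to a larger value.
metadata:
  annotations:
    # 60 seconds is the ALB default; stating it keeps the value visible in the manifest.
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=60
```

The application-side keep-alive setting depends on the web server or framework in use and is not shown here. As the comment above notes, getting the two timeouts in the right order did not eliminate the errors for everyone.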

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.
