Hi! First of all, I appreciate the community and all their work on this project, it is very helpful and a good solution to route directly to pods from an ALB.
However, during testing I've noticed intermittent 502s/503s during deploys of our StatefulSet. My current hypothesis is that during a deploy, the StatefulSet controller kills a pod that needs updating, and there is a delay between that happening and the ALB ingress controller setting the ALB target to draining. During this delay, requests are still sent to the terminating pod and return 502 (from our nginx sidecar) and/or 503 (from the AWS ALB).
Has anyone else seen this problem, and is there a known solution? If this is in fact what is happening, ideally we'd remove the pod from the ALB target group before killing it.
I have the following Service and Ingress:
kind: Service
apiVersion: v1
metadata:
  name: svc-headless
  namespace: dev
spec:
  clusterIP: None
  selector:
    app: svc
  ports:
  - name: http
    port: 9000
Ingress
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: svc-external
  namespace: dev
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/security-groups: sg-xxxxxxxxxx,sg-yyyyyyyyyyy
    # Annotation values must be strings, so numeric values are quoted.
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '5'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '3'
    alb.ingress.kubernetes.io/success-codes: 200,201,401
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:XXXXXXXXXX:certificate/uuid
    alb.ingress.kubernetes.io/subnets: subnet-aaaaa,subnet-bbbbb,subnet-cccc
  labels:
    app: svc
spec:
  rules:
  - http:
      paths:
      - path: /*
        backend:
          serviceName: ssl-redirect
          servicePort: use-annotation
      - path: /*
        backend:
          serviceName: svc-headless
          servicePort: 9000
Hi,
This is indeed what happened.
The best way to work around this for now is to use a NodePort service with instance mode for your ingress. (You can create a separate NodePort service alongside your headless service.)
A more robust way might be to support this with a ReadinessGate and dynamic admission controllers. I haven't thought deeply about this yet, but I'll do some prototyping to see whether it works 😄
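For readers following along, the NodePort workaround described above might look roughly like the sketch below. The Service name svc-nodeport is hypothetical; the selector, port, and namespace are copied from the original post, and only the changed parts of the Ingress are shown.

```yaml
# Hypothetical NodePort Service created alongside the existing headless one.
apiVersion: v1
kind: Service
metadata:
  name: svc-nodeport        # assumed name, not from the thread
  namespace: dev
spec:
  type: NodePort
  selector:
    app: svc
  ports:
  - name: http
    port: 9000
---
# Ingress fragment: switch the target type and point the backend at the
# NodePort Service. All other annotations stay as in the original post.
metadata:
  annotations:
    alb.ingress.kubernetes.io/target-type: instance
spec:
  rules:
  - http:
      paths:
      - path: /*
        backend:
          serviceName: svc-nodeport
          servicePort: 9000
```

In instance mode the ALB targets are nodes rather than pod IPs, so traffic reaches pods through kube-proxy instead of going to them directly.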
Hi @M00nF1sh, thanks for the response.
That would work, but it gets us back to the exact problem I'm trying to solve. We have a large number of instances in various node groups, which quickly balloons the number of instances attached to the target group. The pods we'd like to direct traffic to belong to a small instance group, so this would work if we could select those EC2 instances (k8s nodes) directly. Is there a way to filter or limit which cluster nodes get attached (via Kubernetes node label, EC2 tag, or otherwise)?
@justinwalz
It's not supported for now. I can make a change to support alpha.service-controller.kubernetes.io/exclude-balancer, but that will require you to label every node you don't want attached. Would that be acceptable?
@M00nF1sh That would work, we can add a node label to exclude a fleet of instances for specific service ALBs.
Would it be possible to also have the inverse, maybe alpha.service-controller.kubernetes.io/include-balancer and do a union of the matching nodes between the whitelist and blacklist?
@justinwalz It's possible (and makes sense to me) to have the inverse, but I'd rather not add it since it's not in k8s core. Supporting only exclude-balancer lets us remain more compatible with k8s core.
Got it - no problem. Thanks for the help on this!
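As an illustration of the exclude-balancer proposal discussed above, a node that should stay out of the target group would carry the label as sketched below. The node name is hypothetical, the value "true" is an assumption, and whether a given controller version honours the label is not confirmed in this thread.

```yaml
# Node metadata fragment showing the label in place.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.ec2.internal   # hypothetical node name
  labels:
    # Label key taken from the thread; the value is an assumption.
    alpha.service-controller.kubernetes.io/exclude-balancer: "true"
```

In practice the label would normally be applied through node group configuration or kubectl label rather than by editing Node objects by hand.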
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten
I've faced a similar issue running aws-alb-controller:v1.1.3 in ip mode, but I found it hard to switch to the instance mode + tagged instances approach due to limitations of our current setup. Please advise: is there an easy way to deal with this lag between a pod being killed and being deregistered from the load balancer?
I am facing a similar issue. My Kubernetes services scale up when the number of requests per second reaches a certain value, but I sometimes get random 502 errors during peak times.
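One commonly used mitigation for this kind of lag, not proposed in this thread, is a preStop hook that delays container shutdown so the target has time to be deregistered before the pod stops serving. A minimal sketch, assuming the image ships a sleep binary; the container name, image, and durations are hypothetical and would need tuning against the target group's deregistration delay.

```yaml
# Pod template fragment (goes under spec.template in a Deployment/StatefulSet).
spec:
  # The grace period includes the time spent in preStop, so it must be
  # longer than the sleep plus the application's own shutdown time.
  terminationGracePeriodSeconds: 60
  containers:
  - name: app                    # hypothetical
    image: example/app:latest    # hypothetical
    lifecycle:
      preStop:
        exec:
          # The container only receives SIGTERM after this command returns,
          # so it keeps serving while the target is being drained.
          command: ["sleep", "30"]
```

This narrows the window rather than closing it: if deregistration takes longer than the sleep, requests can still hit a terminating pod.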
# metadata, selector, and container name/image were omitted in the original post
apiVersion: extensions/v1beta1
kind: Deployment
spec:
  replicas: 2
  minReadySeconds: 50
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 50%
  template:
    spec:
      containers:
      - resources:
          requests:
            cpu: 1900m
            memory: 2500Mi
          limits:
            cpu: 1900m
            memory: 2500Mi
        envFrom:
        - secretRef:
            name: kube-auth-api
        readinessProbe:
          httpGet:
            path: /status
            port: 3001
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 15
        livenessProbe:
          httpGet:
            path: /status
            port: 3001
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 15
      imagePullSecrets:
      - name: awsecr-cred
I get random 502 errors even when all the containers are healthy and are not even restarting.
We're getting this every single deploy. What are the workarounds available?
We have a service like:
apiVersion: v1
kind: Service
metadata:
  name: fortio
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  type: NodePort
  selector:
    app: fortio
@douglaz See this thread, which covers the same issue with a couple of solutions: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1064
tl;dr:
Add --feature-gates=waf=false to the alb-ingress-controller container args. Right now the controller makes WAF requests on every deploy, and AWS throttling of these requests can delay target updates. If you're not using WAF, skipping it entirely prevents these delays.
@jorihardman Could you check whether the pod readiness gates feature I added solves the problem? You would need to build a custom Docker image from master since it's not released yet:
https://github.com/kubernetes-sigs/aws-alb-ingress-controller/blob/master/docs/guide/ingress/pod-conditions.md
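The --feature-gates=waf=false suggestion above amounts to one extra argument on the controller container. A sketch of where it would go, assuming a typical controller Deployment; the image placeholder and cluster name are hypothetical, and the other args are only there for context.

```yaml
# Fragment of the alb-ingress-controller Deployment (pod template only).
spec:
  template:
    spec:
      containers:
      - name: alb-ingress-controller
        image: <your-controller-image>   # placeholder, not from the thread
        args:
        - --ingress-class=alb
        - --cluster-name=my-cluster      # hypothetical cluster name
        # Skip WAF API calls during reconciliation, as suggested above.
        # Only appropriate if WAF is not attached to the ALBs.
        - --feature-gates=waf=false
```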
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.