Aws-load-balancer-controller: Leader loss => targets removed and aded back

Created on 23 Jan 2020 · 12Comments · Source: kubernetes-sigs/aws-load-balancer-controller

Hi,

Creating an issue as this happened a second time today and we got logs!

On the 7th of January, we experienced a 10-ish minute outage in a production environment. Checking the logs we did not see anything odd, except a _Leader election loss_ for the ALB Ingress Controller.

When we checked CloudWatch we saw missing data for the ALB during the period we had the outage. We created Support Case 6730882761 which very nicely helped us debug. We were told:

[...] a “DeregisterTargets” API call made at 2020-01-07T21:24:17 UTC
[...] a “RegisterTargets” API call made at 2020-01-07T21:33:52 UTC

That was weird as we saw nothing in the logs about targets being de-registered. Lacking information, we decided to keep an eye and see if something like this happens again.

Today, the 23rd of January, it happened again: a 10-ish minute outage that recovered by itself. This time though we were ready and we investigated. I am posting here the initial results.

For background:

running on EKS 1.13, AMI amazon-eks-node-1.13-v20190906 (ami-08198f90fe8bc57f0)
running in us-east-1
using aws-alb-ingress-controller:v1.1.2 installed though helm, with 3 replicas

Logs from boring pod 1

AWS ALB Ingress controller
  Release:    v1.1.2
  Build:      git-cc1c5971
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0123 18:54:36.773777       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0123 18:55:06.774271       1 :0] kubebuilder/manager "msg"="Failed to get API Group-Resources" "error"="Get https://10.100.0.1:443/api?timeout=32s: dial tcp 10.100.0.1:443: i/o timeout"
F0123 18:55:06.774296       1 main.go:79] Get https://10.100.0.1:443/api?timeout=32s: dial tcp 10.100.0.1:443: i/o timeout



-------------------------------------------------------------------------------
AWS ALB Ingress controller
  Release:    v1.1.2
  Build:      git-cc1c5971
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0123 18:55:54.775199       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0123 18:56:03.467633       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0123 18:56:03.467822       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0123 18:56:03.467863       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0123 18:56:03.467979       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0123 18:56:03.468000       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0123 18:56:03.468099       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0123 18:56:03.468272       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"daemonEndpoints":{"kubeletEndpoint":{"Port":0}},"nodeInfo":{"machineID":"","systemUUID":"","bootID":"","kernelVersion":"","osImage":"","containerRuntimeVersion":"","kubeletVersion":"","kubeProxyVersion":"","operatingSystem":"","architecture":""}}}}
I0123 18:56:03.468440       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/ingress-controller-leader-alb...
E0123 19:11:44.979756       1 leaderelection.go:270] error retrieving resource lock kube-system/ingress-controller-leader-alb: Get https://10.100.0.1:443/api/v1/namespaces/kube-system/configmaps/ingress-controller-leader-alb: read tcp 172.44.43.232:45844->10.100.0.1:443: read: connection timed outs

Logs from boring pod 2

AWS ALB Ingress controller
  Release:    v1.1.2
  Build:      git-cc1c5971
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0123 18:50:13.951048       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0123 18:50:23.952290       1 :0] kubebuilder/manager "msg"="Failed to get API Group-Resources" "error"="Get https://10.100.0.1:443/api?timeout=32s: net/http: TLS handshake timeout"
F0123 18:50:23.952320       1 main.go:79] Get https://10.100.0.1:443/api?timeout=32s: net/http: TLS handshake timeout



-------------------------------------------------------------------------------
AWS ALB Ingress controller
  Release:    v1.1.2
  Build:      git-cc1c5971
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0123 18:50:40.772834       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0123 18:50:50.662068       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0123 18:50:50.662235       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0123 18:50:50.662279       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0123 18:50:50.662396       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0123 18:50:50.662422       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0123 18:50:50.662509       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0123 18:50:50.662696       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"daemonEndpoints":{"kubeletEndpoint":{"Port":0}},"nodeInfo":{"machineID":"","systemUUID":"","bootID":"","kernelVersion":"","osImage":"","containerRuntimeVersion":"","kubeletVersion":"","kubeProxyVersion":"","operatingSystem":"","architecture":""}}}}
I0123 18:50:50.662878       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/ingress-controller-leader-alb...
E0123 19:11:49.075758       1 leaderelection.go:270] error retrieving resource lock kube-system/ingress-controller-leader-alb: Get https://10.100.0.1:443/api/v1/namespaces/kube-system/configmaps/ingress-controller-leader-alb: read tcp 172.44.13.139:52604->10.100.0.1:443: read: connection timed out

Logs from interesting pod 3

AWS ALB Ingress controller
  Release:    v1.1.2
  Build:      git-cc1c5971
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0123 18:56:15.737657       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0123 18:56:45.738240       1 :0] kubebuilder/manager "msg"="Failed to get API Group-Resources" "error"="Get https://10.100.0.1:443/api?timeout=32s: dial tcp 10.100.0.1:443: i/o timeout"
F0123 18:56:45.738266       1 main.go:79] Get https://10.100.0.1:443/api?timeout=32s: dial tcp 10.100.0.1:443: i/o timeout



-------------------------------------------------------------------------------
AWS ALB Ingress controller
  Release:    v1.1.2
  Build:      git-cc1c5971
  Repository: https://github.com/kubernetes-sigs/aws-alb-ingress-controller.git
-------------------------------------------------------------------------------

W0123 18:57:35.765017       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0123 18:57:35.809159       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0123 18:57:35.809309       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0123 18:57:35.809358       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0123 18:57:35.809470       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"loadBalancer":{}}}}
I0123 18:57:35.809496       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"=
I0123 18:57:35.809575       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I0123 18:57:35.809748       1 :0] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="alb-ingress-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{},"status":{"daemonEndpoints":{"kubeletEndpoint":{"Port":0}},"nodeInfo":{"machineID":"","systemUUID":"","bootID":"","kernelVersion":"","osImage":"","containerRuntimeVersion":"","kubeletVersion":"","kubeProxyVersion":"","operatingSystem":"","architecture":""}}}}
I0123 18:57:35.809917       1 leaderelection.go:205] attempting to acquire leader lease  kube-system/ingress-controller-leader-alb...
I0123 18:57:54.171652       1 leaderelection.go:214] successfully acquired lease kube-system/ingress-controller-leader-alb
I0123 18:57:54.271866       1 :0] kubebuilder/controller "level"=0 "msg"="Starting Controller"  "controller"="alb-ingress-controller"
I0123 18:57:54.371990       1 :0] kubebuilder/controller "level"=0 "msg"="Starting workers"  "controller"="alb-ingress-controller" "worker count"=1
...
I0123 18:58:57.493486       1 targets.go:95] default/my-service: Removing targets from arn:aws:elasticloadbalancing:us-east-1:000000000000:targetgroup/b0000004-75000000000000000e8/8800000000000284: 172.44.178.120:8081, 172.44.148.36:8081
I0123 18:58:57.602316       1 listener.go:283] default/my-service: Domain name '*.a.b.c.d.example.com', matches TLS host '*.a.b.c.d.example.com', adding to Listener
I0123 18:58:57.602333       1 listener.go:283] default/my-service: Domain name '*.a.b.c.d.example.com', matches TLS host 'my-service.a.b.c.d.example.com', adding to Listener
I0123 18:58:57.602358       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 18:58:57.602369       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 18:58:57.602377       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 18:58:57.602387       1 listener.go:236] default/my-service: Auto-detected and added 4 certificates to listener
...
I0123 19:04:14.298490       1 targets.go:80] default/my-service: Adding targets to arn:aws:elasticloadbalancing:us-east-1:000000000000:targetgroup/b0000004-75000000000000000e8/8800000000000284: 172.44.148.36:8081, 172.44.178.120:8081
I0123 19:04:14.513851       1 listener.go:283] default/my-service: Domain name '*.a.b.c.d.example.com', matches TLS host '*.a.b.c.d.example.com', adding to Listener
I0123 19:04:14.513869       1 listener.go:283] default/my-service: Domain name '*.a.b.c.d.example.com', matches TLS host 'my-service.a.b.c.d.example.com', adding to Listener
I0123 19:04:14.513893       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:04:14.513907       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:04:14.513915       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:04:14.513926       1 listener.go:236] default/my-service: Auto-detected and added 4 certificates to listener
... and then back to normal:
I0123 19:10:23.892480       1 listener.go:283] default/my-service: Domain name '*.a.b.c.d.example.com', matches TLS host '*.a.b.c.d.example.com', adding to Listener
I0123 19:10:23.892498       1 listener.go:283] default/my-service: Domain name '*.a.b.c.d.example.com', matches TLS host 'my-service.a.b.c.d.example.com', adding to Listener
I0123 19:10:23.892520       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:10:23.892531       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:10:23.892537       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:10:23.892547       1 listener.go:236] default/my-service: Auto-detected and added 4 certificates to listener
...
I0123 19:17:48.708063       1 listener.go:283] default/my-service: Domain name '*.a.b.c.d.example.com', matches TLS host '*.a.b.c.d.example.com', adding to Listener
I0123 19:17:48.708083       1 listener.go:283] default/my-service: Domain name '*.a.b.c.d.example.com', matches TLS host 'my-service.a.b.c.d.example.com', adding to Listener
I0123 19:17:48.708115       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:17:48.708134       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:17:48.708145       1 listener.go:283] default/my-service: Domain name '*.example.com', matches TLS host 'demo.example.com', adding to Listener
I0123 19:17:48.708164       1 listener.go:236] default/my-service: Auto-detected and added 4 certificates to listener
and so on

So, looking at the logs for pod 3 we see that it:

successfully acquires the leader role
deletes all targets
adds the targets back 5 minutes later

It seems that when the ALB Ingress Controller has a leader loss it successfully comes back but for some reason deletes all the targets.

Is this a bug? Are we missing something? We have been running this exact config for months and we only now started seeing this issue.

I should mention that we're also seeing a ton of restarts either from https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1072 or from:

I0123  1 leaderelection.go:249] failed to renew lease kube-system/ingress-controller-leader-alb: failed to tryAcquireOrRenew context deadline exceeded
F0123  1 main.go:110] leader election lost

Edit: Interestingly enough, not all the Ingresses were affected. Just some.

Source

Vlaaaaaaad

Most helpful comment

@shankarsundaram you should create a support ticket to EKS to address the availability issue(mostly a proactive scale up).

M00nF1sh on 20 Feb 2020

👍2

All 12 comments

I'll try to reproduce this, i guess it might due to some cache sync issue or APIServer issue(endpoints object gone?)
BTW the lose-leader is definitely something wrong with APIServer. If you are using EKS, please cut an customer support ticket and request scale up.

M00nF1sh on 24 Jan 2020

please cut an customer support ticket and request scale up.

Scale up of the cluster or escalation of the ticket? I think you mean the second one, so 6768495881 ticket created. We do have rather spiky usage on that cluster, but we don't see any constant errors or issues.

The initial incident happened in another account on a different cluster and as per my original post, 6730882761 has all the details we have about that incident.

Thank you for looking into this!

Vlaaaaaaad on 24 Jan 2020

Issue happened on another cluster today 😞 Same symptoms, same logs, same everything.

Vlaaaaaaad on 30 Jan 2020

some updates, I cannot reproduce, suspect something wrong with the cluster itself

M00nF1sh on 19 Feb 2020

confirmed its not ALB's ingress controller issues.
The pods are removed from endpoints by kube-controller-manager due to not ready.

{"addresses":[{"ip":"172.44.118.201","nodeName":"xxxxxxxx","targetRef":{"kind":"Pod","namespace":"default","name":"xxxxxx","uid":"47e22bd3-226a-11ea-b3e0-120aafec3483","resourceVersion":"109055499"}}],"notReadyAddresses":[{"ip":"172.44.148.36","nodeName":"xxxxxx","targetRef":{"kind":"Pod","namespace":"default","name":"xxxxxx","uid":"6e2d948a-226a-11ea-b3e0-120aafec3483","resourceVersion":"109053882"}},{"ip":"172.44.178.120","nodeName":"xxxxxxxx","targetRef":{"kind":"Pod","namespace":"default","name":"xxxxxxxxxx","uid":"276cebfe-226a-11ea-b3e0-120aafec3483","resourceVersion":"109053886"}}],"ports":[{"name":"http","port":8081,"protocol":"TCP"}]}

M00nF1sh on 19 Feb 2020

checked more logs, it's the node(ip-172-44-163-23.ec2.internal) cannot report heartbeat to APIServer(due to availability issue at that time due to spike loads), so KCM marked all pods on that node as not ready, thus picked up by controller and removed these endpoints.
Will community via the internal ticket for the control-plane availability issue and resolve this issue(since it's not ALBIngressController's issue)

M00nF1sh on 19 Feb 2020

👍1

Wow, that is an unexpected end to that ticket. Thank you so much for this!

Vlaaaaaaad on 19 Feb 2020

🤣

M00nF1sh on 20 Feb 2020

This issue happened to us today
F0212 07:13:21.161769 1 main.go:110] leader election lost
I0212 07:13:21.161533 1 leaderelection.go:249] failed to renew lease default/ingress-controller-leader-alb: failed to tryAcquireOrRenew context deadline exceeded
W0212 07:13:21.578656 1 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
EKS version: 1.13
Instance type: m5.2x large
ALB Ingress version: v1.1.2

Any workaround (or) fix to resolve this issue?