I am running Kubernetes 1.12.5 with etcd3 and cluster-autoscaler v1.2.2 (on AWS). The cluster is healthy and everything is operational. After some scaling activity, cluster-autoscaler goes into a crash loop with the following error:
I0205 23:32:52.241463 1 leaderelection.go:249] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F0205 23:32:52.241542 1 main.go:384] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0xc000022100, 0xc000574000, 0x37, 0xee)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:828 +0xd4
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).output(0x4333560, 0xc000000003, 0xc00056e000, 0x429c819, 0x7, 0x180, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:779 +0x306
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).printf(0x4333560, 0x3, 0x26f2036, 0xb, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:678 +0x14b
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(0x26f2036, 0xb, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1207 +0x67
main.main.func3()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:384 +0x47
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000668000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:163 +0x40
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000668000, 0x29c4b00, 0xc000591dc0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:172 +0x112
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x29c4b40, 0xc000046040, 0x29cbd20, 0xc0001e6a20, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00001f030, 0x27baac0, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:184 +0x99
main.main()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:372 +0x5cf
I0205 23:32:52.241724 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"e78ccdca-2440-11e9-8514-0a1153ba0cc4", APIVersion:"v1", ResourceVersion:"6949892", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-57f79874cf-c45xb stopped leading
I0205 23:32:52.745013 1 auto_scaling_groups.go:124] Registering ASG XXXX
Everything in the cluster seems to work perfectly fine, and the masters, the cluster and etcd are all healthy.
Is there any way to recover from or resolve this issue?
This suggests that either CA has a problem reaching the apiserver or the apiserver is unhealthy. Can you check whether the same happens to other system components (e.g. kube-controller-manager)? They use the same generic Kubernetes leader election library that CA uses. Usually when I see this problem it's because of an overloaded apiserver, and it impacts multiple controllers.
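The RunOrDie frame in the stack trace above carries the leader election timings as raw hexadecimal words. The following is a minimal sketch for reading them, assuming that 0x37e11d600, 0x2540be400 and 0x77359400 are Go time.Duration values (nanosecond counts) and that they appear in the order lease duration, renew deadline, retry period. That order is an assumption based on the field order of client-go's LeaderElectionConfig; the names in the code are illustrative.

```python
# Raw argument words copied from the RunOrDie frame of the stack trace.
# Assumption: they are Go time.Duration values, i.e. nanosecond counts,
# in the order lease duration, renew deadline, retry period.
RAW_WORDS = {
    "lease_duration": 0x37E11D600,
    "renew_deadline": 0x2540BE400,
    "retry_period": 0x77359400,
}

NANOSECONDS_PER_SECOND = 1_000_000_000


def decode_seconds(raw_words):
    """Convert nanosecond counts to whole seconds."""
    return {name: ns // NANOSECONDS_PER_SECOND for name, ns in raw_words.items()}


timings = decode_seconds(RAW_WORDS)
for name, seconds in timings.items():
    print(f"{name}: {seconds}s")
```

If that reading is right, the autoscaler has about 10 seconds to renew its lease before it gives up leadership and exits, so an apiserver that answers this one client slowly for that long would be enough to produce the "lost master" crash.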
@MaciekPytel That's what I thought, but the rest of the cluster, including all kube-system components, works fine. None of them has restarted.
To rule out version skew as the cause (Kubernetes 1.12.5 and Cluster Autoscaler 1.2.2), can you please try a newer version of the autoscaler? See the recommended versions.
Ah OK, will try 1.12. Thanks for letting me know, @aleksandra-malinowska. I will try out the new version.
@aleksandra-malinowska That did not help; I observed the same crash-loop behaviour. Interestingly, the problem surfaced only when the number of nodes the autoscaler managed was about 200 or more. Every time I brought the number of nodes down from anywhere in the 200 to 1k range to 150 or fewer, the autoscaler recovered and functioned properly. The rest of the kube-system components remained functional throughout.
Does this help in identifying where the bottleneck is? I can confirm that I have run various autoscaler versions, from 1.12.X to 1.13.1, and see the same behaviour. The autoscaler goes into a crash frenzy when the number of nodes exceeds roughly 200 and recovers when it comes back down.
@suneeta-mall can you provide logs with strace for CA 1.12.*? That would make it easier to find the place. Can you also provide the deployment script? It would help to understand which options were enabled, whether you have memory limits, and so on. It is also useful to have the full logs.
Possible problem: too many queries, which kube-apiserver with etcd could not handle. You can monitor the logs, CPU and memory of etcd and the apiserver.
@miry Yeah, sure, I will work on getting the logs. Here's the deployment script:
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
rules:
  -
    apiGroups:
      - ""
    resources:
      - events
      - endpoints
    verbs:
      - create
      - patch
  -
    apiGroups:
      - ""
    resources:
      - pods/eviction
    verbs:
      - create
  -
    apiGroups:
      - ""
    resources:
      - pods/status
    verbs:
      - update
  -
    apiGroups:
      - ""
    resourceNames:
      - "cluster-autoscaler"
    resources:
      - endpoints
    verbs:
      - get
      - update
  -
    apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - watch
      - list
      - get
      - update
  -
    apiGroups:
      - ""
    resources:
      - pods
      - services
      - replicationcontrollers
      - persistentvolumeclaims
      - persistentvolumes
    verbs:
      - watch
      - list
      - get
  -
    apiGroups:
      - extensions
    resources:
      - replicasets
      - daemonsets
    verbs:
      - watch
      - list
      - get
  -
    apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - watch
      - list
  -
    apiGroups:
      - apps
    resources:
      - statefulsets
    verbs:
      - watch
      - list
      - get
  -
    apiGroups:
      - storage.k8s.io
    resources:
      - storageclasses
    verbs:
      - watch
      - list
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
rules:
  -
    apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - create
  -
    apiGroups:
      - ""
    resourceNames:
      - "cluster-autoscaler-status"
    resources:
      - configmaps
    verbs:
      - delete
      - get
      - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: "cluster-autoscaler"
subjects:
  -
    kind: ServiceAccount
    name: "cluster-autoscaler"
    namespace: "kube-system"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: "cluster-autoscaler"
subjects:
  -
    kind: ServiceAccount
    name: "cluster-autoscaler"
    namespace: "kube-system"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "cluster-autoscaler"
  template:
    metadata:
      annotations:
        ad.datadoghq.com/nginx.logs: "[{\"source\":\"autoscaler\",\"service\":\"autoscaler\"}]"
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
        scheduler.alpha.kubernetes.io/tolerations: "[{\"key\":\"dedicated\", \"value\":\"master\"}]"
      labels:
        app: "cluster-autoscaler"
        "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    spec:
      containers:
        -
          command:
            - "./cluster-autoscaler"
            - "--v=4"
            - "--stderrthreshold=info"
            - "--cloud-provider=aws"
            - "--skip-nodes-with-system-pods=false"
            - "--skip-nodes-with-local-storage=false"
            - "--expander=most-pods"
            - "--ignore-daemonsets-utilization=true"
            - "--ignore-mirror-pods-utilization=true"
            - "--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/mycluster.com"
          env:
            -
              name: AWS_REGION
              value: "ap-southeast-2"
          image: "k8s.gcr.io/cluster-autoscaler:v1.13.1"
          imagePullPolicy: Always
          livenessProbe:
            httpGet:
              path: "/health-check"
              port: 8085
          name: "cluster-autoscaler"
          readinessProbe:
            httpGet:
              path: "/health-check"
              port: 8085
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          volumeMounts:
            -
              mountPath: "/etc/ssl/certs/ca-certificates.crt"
              name: "ssl-certs"
              readOnly: true
      dnsPolicy: Default
      nodeSelector:
        kubernetes.io/role: master
      serviceAccountName: "cluster-autoscaler"
      tolerations:
        -
          effect: NoSchedule
          key: "node-role.kubernetes.io/master"
      volumes:
        -
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
          name: "ssl-certs"
Are there any instructions for getting logs with strace when the issue ends in a crash? I assume you mean wrapping the autoscaler command with strace and sending the logs. Is that enough, or are you after any more specific details?
As for possible problems, yes, I agree it's certainly possible that the apiserver is getting too many queries, but all other cluster resources, including kube-system resources and my own workload, seem to chug along okay. To my knowledge it's only the autoscaler that fails. Is it possible that the autoscaler is making too many calls and getting rate-limited? I have not seen much in the logs to indicate that, but I will keep an eye on it and update with what I find.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
@suneeta-mall sorry for the delay in responding here.
Could you include the logs for the failed deploys, i.e., kubectl logs cluster-autoscaler-pod -n kube-system -p, please?
/remove-lifecycle rotten
@alejandrox1 The log is already attached in the description (see "lost master"). The kube masters and all other kube components seem to function fine; only the autoscaler fails.
@suneeta-mall how did you create the cluster? Would you happen to have a copy of the code somewhere?
@alejandrox1 It was created with kops on AWS. Is there anything specific you are looking for? A very basic version can be created with the following snippet, which is the foundation of the k8s setup used in this case. The etcd version is 3.X.
kops create cluster ${NAME} \
  --cloud aws \
  --master-zones ${ZONES} \
  --master-size m4.xlarge \
  --node-size m4.xlarge \
  --zones ${ZONES} \
  --topology public \
  --networking flannel \
  --kubernetes-version 1.12.8 \
  --dns-zone XXX \
  --encrypt-etcd-storage
I had a similar issue on my cluster (using EKS):
F0802 00:10:57.242174 1 main.go:384] lost master
I0802 00:10:57.242128 1 leaderelection.go:249] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
I0802 00:10:57.244543 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"1fc342a0-4b63-11e9-b984-02635bc9a4cc", APIVersion:"v1", ResourceVersion:"27196690", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-aws-cluster-autoscaler-59fbbcb794-7kzfv stopped leading
Then the pod died and restarted. It seems to be a hiccup, but I would like to know why it happened.
We're running into similar issues on a very "scaly" EKS cluster here (quite a bit of up-and-down activity during the day); our other, more stable clusters do not seem to run into the issue.
I've also noticed that this pod sometimes gets OOMKilled, so I'll try to add more memory first and will report back if it helped 👍
/remove-lifecycle stale
Happened for us as well:
Cluster: "v1.15.4"
Cloud: Azure
Autoscaler version: 1.15.2
I1123 18:51:25.870541 1 scale_down.go:771] No candidates for scale down
I1123 18:51:47.848093 1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F1123 18:51:47.848126 1 main.go:406] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0x4cb5f01, 0x3, 0xc000678000, 0x37)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:900 +0xb1
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).output(0x4cb5fa0, 0xc000000003, 0xc000477340, 0x4c19bb1, 0x7, 0x196, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:815 +0xe6
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).printf(0x4cb5fa0, 0x3, 0x2b62471, 0xb, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:727 +0x14e
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1309
main.main.func3()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:406 +0x5c
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc00026c7e0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:193 +0x40
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc00026c7e0, 0x2ff65e0, 0xc0001ca740)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:202 +0x10f
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x2ff6620, 0xc0000cc018, 0x3026ee0, 0xc0002ec280, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00040f3e0, 0x2c39cc8, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:214 +0x96
main.main()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:394 +0x6ec
goroutine 19 [syscall, 241 minutes]:
os/signal.signal_recv(0x0)
/usr/local/go/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:29 +0x41
goroutine 20 [chan receive]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).flushDaemon(0x4cb5fa0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1035 +0x8b
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.init.0
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:404 +0x6c
goroutine 50 [IO wait, 241 minutes]:
internal/poll.runtime_pollWait(0x7fc633d894f0, 0x72, 0x0)
/usr/local/go/src/runtime/netpoll.go:182 +0x56
internal/poll.(pollDesc).wait(0xc0004fa198, 0x72, 0x0, 0x0, 0x2b5d3c7)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x9b
internal/poll.(pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(FD).Accept(0xc0004fa180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:384 +0x1ba
net.(netFD).accept(0xc0004fa180, 0x28e75a0, 0x50, 0xc00038ef50)
/usr/local/go/src/net/fd_unix.go:238 +0x42
net.(TCPListener).accept(0xc0000d01f8, 0xc000070700, 0x7fc633dd9b28, 0xc0002a8000)
/usr/local/go/src/net/tcpsock_posix.go:139 +0x32
net.(TCPListener).AcceptTCP(0xc0000d01f8, 0x40dc28, 0x30, 0x28e75a0)
/usr/local/go/src/net/tcpsock.go:247 +0x48
net/http.tcpKeepAliveListener.Accept(0xc0000d01f8, 0x28e75a0, 0xc000417710, 0x263bcc0, 0x4c9af30)
/usr/local/go/src/net/http/server.go:3264 +0x2f
net/http.(Server).Serve(0xc0003845b0, 0x2ff2ae0, 0xc0000d01f8, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:2859 +0x22d
net/http.(Server).ListenAndServe(0xc0003845b0, 0xc0003845b0, 0xd)
/usr/local/go/src/net/http/server.go:2797 +0xe4
net/http.ListenAndServe(...)
/usr/local/go/src/net/http/server.go:3037
main.main.func1(0xc00038e000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:359 +0x10d
created by main.main
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:356 +0x258
goroutine 12 [chan receive]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop(0xc0001cb6c0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/mux.go:207 +0x66
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.NewBroadcaster
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/mux.go:75 +0xcc
goroutine 151 [select, 2 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec140, 0xc000186000, 0xc001306d20, 0xc0009515c0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246
goroutine 13 [chan receive]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher.func1(0x2fc08c0, 0xc00051ac00, 0xc00040f3a0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:268 +0xa4
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:266 +0x6e
goroutine 11 [runnable]:
sync.(Cond).Broadcast(0xc0000d4380)
/usr/local/go/src/sync/cond.go:73 +0x91
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(clientConnReadLoop).processWindowUpdate(0xc000e81fb8, 0xc0009bb200, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2255 +0xf8
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(clientConnReadLoop).run(0xc000e81fb8, 0x2c38850, 0xc00001dfb8)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1727 +0x6ea
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(ClientConn).readLoop(0xc0000a3500)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1607 +0x76
created by k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*Transport).newClientConn
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:670 +0x637
goroutine 114 [select, 6 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).watchHandler(0xc0004ec0a0, 0x2fc0880, 0xc000d8e340, 0xc001173cc0, 0xc0000d2fc0, 0xc000186000, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:329 +0x1d9
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch(0xc0004ec0a0, 0xc000186000, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:300 +0x879
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run.func1()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:124 +0x33
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000694f78)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001173f78, 0x3b9aca00, 0x0, 0x1, 0xc000186000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run(0xc0004ec0a0, 0xc000186000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:123 +0x16b
created by k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes.NewUnschedulablePodInNamespaceLister
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:190 +0x1eb
goroutine 14 [select]:
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(ClientConn).roundTrip(0xc0000a3500, 0xc000737d00, 0x0, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1081 +0x8cc
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(Transport).RoundTripOpt(0xc000144d80, 0xc000737d00, 0xc000807200, 0x6bda66, 0x0, 0xc00015f7a0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:444 +0x159
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(Transport).RoundTrip(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:406
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip(0xc000144d80, 0xc000737d00, 0xc0015b6c80, 0x5, 0xc00015f828)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2536 +0x3f
net/http.(Transport).roundTrip(0xc00015f680, 0xc000737d00, 0x248fe20, 0xc00041ef01, 0xc0008a6580)
/usr/local/go/src/net/http/transport.go:430 +0xe90
net/http.(Transport).RoundTrip(0xc00015f680, 0xc000737d00, 0x2b645a5, 0xd, 0xc0008a6650)
/usr/local/go/src/net/http/roundtrip.go:17 +0x35
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport.(bearerAuthRoundTripper).RoundTrip(0xc000442960, 0xc000737c00, 0x2b607b9, 0xa, 0xc0008a64d8)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport/round_trippers.go:317 +0x268
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport.(userAgentRoundTripper).RoundTrip(0xc00047c2e0, 0xc000737b00, 0xc00047c2e0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport/round_trippers.go:167 +0x1c2
net/http.send(0xc000737b00, 0x2fb5660, 0xc00047c2e0, 0x0, 0x0, 0x0, 0xc0004e5550, 0xc0008078d0, 0x1, 0x0)
/usr/local/go/src/net/http/client.go:250 +0x461
net/http.(Client).send(0xc000442990, 0xc000737b00, 0x0, 0x0, 0x0, 0xc0004e5550, 0x0, 0x1, 0xc000cc85a0)
/usr/local/go/src/net/http/client.go:174 +0xfb
net/http.(Client).do(0xc000442990, 0xc000737b00, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/client.go:641 +0x279
net/http.(Client).Do(0xc000442990, 0xc000737b00, 0x0, 0x39, 0x2fb34c0)
/usr/local/go/src/net/http/client.go:509 +0x35
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest.(Request).request(0xc001824300, 0xc000807b80, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/request.go:737 +0x330
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest.(Request).Do(0xc001824300, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/request.go:809 +0xc5
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(events).CreateWithEventNamespace(0xc00035bc20, 0xc001597180, 0xc00007fdd0, 0x14d9b8e, 0xc00007fdc8)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:57 +0x25d
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(EventSinkImpl).Create(0xc00040f3c0, 0xc001597180, 0x280c8c0, 0xc001330320, 0x2ff6ea0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:155 +0x3d
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.recordEvent(0x2ff1220, 0xc00040f3c0, 0xc001597180, 0x0, 0x0, 0x0, 0xc000096000, 0xc00035bca0, 0x1)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:221 +0x12d
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.recordToSink(0x2ff1220, 0xc00040f3c0, 0xc001096780, 0xc00035bca0, 0xc00051ac30, 0x2540be400)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:189 +0x179
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartRecordingToSink.func1(0xc001096780)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:171 +0x5c
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(eventBroadcasterImpl).StartEventWatcher.func1(0x2fc08c0, 0xc00051ade0, 0xc00051adb0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:275 +0xe8
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:266 +0x6e
goroutine 128 [select, 2 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec500, 0xc000186000, 0xc000c85b00, 0xc000a190e0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246
goroutine 150 [select, 2 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch.func2(0xc0004ec0a0, 0xc000186000, 0xc0001873e0, 0xc0000d2fc0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246
goroutine 83 [chan receive]:
main.run(0xc00038e000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:325 +0x1eb
main.main.func2(0x2ff65e0, 0xc0001ca740)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec
goroutine 115 [select, 6 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).watchHandler(0xc0004ec140, 0x2fc0880, 0xc0001819c0, 0xc001175cc0, 0xc0009515c0, 0xc000186000, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:329 +0x1d9
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).ListAndWatch(0xc0004ec140, 0xc000186000, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:300 +0x879
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run.func1()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:124 +0x33
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000364f78)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001175f78, 0x3b9aca00, 0x0, 0x1, 0xc000186000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(Reflector).Run(0xc0004ec140, 0xc000186000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:123 +0x16b
created by k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes.NewScheduledPodLister
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:214 +0x1d9
We got the same kind of message, and our config is similar to what @suneeta-mall posted in terms of memory and CPU requests (300Mi RAM, 100m CPU).
I don't know about the details but my issue got solved by cleaning up all the completed pods from the cluster.
I had about 5-8k pods and even running kubectl get pods --all-namespaces took a long while.
After deleting the unneeded pods all is back to working correctly.
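To see how many finished pods a cluster is carrying, the output of kubectl get pods --all-namespaces -o json can be filtered by phase. The following is a minimal sketch, assuming "completed" means pods whose status.phase is Succeeded or Failed. The function name and the sample data are hypothetical and only mimic the shape of that JSON.

```python
# Phases treated as "finished" in this sketch (an assumption about what
# "completed pods" means in the comment above).
FINISHED_PHASES = {"Succeeded", "Failed"}


def finished_pods(pod_list):
    """Return (namespace, name) for every pod in a finished phase."""
    result = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase")
        if phase in FINISHED_PHASES:
            meta = pod["metadata"]
            result.append((meta["namespace"], meta["name"]))
    return result


# Hypothetical sample in the shape of `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"namespace": "batch", "name": "job-a-1"},
         "status": {"phase": "Succeeded"}},
        {"metadata": {"namespace": "default", "name": "web-0"},
         "status": {"phase": "Running"}},
        {"metadata": {"namespace": "batch", "name": "job-b-1"},
         "status": {"phase": "Failed"}},
    ]
}

finished = finished_pods(sample)
print(f"{len(finished)} finished pods: {finished}")
```

In practice the dictionary would come from parsing the real kubectl output with json.load. The sketch only lists candidates and deletes nothing.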
Also had the same thing as @Pluies
I had 3 clusters with the same config but only one of them had that issue.
After v1.17.0, some permissions need to be added to rbac ClusterRole:
- apiGroups:
    - storage.k8s.io
  resources:
    - storageclasses
    - csinodes
  verbs:
    - watch
    - list
    - get
- apiGroups:
    - coordination.k8s.io
  resources:
    - leases
  verbs:
    - watch
    - list
    - get
    - create
    - patch
    - update
I have the same problem:
I0514 05:08:51.277989 1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F0514 05:08:51.278016 1 main.go:409] lost master
I am running auto scaler version 1.15.6
For what it's worth, if I set the following, it crashes less often. I think it really cuts down the number of k8s API calls, so there is less chance of crashing.
- --leader-elect=false
I have also seen that most people run CA with replicas: 1 and forget to check that leader-elect defaults to true, according to the FAQ:

Parameter | Description | Default
-- | -- | --
leader-elect | Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability | true
If this is set to false, as suggested by @tkbrex, the election process is disabled and the lost master error will not appear.
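Why a slow apiserver ends in "lost master" can be illustrated with a simplified model of the lease renew loop. This is a sketch under stated assumptions, not the client-go implementation: it assumes renew attempts run one after another, that a failed attempt is retried after the retry period, and that leadership is lost once the renew deadline passes without a successful attempt. The 10 s deadline and 2 s retry period are assumed values, and the function name and latencies are hypothetical.

```python
def renews_in_time(attempts, renew_deadline=10.0, retry_period=2.0):
    """Simplified model of one lease renewal window.

    attempts: list of (latency_seconds, succeeded) tuples, one per call
    to the apiserver. Returns True if some attempt succeeds before the
    renew deadline, False if the deadline passes first.
    """
    clock = 0.0
    for latency, succeeded in attempts:
        finish = clock + latency
        if finish > renew_deadline:
            # Deadline hit while still waiting on the apiserver.
            return False
        if succeeded:
            return True
        # Failed attempt: wait one retry period before trying again.
        clock = finish + retry_period
    # Ran out of modelled attempts without a success.
    return False


healthy = renews_in_time([(0.05, True)])
recovers = renews_in_time([(1.0, False), (0.5, True)])
overloaded = renews_in_time([(3.0, False), (3.0, False), (3.0, False)])
print(healthy, recovers, overloaded)
```

In this model a healthy apiserver renews on the first try, a brief hiccup is absorbed by a retry, and a run of slow failures exhausts the deadline. With --leader-elect=false the loop never runs, so the process cannot exit this way, but it also loses the protection against two instances acting at the same time.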
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
I have also seen that most people run CA with replicas: 1 and forget to check that leader-elect defaults to true, according to the FAQ
Is disabling leader election really recommended? All of the official examples I'm aware of specify replicas: 1 but keep the default value for leader-elect.
Even when running replicas: 1, wouldn't leader election be necessary during rolling updates of the CA deployment? Otherwise, I would think there'd be periods where you could have multiple CA pods stepping on each other.
We're seeing the same issue on our EKS cluster with 40+ nodes, running 1.16.5.
I0111 09:12:15.398008 1 reflector.go:496] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356: Watch close - *v1.StatefulSet total 0 items received
E0111 09:12:26.102040 1 leaderelection.go:356] Failed to update lock: Put https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler: context deadline exceeded
I0111 09:12:27.499348 1 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"", Name:"", UID:"", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-7564f9cf59-q287j stopped leading
I0111 09:12:27.698012 1 leaderelection.go:277] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0111 09:12:28.597994 1 main.go:426] lost master
/reopen
/remove-lifecycle rotten
@svaranasi-traderev: You can't reopen an issue/PR unless you authored it or you are a collaborator.