Autoscaler: failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew

Created on 6 Feb 2019 · 29 Comments · Source: kubernetes/autoscaler

I am running Kubernetes 1.12.5 with etcd3 and cluster-autoscaler v1.2.2 on AWS. The cluster is healthy and everything is operational, but after some scaling activity the cluster autoscaler goes into a crash loop with the following error:

I0205 23:32:52.241463 1 leaderelection.go:249] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F0205 23:32:52.241542 1 main.go:384] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0xc000022100, 0xc000574000, 0x37, 0xee)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:828 +0xd4
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).output(0x4333560, 0xc000000003, 0xc00056e000, 0x429c819, 0x7, 0x180, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:779 +0x306
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).printf(0x4333560, 0x3, 0x26f2036, 0xb, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:678 +0x14b
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(0x26f2036, 0xb, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1207 +0x67
main.main.func3()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:384 +0x47
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000668000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:163 +0x40
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000668000, 0x29c4b00, 0xc000591dc0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:172 +0x112
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x29c4b40, 0xc000046040, 0x29cbd20, 0xc0001e6a20, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00001f030, 0x27baac0, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:184 +0x99
main.main()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:372 +0x5cf
I0205 23:32:52.241724 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"e78ccdca-2440-11e9-8514-0a1153ba0cc4", APIVersion:"v1", ResourceVersion:"6949892", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-57f79874cf-c45xb stopped leading
I0205 23:32:52.745013 1 auto_scaling_groups.go:124] Registering ASG XXXX

Everything else in the cluster seems to work perfectly fine, and the masters, the cluster and etcd are all healthy.
Is there any way to resolve this issue?

cluster-autoscaler

All 29 comments

This suggests that either CA has a problem reaching the apiserver or the apiserver is unhealthy. Can you check whether the same happens to other system components (e.g. kube-controller-manager)? They use the same generic Kubernetes leader election library that CA uses. Usually when I see this problem it's because of an overloaded apiserver, and it impacts multiple controllers.

@MaciekPytel That's what I thought, but the rest of the cluster, including all kube-system components, works fine. None of them has restarted.

To rule out version skew as the cause (Kubernetes 1.12.5 and Cluster Autoscaler 1.2.2), can you please try a newer version of the autoscaler? Recommended versions:

  • Kubernetes 1.10.* with CA 1.2.*
  • Kubernetes 1.11.* with CA 1.3.*
  • Kubernetes 1.12.* with CA 1.12.* (we've changed versioning scheme to match Kubernetes' minor version)

Ah OK, thanks for letting me know @aleksandra-malinowska. I will try out 1.12.

@aleksandra-malinowska That did not help; I observed the same crash-loop behaviour. It was interesting to note that the problem surfaced only when the number of nodes the autoscaler managed was about 200 or more. Every time I brought the number of nodes down from anywhere in the 200 to 1,000 range to 150 or fewer, the autoscaler recovered and functioned properly. The rest of the kube-system components remained functional throughout.
Does this help in identifying where the bottleneck might be? I can confirm I have run various autoscaler versions (1.12.x and 1.13.1) and see the same behaviour: the autoscaler goes into a crash frenzy when the number of nodes exceeds roughly 200 and recovers when it comes back down.

@suneeta-mall can you provide logs with strace for CA 1.12.*? That would make it easier to find the place where it fails. Can you also provide the deployment script? It would help to understand which options were enabled, whether you have memory limits, etc. It is also useful to have the full logs.

Possible problem: too many queries, which kube-apiserver and etcd could not handle. You can monitor the logs, CPU and memory of etcd and the apiserver.

@miry Yeah, sure. I will work on getting the logs. Here's the deployment script:

apiVersion: v1
kind: ServiceAccount
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
rules: 
  - 
    apiGroups: 
      - ""
    resources: 
      - events
      - endpoints
    verbs: 
      - create
      - patch
  - 
    apiGroups: 
      - ""
    resources: 
      - pods/eviction
    verbs: 
      - create
  - 
    apiGroups: 
      - ""
    resources: 
      - pods/status
    verbs: 
      - update
  - 
    apiGroups: 
      - ""
    resourceNames: 
      - "cluster-autoscaler"
    resources: 
      - endpoints
    verbs: 
      - get
      - update
  - 
    apiGroups: 
      - ""
    resources: 
      - nodes
    verbs: 
      - watch
      - list
      - get
      - update
  - 
    apiGroups: 
      - ""
    resources: 
      - pods
      - services
      - replicationcontrollers
      - persistentvolumeclaims
      - persistentvolumes
    verbs: 
      - watch
      - list
      - get
  - 
    apiGroups: 
      - extensions
    resources: 
      - replicasets
      - daemonsets
    verbs: 
      - watch
      - list
      - get
  - 
    apiGroups: 
      - policy
    resources: 
      - poddisruptionbudgets
    verbs: 
      - watch
      - list
  - 
    apiGroups: 
      - apps
    resources: 
      - statefulsets
    verbs: 
      - watch
      - list
      - get
  - 
    apiGroups: 
      - storage.k8s.io
    resources: 
      - storageclasses
    verbs: 
      - watch
      - list
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
rules: 
  - 
    apiGroups: 
      - ""
    resources: 
      - configmaps
    verbs: 
      - create
  - 
    apiGroups: 
      - ""
    resourceNames: 
      - "cluster-autoscaler-status"
    resources: 
      - configmaps
    verbs: 
      - delete
      - get
      - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
roleRef: 
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: "cluster-autoscaler"
subjects: 
  - 
    kind: ServiceAccount
    name: "cluster-autoscaler"
    namespace: "kube-system"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: 
  labels: 
    "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    "k8s-app": "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
roleRef: 
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: "cluster-autoscaler"
subjects: 
  - 
    kind: ServiceAccount
    name: "cluster-autoscaler"
    namespace: "kube-system"
---
apiVersion: apps/v1
kind: Deployment
metadata: 
  labels: 
    app: "cluster-autoscaler"
  name: "cluster-autoscaler"
  namespace: "kube-system"
spec: 
  replicas: 1
  selector: 
    matchLabels: 
      app: "cluster-autoscaler"
  template: 
    metadata: 
      annotations: 
        ad.datadoghq.com/nginx.logs: "[{\"source\":\"autoscaler\",\"service\":\"autoscaler\"}]"
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
        scheduler.alpha.kubernetes.io/tolerations: "[{\"key\":\"dedicated\", \"value\":\"master\"}]"
      labels: 
        app: "cluster-autoscaler"
        "k8s-addon": "cluster-autoscaler.addons.k8s.io"
    spec: 
      containers: 
        - 
          command: 
            - "./cluster-autoscaler"
            - "--v=4"
            - "--stderrthreshold=info"
            - "--cloud-provider=aws"
            - "--skip-nodes-with-system-pods=false"
            - "--skip-nodes-with-local-storage=false"
            - "--expander=most-pods"
            - "--ignore-daemonsets-utilization=true"
            - "--ignore-mirror-pods-utilization=true"
            - "--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/mycluster.com"
          env: 
            - 
              name: AWS_REGION
              value: "ap-southeast-2"
          image: "k8s.gcr.io/cluster-autoscaler:v1.13.1"
          imagePullPolicy: Always
          livenessProbe: 
            httpGet: 
              path: "/health-check"
              port: 8085
          name: "cluster-autoscaler"
          readinessProbe: 
            httpGet: 
              path: "/health-check"
              port: 8085
          resources: 
            limits: 
              cpu: 100m
              memory: 300Mi
            requests: 
              cpu: 100m
              memory: 300Mi
          volumeMounts: 
            - 
              mountPath: "/etc/ssl/certs/ca-certificates.crt"
              name: "ssl-certs"
              readOnly: true
      dnsPolicy: Default
      nodeSelector: 
        kubernetes.io/role: master
      serviceAccountName: "cluster-autoscaler"
      tolerations: 
        - 
          effect: NoSchedule
          key: "node-role.kubernetes.io/master"
      volumes: 
        - 
          hostPath: 
            path: "/etc/ssl/certs/ca-certificates.crt"
          name: "ssl-certs"

Are there any instructions for getting logs with strace when the issue results in a crash? I assume you mean wrapping the autoscaler command with strace and sending the logs. Is that enough, or are there more specific details you are after?

As for possible problems, I agree it's certainly possible that the apiserver is getting too many queries, but all other cluster resources, including kube-system resources and my own workload, seem to chug along okay. To my knowledge, only the autoscaler fails. Is it possible the autoscaler is making too many calls and getting rate-limited? I have not seen much in the logs to indicate that, but I will keep an eye on it and update with what I find.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@suneeta-mall sorry for the delay in responding here.
Could you include the logs for the failed deploys, i.e., kubectl logs cluster-autoscaler-pod -n kube-system -p, please?

/remove-lifecycle rotten

@alejandrox1 The log is already attached in the description (see "lost master" there). The kube masters and all other kube components seem to function fine; only the autoscaler fails.

@suneeta-mall how did you create the cluster? Would you happen to have a copy of the code somewhere?

@alejandrox1 It was created with kops on AWS. Is there anything specific you are looking for? A very basic version can be created with the following snippet, which is the foundation of the k8s cluster used in this case. The etcd version is 3.x.

kops create cluster ${NAME} \
    --cloud aws \
    --master-zones ${ZONES} \
    --master-size m4.xlarge \
    --node-size m4.xlarge \
    --zones ${ZONES} \
    --topology public \
    --networking flannel \
    --kubernetes-version 1.12.8 \
    --dns-zone XXX \
    --encrypt-etcd-storage

I had a similar issue on my cluster (using EKS):

F0802 00:10:57.242174 1 main.go:384] lost master
I0802 00:10:57.242128 1 leaderelection.go:249] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
I0802 00:10:57.244543 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"1fc342a0-4b63-11e9-b984-02635bc9a4cc", APIVersion:"v1", ResourceVersion:"27196690", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-aws-cluster-autoscaler-59fbbcb794-7kzfv stopped leading

Then the pod died and restarted. It seems to be a hiccup, but I would like to know why it happened.

We're running into similar issues on a very "scaly" EKS cluster here (quite a bit of up-and-down activity during the day); our other, more stable clusters do not seem to run into the issue.
I've also noticed that this pod sometimes gets OOMKilled, so I'll try to add more memory first and will report back if it helped 👍

/remove-lifecycle stale

Happened for us as well:
Cluster: "v1.15.4"
Cloud: Azure
Autoscaler version: 1.15.2

I1123 18:51:25.870541 1 scale_down.go:771] No candidates for scale down
I1123 18:51:47.848093 1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F1123 18:51:47.848126 1 main.go:406] lost master

goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0x4cb5f01, 0x3, 0xc000678000, 0x37)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:900 +0xb1
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).output(0x4cb5fa0, 0xc000000003, 0xc000477340, 0x4c19bb1, 0x7, 0x196, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:815 +0xe6
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).printf(0x4cb5fa0, 0x3, 0x2b62471, 0xb, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:727 +0x14e
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1309
main.main.func3()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:406 +0x5c
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc00026c7e0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:193 +0x40
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc00026c7e0, 0x2ff65e0, 0xc0001ca740)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:202 +0x10f
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x2ff6620, 0xc0000cc018, 0x3026ee0, 0xc0002ec280, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00040f3e0, 0x2c39cc8, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:214 +0x96
main.main()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:394 +0x6ec

goroutine 19 [syscall, 241 minutes]:
os/signal.signal_recv(0x0)
/usr/local/go/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 20 [chan receive]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).flushDaemon(0x4cb5fa0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1035 +0x8b
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.init.0
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:404 +0x6c

goroutine 50 [IO wait, 241 minutes]:
internal/poll.runtime_pollWait(0x7fc633d894f0, 0x72, 0x0)
/usr/local/go/src/runtime/netpoll.go:182 +0x56
internal/poll.(*pollDesc).wait(0xc0004fa198, 0x72, 0x0, 0x0, 0x2b5d3c7)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x9b
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Accept(0xc0004fa180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:384 +0x1ba
net.(*netFD).accept(0xc0004fa180, 0x28e75a0, 0x50, 0xc00038ef50)
/usr/local/go/src/net/fd_unix.go:238 +0x42
net.(*TCPListener).accept(0xc0000d01f8, 0xc000070700, 0x7fc633dd9b28, 0xc0002a8000)
/usr/local/go/src/net/tcpsock_posix.go:139 +0x32
net.(*TCPListener).AcceptTCP(0xc0000d01f8, 0x40dc28, 0x30, 0x28e75a0)
/usr/local/go/src/net/tcpsock.go:247 +0x48
net/http.tcpKeepAliveListener.Accept(0xc0000d01f8, 0x28e75a0, 0xc000417710, 0x263bcc0, 0x4c9af30)
/usr/local/go/src/net/http/server.go:3264 +0x2f
net/http.(*Server).Serve(0xc0003845b0, 0x2ff2ae0, 0xc0000d01f8, 0x0, 0x0)
/usr/local/go/src/net/http/server.go:2859 +0x22d
net/http.(*Server).ListenAndServe(0xc0003845b0, 0xc0003845b0, 0xd)
/usr/local/go/src/net/http/server.go:2797 +0xe4
net/http.ListenAndServe(...)
/usr/local/go/src/net/http/server.go:3037
main.main.func1(0xc00038e000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:359 +0x10d
created by main.main
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:356 +0x258

goroutine 12 [chan receive]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop(0xc0001cb6c0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/mux.go:207 +0x66
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch.NewBroadcaster
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/watch/mux.go:75 +0xcc

goroutine 151 [select, 2 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func2(0xc0004ec140, 0xc000186000, 0xc001306d20, 0xc0009515c0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246

goroutine 13 [chan receive]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1(0x2fc08c0, 0xc00051ac00, 0xc00040f3a0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:268 +0xa4
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:266 +0x6e

goroutine 11 [runnable]:
sync.(*Cond).Broadcast(0xc0000d4380)
/usr/local/go/src/sync/cond.go:73 +0x91
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*clientConnReadLoop).processWindowUpdate(0xc000e81fb8, 0xc0009bb200, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2255 +0xf8
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*clientConnReadLoop).run(0xc000e81fb8, 0x2c38850, 0xc00001dfb8)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1727 +0x6ea
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*ClientConn).readLoop(0xc0000a3500)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1607 +0x76
created by k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*Transport).newClientConn
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:670 +0x637

goroutine 114 [select, 6 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc0004ec0a0, 0x2fc0880, 0xc000d8e340, 0xc001173cc0, 0xc0000d2fc0, 0xc000186000, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:329 +0x1d9
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0004ec0a0, 0xc000186000, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:300 +0x879
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:124 +0x33
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000694f78)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001173f78, 0x3b9aca00, 0x0, 0x1, 0xc000186000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0004ec0a0, 0xc000186000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:123 +0x16b
created by k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes.NewUnschedulablePodInNamespaceLister
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:190 +0x1eb

goroutine 14 [select]:
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*ClientConn).roundTrip(0xc0000a3500, 0xc000737d00, 0x0, 0x0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:1081 +0x8cc
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt(0xc000144d80, 0xc000737d00, 0xc000807200, 0x6bda66, 0x0, 0xc00015f7a0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:444 +0x159
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.(*Transport).RoundTrip(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:406
k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip(0xc000144d80, 0xc000737d00, 0xc0015b6c80, 0x5, 0xc00015f828)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/golang.org/x/net/http2/transport.go:2536 +0x3f
net/http.(*Transport).roundTrip(0xc00015f680, 0xc000737d00, 0x248fe20, 0xc00041ef01, 0xc0008a6580)
/usr/local/go/src/net/http/transport.go:430 +0xe90
net/http.(*Transport).RoundTrip(0xc00015f680, 0xc000737d00, 0x2b645a5, 0xd, 0xc0008a6650)
/usr/local/go/src/net/http/roundtrip.go:17 +0x35
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport.(*bearerAuthRoundTripper).RoundTrip(0xc000442960, 0xc000737c00, 0x2b607b9, 0xa, 0xc0008a64d8)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport/round_trippers.go:317 +0x268
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip(0xc00047c2e0, 0xc000737b00, 0xc00047c2e0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/transport/round_trippers.go:167 +0x1c2
net/http.send(0xc000737b00, 0x2fb5660, 0xc00047c2e0, 0x0, 0x0, 0x0, 0xc0004e5550, 0xc0008078d0, 0x1, 0x0)
/usr/local/go/src/net/http/client.go:250 +0x461
net/http.(*Client).send(0xc000442990, 0xc000737b00, 0x0, 0x0, 0x0, 0xc0004e5550, 0x0, 0x1, 0xc000cc85a0)
/usr/local/go/src/net/http/client.go:174 +0xfb
net/http.(*Client).do(0xc000442990, 0xc000737b00, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/client.go:641 +0x279
net/http.(*Client).Do(0xc000442990, 0xc000737b00, 0x0, 0x39, 0x2fb34c0)
/usr/local/go/src/net/http/client.go:509 +0x35
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest.(*Request).request(0xc001824300, 0xc000807b80, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/request.go:737 +0x330
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest.(*Request).Do(0xc001824300, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/rest/request.go:809 +0xc5
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(*events).CreateWithEventNamespace(0xc00035bc20, 0xc001597180, 0xc00007fdd0, 0x14d9b8e, 0xc00007fdc8)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:57 +0x25d
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(*EventSinkImpl).Create(0xc00040f3c0, 0xc001597180, 0x280c8c0, 0xc001330320, 0x2ff6ea0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:155 +0x3d
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.recordEvent(0x2ff1220, 0xc00040f3c0, 0xc001597180, 0x0, 0x0, 0x0, 0xc000096000, 0xc00035bca0, 0x1)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:221 +0x12d
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.recordToSink(0x2ff1220, 0xc00040f3c0, 0xc001096780, 0xc00035bca0, 0xc00051ac30, 0x2540be400)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:189 +0x179
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartRecordingToSink.func1(0xc001096780)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:171 +0x5c
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1(0x2fc08c0, 0xc00051ade0, 0xc00051adb0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:275 +0xe8
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/record/event.go:266 +0x6e

goroutine 128 [select, 2 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func2(0xc0004ec500, 0xc000186000, 0xc000c85b00, 0xc000a190e0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246

goroutine 150 [select, 2 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func2(0xc0004ec0a0, 0xc000186000, 0xc0001873e0, 0xc0000d2fc0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:235 +0x150
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:229 +0x246

goroutine 83 [chan receive]:
main.run(0xc00038e000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:325 +0x1eb
main.main.func2(0x2ff65e0, 0xc0001ca740)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

goroutine 115 [select, 6 minutes]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc0004ec140, 0x2fc0880, 0xc0001819c0, 0xc001175cc0, 0xc0009515c0, 0xc000186000, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:329 +0x1d9
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0004ec140, 0xc000186000, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:300 +0x879
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:124 +0x33
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000364f78)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001175f78, 0x3b9aca00, 0x0, 0x1, 0xc000186000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0004ec140, 0xc000186000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/cache/reflector.go:123 +0x16b
created by k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes.NewScheduledPodLister
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:214 +0x1d9

We got the same kind of message, and our config is similar to what @suneeta-mall posted in terms of memory and CPU requests (300 MB RAM, 100m CPU).
I don't know the details, but my issue was solved by cleaning up all the completed pods in the cluster.
I had about 5-8k pods, and even running kubectl get pods --all-namespaces took a long while.
After deleting the unneeded pods, everything is back to working correctly.
I also saw the same thing as @Pluies:
I had 3 clusters with the same config, but only one of them had the issue.

After v1.17.0, some permissions need to be added to the RBAC ClusterRole:

  - apiGroups:
    - storage.k8s.io
    resources:
    - storageclasses
    - csinodes
    verbs:
    - watch
    - list
    - get
  - apiGroups:
    - coordination.k8s.io
    resources:
    - leases
    verbs:
    - watch
    - list
    - get
    - create
    - patch
    - update
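
For context, here is a rough sketch of where those rules would sit in a complete ClusterRole manifest. The role name `cluster-autoscaler` is an assumption (use whichever role your ClusterRoleBinding references), and the role's other existing rules are left out.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler   # assumed name; match your existing ClusterRole
rules:
  # Existing rules for the autoscaler stay here (omitted in this sketch).
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes"]
    verbs: ["watch", "list", "get"]
  # Leases are the lock object used for leader election.
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["watch", "list", "get", "create", "patch", "update"]
```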

I have the same problem:

I0514 05:08:51.277989       1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F0514 05:08:51.278016       1 main.go:409] lost master

I am running autoscaler version 1.15.6.

For what it's worth, if I do the following, it crashes less often. I think it cuts down on the Kubernetes API calls, which leaves less chance of crashing.

        - --leader-elect=false
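
As a sketch of where this flag would typically go, assuming the autoscaler runs as a Deployment with a container named `cluster-autoscaler` (both the container name and the `./cluster-autoscaler` entrypoint are assumptions about your manifest, and all other flags are omitted):

```yaml
spec:
  replicas: 1   # without leader election, extra replicas would all act at once
  template:
    spec:
      containers:
        - name: cluster-autoscaler   # assumed container name
          command:
            - ./cluster-autoscaler   # assumed entrypoint
            - --leader-elect=false   # disables the lease-based election
```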

I have also seen that most people run CA with replicas: 1 and forget to check the default value of leader-elect, which is true according to the FAQ:

Parameter | Description | Default
-- | -- | --
leader-elect | Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability | true

If this is set to false, as suggested by @tkbrex, the election process is disabled and we will not see this lost master error.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

I have also seen most people are running on replicas(1) of CA and forgetting to check the default value for leader-elect=true according to the FAQs

Is disabling leader election really recommended? All of the official examples I'm aware of specify replicas: 1 but keep the default value for leader-elect.

Even when running replicas: 1, wouldn't leader election be necessary during rolling updates of the CA deployment? Otherwise, I would think there'd be periods where you could have multiple CA pods stepping on each other.
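
On the rolling-update concern, one possible mitigation can be sketched as follows. This is an untested assumption, not an official recommendation: if leader election is disabled, the Deployment's `Recreate` strategy terminates the old pod before creating the new one, so two pods should not overlap during an update. Keeping the default leader-elect=true, as the official examples do, avoids the question entirely. The container name and entrypoint below are assumptions.

```yaml
spec:
  replicas: 1
  strategy:
    type: Recreate   # old pod is terminated before the new one is created
  template:
    spec:
      containers:
        - name: cluster-autoscaler   # assumed container name
          command:
            - ./cluster-autoscaler   # assumed entrypoint
            - --leader-elect=false   # only if you choose to disable election
```

The trade-off is a short window with no autoscaler running while the replacement pod starts.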

We're seeing the same issue on our EKS cluster with 40+ nodes, running 1.16.5.

I0111 09:12:15.398008       1 reflector.go:496] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356: Watch close - *v1.StatefulSet total 0 items received

E0111 09:12:26.102040       1 leaderelection.go:356] Failed to update lock: Put https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler: context deadline exceeded

I0111 09:12:27.499348       1 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"", Name:"", UID:"", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-7564f9cf59-q287j stopped leading

I0111 09:12:27.698012       1 leaderelection.go:277] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition

F0111 09:12:28.597994       1 main.go:426] lost master

/reopen
/remove-lifecycle rotten

@svaranasi-traderev: You can't reopen an issue/PR unless you authored it or you are a collaborator.

