Environment:
three Fedora 31 instances in OpenStack, 4 CPU / 8 GB memory each
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
[fedora@fcos-master-dev ~]$ uname -srm
Linux 5.3.7-301.fc31.x86_64 x86_64
[fedora@fcos-master-dev ~]$ cat /etc/os-release
NAME=Fedora
VERSION="31 (Cloud Edition)"
ID=fedora
VERSION_ID=31
VERSION_CODENAME=""
PLATFORM_ID="platform:f31"
PRETTY_NAME="Fedora 31 (Cloud Edition)"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:31"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f31/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=31
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=31
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Cloud Edition"
VARIANT_ID=cloud
Version of Ansible (ansible --version): ansible 2.9.6
config file = None
configured module search path = ['/Users/hgye/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /Users/hgye/Envs/kubespray/lib/python3.7/site-packages/ansible
executable location = /Users/hgye/Envs/kubespray/bin/ansible
python version = 3.7.4 (default, Aug 13 2019, 15:17:50) [Clang 4.0.1 (tags/RELEASE_401/final)]
Version of Python (python --version): Python 3.7.4
Kubespray version (commit) (git rev-parse --short HEAD):
current master
5e225744
Network plugin used:
calico
Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
https://gist.github.com/hgye/b57553036f908493be4df59e64f0ca1c
Command used to invoke ansible:
ansible-playbook -i inventory/onuware-dev/inventory.ini --become --become-user=root upgrade-cluster.yml --private-key=/Users/hgye/Downloads/openstack-key/onu-k8s-dev.pem
Output of ansible run:
https://gist.github.com/hgye/3831c78add843dc048359b286d53e01c
Anything else we need to know:
hgye@HuaiguangYedeMacBook-Pro ~/git/kubespray (master) $ kubectl get all -n kube-system
NAME READY STATUS RESTARTS AGE
pod/calico-kube-controllers-fbcb89fbd-v9j6f 1/1 Running 1 87m
pod/calico-node-4ddpr 1/1 Running 1 87m
pod/calico-node-kwpdw 1/1 Running 5 93m
pod/calico-node-xw44x 1/1 Running 4 89m
pod/coredns-dff8fc7d-4665f 1/1 Running 1 77m
pod/coredns-dff8fc7d-t46cx 1/1 Running 1 77m
pod/dns-autoscaler-66498f5c5f-nftlf 1/1 Running 1 87m
pod/kube-apiserver-fcos-master-dev 1/1 Running 17 4h29m
pod/kube-controller-manager-fcos-master-dev 0/1 CrashLoopBackOff 50 121m
pod/kube-proxy-5x2hs 1/1 Running 1 3h1m
pod/kube-proxy-hsb8n 1/1 Running 1 165m
pod/kube-proxy-x5j57 1/1 Running 1 175m
pod/kube-scheduler-fcos-master-dev 0/1 CrashLoopBackOff 19 4h28m
pod/kubernetes-dashboard-57777fbdcb-7nvq2 1/1 Running 5 87m
pod/kubernetes-metrics-scraper-54fbb4d595-7d2n5 1/1 Running 1 87m
pod/nginx-proxy-fcos-slave1-dev 1/1 Running 5 8h
pod/nginx-proxy-fcos-slave2-dev 1/1 Running 5 8h
pod/nodelocaldns-52w6z 1/1 Running 5 8h
pod/nodelocaldns-8pr77 1/1 Running 5 8h
pod/nodelocaldns-lmjz5 1/1 Running 5 8h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/coredns ClusterIP 10.233.0.3 <none> 53/UDP,53/TCP,9153/TCP 8h
service/dashboard-metrics-scraper ClusterIP 10.233.18.165 <none> 8000/TCP 8h
service/kubernetes-dashboard ClusterIP 10.233.49.42 <none> 443/TCP 8h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/calico-node 3 3 3 3 3 <none> 8h
daemonset.apps/kube-proxy 3 3 3 3 3 kubernetes.io/os=linux 8h
daemonset.apps/nodelocaldns 3 3 3 3 3 <none> 8h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/calico-kube-controllers 1/1 1 1 8h
deployment.apps/coredns 2/2 2 2 84m
deployment.apps/dns-autoscaler 1/1 1 1 8h
deployment.apps/kubernetes-dashboard 1/1 1 1 8h
deployment.apps/kubernetes-metrics-scraper 1/1 1 1 8h
NAME DESIRED CURRENT READY AGE
replicaset.apps/calico-kube-controllers-fbcb89fbd 1 1 1 8h
replicaset.apps/coredns-dff8fc7d 2 2 2 77m
replicaset.apps/dns-autoscaler-66498f5c5f 1 1 1 8h
replicaset.apps/kubernetes-dashboard-57777fbdcb 1 1 1 8h
replicaset.apps/kubernetes-metrics-scraper-54fbb4d595 1 1 1 8h
$ kubectl get cs
NAME STATUS MESSAGE ERROR
scheduler Unhealthy Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager Unhealthy Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
hgye@HuaiguangYedeMacBook-Pro ~/git/kubespray (master) $ kubectl describe pod kube-scheduler-fcos-master-dev -n kube-system
Name: kube-scheduler-fcos-master-dev
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: fcos-master-dev/192.168.30.118
Start Time: Wed, 15 Jul 2020 17:36:05 +0800
Labels: component=kube-scheduler
tier=control-plane
Annotations: kubernetes.io/config.hash: 22e409dccec0667ed93153f76c5b2111
kubernetes.io/config.mirror: 22e409dccec0667ed93153f76c5b2111
kubernetes.io/config.seen: 2020-07-15T08:52:06.525395751Z
kubernetes.io/config.source: file
Status: Running
IP: 192.168.30.118
IPs:
IP: 192.168.30.118
Controlled By: Node/fcos-master-dev
Containers:
kube-scheduler:
Container ID: docker://8471885ba8976eadb792ef4509efd012b3adfa823caf97491119bb9ce0e00b93
Image: k8s.gcr.io/kube-scheduler:v1.18.5
Image ID: docker-pullable://k8s.gcr.io/kube-scheduler@sha256:6c8327a5a2527cc990e2bcfb4386220a025c42351546f3cc577efbac2d0b0efe
Port: <none>
Host Port: <none>
Command:
kube-scheduler
--authentication-kubeconfig=/etc/kubernetes/scheduler.conf
--authorization-kubeconfig=/etc/kubernetes/scheduler.conf
--bind-address=0.0.0.0
--kubeconfig=/etc/kubernetes/scheduler.conf
--leader-elect=true
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Wed, 15 Jul 2020 21:21:22 +0800
Finished: Wed, 15 Jul 2020 21:25:49 +0800
Ready: False
Restart Count: 20
Requests:
cpu: 100m
Liveness: http-get https://:10259/healthz delay=15s timeout=15s period=10s #success=1 #failure=8
Environment:
HTTP_PROXY: http://192.168.20.1:9091/
http_proxy: http://192.168.20.1:9091/
NO_PROXY: 192.168.30.118,fcos-master-dev,fcos-master-dev.cluster.onuware-dev.loc,192.168.30.239,fcos-slave1-dev,fcos-slave1-dev.cluster.onuware-dev.loc,192.168.30.250,fcos-slave2-dev,fcos-slave2-dev.cluster.onuware-dev.loc,127.0.0.1,localhost,10.233.0.0/18,10.233.64.0/18
no_proxy: 192.168.30.118,fcos-master-dev,fcos-master-dev.cluster.onuware-dev.loc,192.168.30.239,fcos-slave1-dev,fcos-slave1-dev.cluster.onuware-dev.loc,192.168.30.250,fcos-slave2-dev,fcos-slave2-dev.cluster.onuware-dev.loc,127.0.0.1,localhost,10.233.0.0/18,10.233.64.0/18
HTTPS_PROXY: http://192.168.20.1:9091/
https_proxy: http://192.168.20.1:9091/
Mounts:
/etc/kubernetes/scheduler.conf from kubeconfig (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubeconfig:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/scheduler.conf
HostPathType: FileOrCreate
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 4m24s (x13 over 70m) kubelet, fcos-master-dev Liveness probe failed: Get https://192.168.30.118:10259/healthz: dial tcp 192.168.30.118:10259: connect: connection refused
Warning BackOff 92s (x205 over 76m) kubelet, fcos-master-dev Back-off restarting failed container
hgye@HuaiguangYedeMacBook-Pro ~/git/kubespray (master) $ kubectl describe pod kube-controller-manager-fcos-master-dev -n kube-system
Name: kube-controller-manager-fcos-master-dev
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: fcos-master-dev/192.168.30.118
Start Time: Wed, 15 Jul 2020 15:57:19 +0800
Labels: component=kube-controller-manager
tier=control-plane
Annotations: kubernetes.io/config.hash: 1e3b725cfad100018d51ae8adef941d1
kubernetes.io/config.mirror: 1e3b725cfad100018d51ae8adef941d1
kubernetes.io/config.seen: 2020-07-15T10:17:08.798542813Z
kubernetes.io/config.source: file
Status: Running
IP: 192.168.30.118
IPs:
IP: 192.168.30.118
Controlled By: Node/fcos-master-dev
Containers:
kube-controller-manager:
Container ID: docker://a6289d72cfd3e27bfd1c0cb50ce47a3d2cd150125de54e820f9870289bd1b9c2
Image: k8s.gcr.io/kube-controller-manager:v1.18.5
Image ID: docker-pullable://k8s.gcr.io/kube-controller-manager@sha256:74927dcab888f1ddec67bca3bb8fa5e682d5ed427a9de6ba77304555bf458865
Port: <none>
Host Port: <none>
Command:
kube-controller-manager
--allocate-node-cidrs=true
--authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
--authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
--bind-address=0.0.0.0
--client-ca-file=/etc/kubernetes/ssl/ca.crt
--cluster-cidr=10.233.64.0/18
--cluster-name=cluster.onuware-dev.loc
--cluster-signing-cert-file=/etc/kubernetes/ssl/ca.crt
--cluster-signing-key-file=/etc/kubernetes/ssl/ca.key
--configure-cloud-routes=false
--controllers=*,bootstrapsigner,tokencleaner
--kubeconfig=/etc/kubernetes/controller-manager.conf
--leader-elect=true
--node-cidr-mask-size=24
--node-monitor-grace-period=40s
--node-monitor-period=5s
--pod-eviction-timeout=5m0s
--profiling=False
--requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.crt
--root-ca-file=/etc/kubernetes/ssl/ca.crt
--service-account-private-key-file=/etc/kubernetes/ssl/sa.key
--service-cluster-ip-range=10.233.0.0/18
--terminated-pod-gc-threshold=12500
--use-service-account-credentials=true
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Wed, 15 Jul 2020 21:24:58 +0800
Finished: Wed, 15 Jul 2020 21:25:57 +0800
Ready: False
Restart Count: 51
Requests:
cpu: 200m
Liveness: http-get https://:10257/healthz delay=15s timeout=15s period=10s #success=1 #failure=8
Environment:
HTTP_PROXY: http://192.168.20.1:9091/
http_proxy: http://192.168.20.1:9091/
NO_PROXY: 192.168.30.118,fcos-master-dev,fcos-master-dev.cluster.onuware-dev.loc,192.168.30.239,fcos-slave1-dev,fcos-slave1-dev.cluster.onuware-dev.loc,192.168.30.250,fcos-slave2-dev,fcos-slave2-dev.cluster.onuware-dev.loc,127.0.0.1,localhost,10.233.0.0/18,10.233.64.0/18
no_proxy: 192.168.30.118,fcos-master-dev,fcos-master-dev.cluster.onuware-dev.loc,192.168.30.239,fcos-slave1-dev,fcos-slave1-dev.cluster.onuware-dev.loc,192.168.30.250,fcos-slave2-dev,fcos-slave2-dev.cluster.onuware-dev.loc,127.0.0.1,localhost,10.233.0.0/18,10.233.64.0/18
HTTPS_PROXY: http://192.168.20.1:9091/
https_proxy: http://192.168.20.1:9091/
Mounts:
/etc/kubernetes/controller-manager.conf from kubeconfig (ro)
/etc/kubernetes/ssl from k8s-certs (ro)
/etc/pki from etc-pki (ro)
/etc/ssl/certs from ca-certs (ro)
/usr/libexec/kubernetes/kubelet-plugins/volume/exec from flexvolume-dir (rw)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
ca-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType: DirectoryOrCreate
etc-pki:
Type: HostPath (bare host directory volume)
Path: /etc/pki
HostPathType: DirectoryOrCreate
flexvolume-dir:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
HostPathType: DirectoryOrCreate
k8s-certs:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/ssl
HostPathType: DirectoryOrCreate
kubeconfig:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/controller-manager.conf
HostPathType: FileOrCreate
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 46m (x8 over 78m) kubelet, fcos-master-dev Container image "k8s.gcr.io/kube-controller-manager:v1.18.5" already present on machine
Warning Unhealthy 40m (x14 over 75m) kubelet, fcos-master-dev Liveness probe failed: Get https://192.168.30.118:10257/healthz: dial tcp 192.168.30.118:10257: connect: connection refused
Warning BackOff 114s (x181 over 66m) kubelet, fcos-master-dev Back-off restarting failed container
[fedora@fcos-master-dev ~]$ sudo docker logs k8s_kube-scheduler_kube-scheduler-fcos-master-dev_kube-system_22e409dccec0667ed93153f76c5b2111_20
I0715 13:21:22.857794 1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0715 13:21:22.857894 1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0715 13:21:23.768574 1 serving.go:313] Generated self-signed cert in-memory
I0715 13:21:24.990596 1 registry.go:150] Registering EvenPodsSpread predicate and priority function
I0715 13:21:24.990625 1 registry.go:150] Registering EvenPodsSpread predicate and priority function
W0715 13:21:24.993078 1 authorization.go:47] Authorization is disabled
W0715 13:21:24.993097 1 authentication.go:40] Authentication is disabled
I0715 13:21:24.993134 1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0715 13:21:24.994572 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0715 13:21:24.994606 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0715 13:21:24.994624 1 shared_informer.go:223] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0715 13:21:24.994624 1 shared_informer.go:223] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0715 13:21:24.995000 1 secure_serving.go:178] Serving securely on [::]:10259
I0715 13:21:24.995041 1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0715 13:21:25.095000 1 shared_informer.go:230] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0715 13:21:25.095021 1 shared_informer.go:230] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0715 13:21:25.095227 1 leaderelection.go:242] attempting to acquire leader lease kube-system/kube-scheduler...
I0715 13:21:46.534795 1 leaderelection.go:252] successfully acquired lease kube-system/kube-scheduler
E0715 13:25:49.761395 1 leaderelection.go:356] Failed to update lock: resource name may not be empty
I0715 13:25:49.763133 1 leaderelection.go:277] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
F0715 13:25:49.763212 1 server.go:244] leaderelection lost
[fedora@fcos-master-dev ~]$
[fedora@fcos-master-dev ~]$ sudo docker logs k8s_kube-controller-manager_kube-controller-manager-fcos-master-dev_kube-system_1e3b725cfad100018d51ae8adef941d1_51
I0715 13:24:59.347185 1 serving.go:313] Generated self-signed cert in-memory
I0715 13:24:59.810406 1 controllermanager.go:161] Version: v1.18.5
I0715 13:24:59.811800 1 dynamic_cafile_content.go:167] Starting request-header::/etc/kubernetes/ssl/front-proxy-ca.crt
I0715 13:24:59.812047 1 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/ssl/ca.crt
I0715 13:24:59.812194 1 secure_serving.go:178] Serving securely on [::]:10257
I0715 13:24:59.812371 1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0715 13:24:59.812988 1 deprecated_insecure_serving.go:53] Serving insecurely on [::]:10252
I0715 13:24:59.813116 1 leaderelection.go:242] attempting to acquire leader lease kube-system/kube-controller-manager...
I0715 13:25:22.788036 1 leaderelection.go:252] successfully acquired lease kube-system/kube-controller-manager
I0715 13:25:22.790408 1 event.go:278] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-controller-manager", UID:"5ecf7d67-c833-41f5-8ce7-ebc317f624c4", APIVersion:"v1", ResourceVersion:"32728", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' fcos-master-dev.novalocal_e2924774-3c97-4259-9b5d-e91f1a1cd5a7 became leader
I0715 13:25:22.790479 1 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"kube-system", Name:"kube-controller-manager", UID:"04946580-bd36-4ed5-a78d-409baab838b5", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"32729", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' fcos-master-dev.novalocal_e2924774-3c97-4259-9b5d-e91f1a1cd5a7 became leader
I0715 13:25:23.718755 1 plugins.go:100] No cloud provider specified.
I0715 13:25:23.721213 1 shared_informer.go:223] Waiting for caches to sync for tokens
I0715 13:25:23.822412 1 shared_informer.go:230] Caches are synced for tokens
I0715 13:25:23.844064 1 controllermanager.go:533] Started "namespace"
I0715 13:25:23.844233 1 namespace_controller.go:200] Starting namespace controller
I0715 13:25:23.844246 1 shared_informer.go:223] Waiting for caches to sync for namespace
I0715 13:25:24.080937 1 controllermanager.go:533] Started "replicaset"
I0715 13:25:24.081125 1 replica_set.go:181] Starting replicaset controller
I0715 13:25:24.081140 1 shared_informer.go:223] Waiting for caches to sync for ReplicaSet
I0715 13:25:24.347794 1 controllermanager.go:533] Started "csrcleaner"
I0715 13:25:24.347880 1 cleaner.go:82] Starting CSR cleaner controller
I0715 13:25:24.357656 1 node_lifecycle_controller.go:384] Sending events to api server.
I0715 13:25:24.358009 1 taint_manager.go:163] Sending events to api server.
I0715 13:25:24.358134 1 node_lifecycle_controller.go:512] Controller will reconcile labels.
I0715 13:25:24.358221 1 controllermanager.go:533] Started "nodelifecycle"
I0715 13:25:24.358443 1 node_lifecycle_controller.go:546] Starting node controller
I0715 13:25:24.358468 1 shared_informer.go:223] Waiting for caches to sync for taint
I0715 13:25:24.369712 1 node_lifecycle_controller.go:78] Sending events to api server
E0715 13:25:24.369788 1 core.go:229] failed to start cloud node lifecycle controller: no cloud provider provided
W0715 13:25:24.369808 1 controllermanager.go:525] Skipping "cloud-node-lifecycle"
I0715 13:25:24.378549 1 controllermanager.go:533] Started "attachdetach"
I0715 13:25:24.378886 1 attach_detach_controller.go:348] Starting attach detach controller
I0715 13:25:24.378943 1 shared_informer.go:223] Waiting for caches to sync for attach detach
I0715 13:25:24.386571 1 controllermanager.go:533] Started "clusterrole-aggregation"
I0715 13:25:24.386735 1 clusterroleaggregation_controller.go:149] Starting ClusterRoleAggregator
I0715 13:25:24.386846 1 shared_informer.go:223] Waiting for caches to sync for ClusterRoleAggregator
I0715 13:25:24.393561 1 controllermanager.go:533] Started "daemonset"
I0715 13:25:24.393758 1 daemon_controller.go:257] Starting daemon sets controller
I0715 13:25:24.393986 1 shared_informer.go:223] Waiting for caches to sync for daemon sets
I0715 13:25:24.401043 1 controllermanager.go:533] Started "csrapproving"
I0715 13:25:24.401086 1 certificate_controller.go:119] Starting certificate controller "csrapproving"
I0715 13:25:24.401262 1 shared_informer.go:223] Waiting for caches to sync for certificate-csrapproving
I0715 13:25:24.408226 1 controllermanager.go:533] Started "persistentvolume-binder"
I0715 13:25:24.408378 1 pv_controller_base.go:295] Starting persistent volume controller
I0715 13:25:24.408391 1 shared_informer.go:223] Waiting for caches to sync for persistent volume
I0715 13:25:24.415913 1 controllermanager.go:533] Started "replicationcontroller"
I0715 13:25:24.416043 1 replica_set.go:181] Starting replicationcontroller controller
I0715 13:25:24.416060 1 shared_informer.go:223] Waiting for caches to sync for ReplicationController
I0715 13:25:24.955559 1 controllermanager.go:533] Started "garbagecollector"
I0715 13:25:24.956128 1 garbagecollector.go:133] Starting garbage collector controller
I0715 13:25:24.956232 1 shared_informer.go:223] Waiting for caches to sync for garbage collector
I0715 13:25:24.956332 1 graph_builder.go:282] GraphBuilder running
I0715 13:25:24.967173 1 controllermanager.go:533] Started "cronjob"
I0715 13:25:24.967242 1 cronjob_controller.go:97] Starting CronJob Manager
I0715 13:25:24.976608 1 controllermanager.go:533] Started "bootstrapsigner"
I0715 13:25:24.976678 1 shared_informer.go:223] Waiting for caches to sync for bootstrap_signer
E0715 13:25:24.984131 1 core.go:89] Failed to start service controller: WARNING: no cloud provider provided, services of type LoadBalancer will fail
W0715 13:25:24.984160 1 controllermanager.go:525] Skipping "service"
I0715 13:25:25.045966 1 controllermanager.go:533] Started "pv-protection"
I0715 13:25:25.046065 1 pv_protection_controller.go:83] Starting PV protection controller
I0715 13:25:25.046257 1 shared_informer.go:223] Waiting for caches to sync for PV protection
I0715 13:25:25.192772 1 controllermanager.go:533] Started "podgc"
I0715 13:25:25.192853 1 gc_controller.go:89] Starting GC controller
I0715 13:25:25.192871 1 shared_informer.go:223] Waiting for caches to sync for GC
I0715 13:25:25.343639 1 controllermanager.go:533] Started "csrsigning"
I0715 13:25:25.343733 1 certificate_controller.go:119] Starting certificate controller "csrsigning"
I0715 13:25:25.343744 1 shared_informer.go:223] Waiting for caches to sync for certificate-csrsigning
I0715 13:25:25.343767 1 dynamic_serving_content.go:130] Starting csr-controller::/etc/kubernetes/ssl/ca.crt::/etc/kubernetes/ssl/ca.key
I0715 13:25:25.492841 1 controllermanager.go:533] Started "tokencleaner"
I0715 13:25:25.492879 1 core.go:239] Will not configure cloud provider routes for allocate-node-cidrs: true, configure-cloud-routes: false.
W0715 13:25:25.492920 1 controllermanager.go:525] Skipping "route"
W0715 13:25:25.492934 1 controllermanager.go:525] Skipping "ttl-after-finished"
W0715 13:25:25.492953 1 controllermanager.go:525] Skipping "root-ca-cert-publisher"
I0715 13:25:25.493041 1 tokencleaner.go:118] Starting token cleaner controller
I0715 13:25:25.493057 1 shared_informer.go:223] Waiting for caches to sync for token_cleaner
I0715 13:25:25.493089 1 shared_informer.go:230] Caches are synced for token_cleaner
I0715 13:25:25.643329 1 controllermanager.go:533] Started "endpoint"
I0715 13:25:25.643360 1 endpoints_controller.go:182] Starting endpoint controller
I0715 13:25:25.643401 1 shared_informer.go:223] Waiting for caches to sync for endpoint
I0715 13:25:25.992701 1 request.go:621] Throttling request took 1.032016097s, request: GET:https://192.168.30.118:6443/apis/autoscaling/v1?timeout=32s
I0715 13:25:26.044487 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for events.events.k8s.io
I0715 13:25:26.044531 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for horizontalpodautoscalers.autoscaling
I0715 13:25:26.044551 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for ingresses.networking.k8s.io
I0715 13:25:26.044576 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for controllerrevisions.apps
I0715 13:25:26.044595 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for deployments.apps
I0715 13:25:26.044900 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for roles.rbac.authorization.k8s.io
I0715 13:25:26.044930 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for leases.coordination.k8s.io
I0715 13:25:26.045108 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for endpoints
I0715 13:25:26.045175 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for jobs.batch
I0715 13:25:26.045313 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for statefulsets.apps
I0715 13:25:26.045358 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for limitranges
I0715 13:25:26.045374 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for replicasets.apps
I0715 13:25:26.045427 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for cronjobs.batch
I0715 13:25:26.045454 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for networkpolicies.networking.k8s.io
I0715 13:25:26.045551 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for poddisruptionbudgets.policy
I0715 13:25:26.045693 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for serviceaccounts
I0715 13:25:26.045743 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for ingresses.extensions
I0715 13:25:26.045847 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for daemonsets.apps
I0715 13:25:26.045897 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for rolebindings.rbac.authorization.k8s.io
I0715 13:25:26.045987 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for podtemplates
I0715 13:25:26.046082 1 resource_quota_monitor.go:228] QuotaMonitor created object count evaluator for endpointslices.discovery.k8s.io
I0715 13:25:26.046108 1 controllermanager.go:533] Started "resourcequota"
I0715 13:25:26.046193 1 resource_quota_controller.go:272] Starting resource quota controller
I0715 13:25:26.046297 1 shared_informer.go:223] Waiting for caches to sync for resource quota
I0715 13:25:26.046340 1 resource_quota_monitor.go:303] QuotaMonitor running
I0715 13:25:26.393302 1 controllermanager.go:533] Started "horizontalpodautoscaling"
I0715 13:25:26.393557 1 horizontal.go:169] Starting HPA controller
I0715 13:25:26.393571 1 shared_informer.go:223] Waiting for caches to sync for HPA
I0715 13:25:26.692426 1 controllermanager.go:533] Started "disruption"
I0715 13:25:26.692513 1 disruption.go:331] Starting disruption controller
I0715 13:25:26.692541 1 shared_informer.go:223] Waiting for caches to sync for disruption
I0715 13:25:26.842858 1 controllermanager.go:533] Started "statefulset"
I0715 13:25:26.843027 1 stateful_set.go:146] Starting stateful set controller
I0715 13:25:26.843046 1 shared_informer.go:223] Waiting for caches to sync for stateful set
I0715 13:25:26.993160 1 controllermanager.go:533] Started "endpointslice"
I0715 13:25:26.993286 1 endpointslice_controller.go:213] Starting endpoint slice controller
I0715 13:25:26.993318 1 shared_informer.go:223] Waiting for caches to sync for endpoint_slice
I0715 13:25:27.143635 1 controllermanager.go:533] Started "job"
I0715 13:25:27.143705 1 job_controller.go:144] Starting job controller
I0715 13:25:27.143713 1 shared_informer.go:223] Waiting for caches to sync for job
I0715 13:25:27.293272 1 controllermanager.go:533] Started "ttl"
I0715 13:25:27.293386 1 ttl_controller.go:118] Starting TTL controller
I0715 13:25:27.293398 1 shared_informer.go:223] Waiting for caches to sync for TTL
I0715 13:25:27.442916 1 node_ipam_controller.go:94] Sending events to api server.
I0715 13:25:37.752535 1 range_allocator.go:82] Sending events to api server.
I0715 13:25:37.752782 1 range_allocator.go:116] No Secondary Service CIDR provided. Skipping filtering out secondary service addresses.
I0715 13:25:37.753109 1 controllermanager.go:533] Started "nodeipam"
I0715 13:25:37.753495 1 node_ipam_controller.go:162] Starting ipam controller
I0715 13:25:37.753530 1 shared_informer.go:223] Waiting for caches to sync for node
I0715 13:25:37.773147 1 controllermanager.go:533] Started "serviceaccount"
I0715 13:25:37.773474 1 serviceaccounts_controller.go:117] Starting service account controller
I0715 13:25:37.773490 1 shared_informer.go:223] Waiting for caches to sync for service account
I0715 13:25:39.682830 1 controllermanager.go:533] Started "deployment"
I0715 13:25:39.683163 1 deployment_controller.go:153] Starting deployment controller
I0715 13:25:39.683244 1 shared_informer.go:223] Waiting for caches to sync for deployment
I0715 13:25:43.730355 1 controllermanager.go:533] Started "persistentvolume-expander"
I0715 13:25:43.730579 1 expand_controller.go:319] Starting expand controller
I0715 13:25:43.730608 1 shared_informer.go:223] Waiting for caches to sync for expand
I0715 13:25:45.141496 1 controllermanager.go:533] Started "pvc-protection"
I0715 13:25:45.141534 1 pvc_protection_controller.go:101] Starting PVC protection controller
I0715 13:25:45.141551 1 shared_informer.go:223] Waiting for caches to sync for PVC protection
I0715 13:25:45.144940 1 shared_informer.go:223] Waiting for caches to sync for garbage collector
I0715 13:25:45.146805 1 shared_informer.go:223] Waiting for caches to sync for resource quota
W0715 13:25:45.168537 1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="fcos-master-dev" does not exist
W0715 13:25:45.168760 1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="fcos-slave1-dev" does not exist
W0715 13:25:45.170562 1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="fcos-slave2-dev" does not exist
I0715 13:25:45.173662 1 shared_informer.go:230] Caches are synced for service account
I0715 13:25:45.176900 1 shared_informer.go:230] Caches are synced for bootstrap_signer
I0715 13:25:45.179162 1 shared_informer.go:230] Caches are synced for attach detach
I0715 13:25:45.187102 1 shared_informer.go:230] Caches are synced for ClusterRoleAggregator
I0715 13:25:45.192996 1 shared_informer.go:230] Caches are synced for GC
I0715 13:25:45.193476 1 shared_informer.go:230] Caches are synced for endpoint_slice
I0715 13:25:45.194527 1 shared_informer.go:230] Caches are synced for HPA
I0715 13:25:45.194567 1 shared_informer.go:230] Caches are synced for TTL
I0715 13:25:45.194768 1 shared_informer.go:230] Caches are synced for daemon sets
I0715 13:25:45.207187 1 shared_informer.go:230] Caches are synced for certificate-csrapproving
I0715 13:25:45.216229 1 shared_informer.go:230] Caches are synced for ReplicationController
I0715 13:25:45.241793 1 shared_informer.go:230] Caches are synced for PVC protection
I0715 13:25:45.243355 1 shared_informer.go:230] Caches are synced for stateful set
I0715 13:25:45.243875 1 shared_informer.go:230] Caches are synced for certificate-csrsigning
I0715 13:25:45.244362 1 shared_informer.go:230] Caches are synced for namespace
I0715 13:25:45.244444 1 shared_informer.go:230] Caches are synced for job
I0715 13:25:45.246481 1 shared_informer.go:230] Caches are synced for PV protection
I0715 13:25:45.253727 1 shared_informer.go:230] Caches are synced for node
I0715 13:25:45.253771 1 range_allocator.go:172] Starting range CIDR allocator
I0715 13:25:45.253795 1 shared_informer.go:223] Waiting for caches to sync for cidrallocator
I0715 13:25:45.253801 1 shared_informer.go:230] Caches are synced for cidrallocator
I0715 13:25:45.258720 1 shared_informer.go:230] Caches are synced for taint
I0715 13:25:45.258944 1 node_lifecycle_controller.go:1433] Initializing eviction metric for zone:
I0715 13:25:45.258902 1 taint_manager.go:187] Starting NoExecuteTaintManager
W0715 13:25:45.259154 1 node_lifecycle_controller.go:1048] Missing timestamp for Node fcos-slave1-dev. Assuming now as a timestamp.
W0715 13:25:45.259263 1 node_lifecycle_controller.go:1048] Missing timestamp for Node fcos-slave2-dev. Assuming now as a timestamp.
I0715 13:25:45.259215 1 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"fcos-slave1-dev", UID:"bb924c54-0697-432f-a9f3-ea51dbc8bf97", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node fcos-slave1-dev event: Registered Node fcos-slave1-dev in Controller
I0715 13:25:45.259342 1 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"fcos-slave2-dev", UID:"8727b383-f0f8-4c2f-a2ca-2aac710085a8", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node fcos-slave2-dev event: Registered Node fcos-slave2-dev in Controller
I0715 13:25:45.259355 1 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"fcos-master-dev", UID:"dc3d11d9-b56d-4787-9590-eaa61f8cc253", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node fcos-master-dev event: Registered Node fcos-master-dev in Controller
W0715 13:25:45.259324 1 node_lifecycle_controller.go:1048] Missing timestamp for Node fcos-master-dev. Assuming now as a timestamp.
I0715 13:25:45.259472 1 node_lifecycle_controller.go:1249] Controller detected that zone is now in state Normal.
I0715 13:25:45.343598 1 shared_informer.go:230] Caches are synced for endpoint
I0715 13:25:45.382536 1 shared_informer.go:230] Caches are synced for ReplicaSet
I0715 13:25:45.583482 1 shared_informer.go:230] Caches are synced for deployment
I0715 13:25:45.592706 1 shared_informer.go:230] Caches are synced for disruption
I0715 13:25:45.592741 1 disruption.go:339] Sending events to api server.
I0715 13:25:45.808565 1 shared_informer.go:230] Caches are synced for persistent volume
I0715 13:25:45.830840 1 shared_informer.go:230] Caches are synced for expand
I0715 13:25:46.745143 1 shared_informer.go:230] Caches are synced for garbage collector
I0715 13:25:46.746586 1 shared_informer.go:230] Caches are synced for resource quota
I0715 13:25:46.746984 1 shared_informer.go:230] Caches are synced for resource quota
I0715 13:25:46.756490 1 shared_informer.go:230] Caches are synced for garbage collector
I0715 13:25:46.756533 1 garbagecollector.go:142] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
W0715 13:25:55.198189 1 endpointslice_controller.go:260] Error syncing endpoint slices for service "kube-system/coredns", retrying. Error: Error creating EndpointSlice for Service kube-system/coredns: Internal error occurred: resource quota evaluates timeout
W0715 13:25:55.198435 1 endpointslice_controller.go:260] Error syncing endpoint slices for service "kube-system/dashboard-metrics-scraper", retrying. Error: Error creating EndpointSlice for Service kube-system/dashboard-metrics-scraper: Internal error occurred: resource quota evaluates timeout
W0715 13:25:55.198774 1 endpointslice_controller.go:260] Error syncing endpoint slices for service "kube-system/kubernetes-dashboard", retrying. Error: Error creating EndpointSlice for Service kube-system/kubernetes-dashboard: Internal error occurred: resource quota evaluates timeout
W0715 13:25:55.199017 1 endpointslice_controller.go:260] Error syncing endpoint slices for service "gitlab-managed-apps/tiller-deploy", retrying. Error: Error creating EndpointSlice for Service gitlab-managed-apps/tiller-deploy: Internal error occurred: resource quota evaluates timeout
I0715 13:25:55.199215 1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"kube-system", Name:"dashboard-metrics-scraper", UID:"53a1397a-0a38-4379-a3da-349327eace3f", APIVersion:"v1", ResourceVersion:"2285", FieldPath:""}): type: 'Warning' reason: 'FailedToUpdateEndpointSlices' Error updating Endpoint Slices for Service kube-system/dashboard-metrics-scraper: Error creating EndpointSlice for Service kube-system/dashboard-metrics-scraper: Internal error occurred: resource quota evaluates timeout
I0715 13:25:55.199244 1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"kube-system", Name:"coredns", UID:"946a067c-0f31-4097-b010-9f59bc5c9ac1", APIVersion:"v1", ResourceVersion:"2244", FieldPath:""}): type: 'Warning' reason: 'FailedToUpdateEndpointSlices' Error updating Endpoint Slices for Service kube-system/coredns: Error creating EndpointSlice for Service kube-system/coredns: Internal error occurred: resource quota evaluates timeout
I0715 13:25:55.199256 1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"kube-system", Name:"kubernetes-dashboard", UID:"50017748-677e-4099-a35d-5049e2360da2", APIVersion:"v1", ResourceVersion:"2282", FieldPath:""}): type: 'Warning' reason: 'FailedToUpdateEndpointSlices' Error updating Endpoint Slices for Service kube-system/kubernetes-dashboard: Error creating EndpointSlice for Service kube-system/kubernetes-dashboard: Internal error occurred: resource quota evaluates timeout
I0715 13:25:55.199272 1 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"gitlab-managed-apps", Name:"tiller-deploy", UID:"97e0de27-cb08-4fc7-a508-f24336a57fe0", APIVersion:"v1", ResourceVersion:"21317", FieldPath:""}): type: 'Warning' reason: 'FailedToUpdateEndpointSlices' Error updating Endpoint Slices for Service gitlab-managed-apps/tiller-deploy: Error creating EndpointSlice for Service gitlab-managed-apps/tiller-deploy: Internal error occurred: resource quota evaluates timeout
E0715 13:25:57.133041 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-controller-manager: Get https://192.168.30.118:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded
I0715 13:25:57.133127 1 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F0715 13:25:57.133220 1 controllermanager.go:279] leaderelection lost
Endpoints:
$ kubectl get endpoints --all-namespaces
NAMESPACE NAME ENDPOINTS AGE
default kubernetes 192.168.30.118:6443 9h
gitlab-managed-apps tiller-deploy 10.233.79.40:44134 3h20m
kube-system coredns 10.233.68.0:53,10.233.79.44:53,10.233.68.0:9153 + 3 more... 9h
kube-system dashboard-metrics-scraper 10.233.79.41:8000 9h
kube-system kube-controller-manager <none> 9h
kube-system kube-scheduler <none> 9h
kube-system kubernetes-dashboard 10.233.79.43:8443 9h
I'm not sure why kube-controller-manager and kube-scheduler are crashing.
I'm not sure this can be pinned on kubespray; it looks like there is already a Kubernetes issue about this weird one:
https://github.com/kubernetes/kubernetes/issues/74340
I finally found the reason: it is a network issue, which causes the leader election lease to be lost.
I0716 11:08:05.725721 1 leaderelection.go:287] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0716 11:08:05.726270 1 controllermanager.go:279] leaderelection lost
According to the Kubernetes documentation:
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
--leader-elect-lease-duration duration Default: 15s
--leader-elect-renew-deadline duration Default: 10s
After updating /etc/kubernetes/manifests/kube-controller-manager.yaml with larger values for --leader-elect-lease-duration and --leader-elect-renew-deadline, the problem was fixed.
So I think kubespray needs the ability to tune these values in the k8s-cluster.yaml file; a possible shape is sketched below.
If you care to submit a PR, this would be very welcome.
/close
@floryut: Closing this issue.
In response to this:
/close