Kubespray: K8s system pods fail because liveness checks do not work

Created on 6 Aug 2020  ·  7 Comments  ·  Source: kubernetes-sigs/kubespray

What happened:
kube-scheduler and kube-controller-manager pods fail because their liveness checks do not work. The liveness checks fail because the insecure healthz endpoints for these pods were removed in Kubernetes 1.16.13 (http://127.0.0.1:10251/healthz for kube-scheduler and http://127.0.0.1:10252/healthz for kube-controller-manager).
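For reference, a quick way to see the failing endpoints from a control-plane node is to probe the two ports named above (a sketch; assumes curl is available locally):

```shell
# Probe the insecure healthz ports mentioned above. When the components run
# with --port=0, both connections are refused, so the liveness probes fail.
check() {
  if curl -s -o /dev/null --max-time 2 "http://127.0.0.1:$1/healthz"; then
    echo "port $1: healthz reachable"
  else
    echo "port $1: connection refused or unreachable"
  fi
}

SCHED_STATUS=$(check 10251)   # kube-scheduler insecure healthz port
CM_STATUS=$(check 10252)      # kube-controller-manager insecure healthz port
echo "$SCHED_STATUS"
echo "$CM_STATUS"
```

On an affected node both lines report the connection as refused, matching the `kubectl get componentstatuses` output later in this thread.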

What you expected to happen:
I expect the k8s pod manifests not to contain liveness checks if the containers don't expose endpoints for them.

How to reproduce it (as minimally and precisely as possible):
Deploy k8s using kubespray release-2.12 (https://github.com/kubernetes-sigs/kubespray/tree/release-2.12) with default k8s version.

Anything else we need to know?:

Environment:

  • Cloud provider or hardware configuration:
    AWS
  • OS cat /etc/os-release:
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Version of Ansible (ansible --version):
    ansible 2.7.16
    config file = None
    configured module search path = ['/home/centos/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
    executable location = /usr/local/bin/ansible
    python version = 3.6.8 (default, Apr 2 2020, 13:34:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

  • Version of Python (python --version):
    [centos@ip-172-31-15-227 ~]$ python --version
    Python 2.7.5

Kubespray version (commit) (git rev-parse --short HEAD):
2acc5a7

Network plugin used:
Tungsten Fabric, Calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
all:
  hosts:
    node1:
      ansible_host: 172.31.15.227
      ip: 172.31.15.227
      access_ip: 172.31.15.227
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node1:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Command used to invoke ansible:
ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root cluster.yml -e kube_pods_subnet=10.32.0.0/12 -e kube_service_addresses=10.96.0.0/12

kind/bug


All 7 comments

Additionally, here is the report about this bug in the k8s repo. They asked me to report it here instead: https://github.com/kubernetes/kubernetes/issues/93746

❯ kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused   
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused   
etcd-1               Healthy     {"health":"true"}                                                                           
etcd-0               Healthy     {"health":"true"}                                                                           
etcd-2               Healthy     {"health":"true"}         

I have the same issue. A "workaround" is to delete the port flag from the Kubernetes manifests, but I would be happy to have a better fix. It happened after I upgraded to Kubernetes 1.17.9 and release 2.13 a few days ago.

sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml
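After running the sed commands, the kubelet should pick up the edited static pod manifests automatically. A quick sanity check (a sketch, using the same paths as the sed commands above) is to grep the manifests for the flag and the probe ports:

```shell
# Confirm the --port=0 flag is gone and show what the liveness probes target.
# Paths are the same ones used in the sed commands above.
RESULT=$(
  for f in /etc/kubernetes/manifests/kube-scheduler.yaml \
           /etc/kubernetes/manifests/kube-controller-manager.yaml; do
    echo "== $f =="
    grep -nE -- '--port=0|healthz|port:' "$f" 2>/dev/null \
      || echo "no match or file missing (not a control-plane node?)"
  done
)
echo "$RESULT"
```

If `--port=0` still appears, the edit did not take; if the files are missing, you are not on a control-plane node.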

Same issue here after upgrading from v1.18.5 to v1.18.6.

Edit: Also reproduced on a clean install (v2.14.0),
Server Version: v1.18.8, on Debian 10.
Output:

$ kubectl get componentstatus
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
etcd-0               Healthy     {"health":"true"}                                                 
etcd-1               Healthy     {"health":"true"}                                                 
etcd-2               Healthy     {"health":"true"}  

Cluster seems to work fine, though.

Hi, I'm having the same issue on the master; this worked for me:
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml

But when running cluster.yml again, these changes are not persisted.
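Since cluster.yml regenerates the manifests, one way to make the workaround repeatable is a small helper playbook (my own sketch, not part of kubespray) that drops the flag idempotently, so it can be re-run after every cluster.yml run:

```yaml
# Hypothetical helper playbook: removes the --port=0 flag from the static
# control-plane manifests. Safe to re-run after cluster.yml overwrites them.
- hosts: kube-master
  gather_facts: false
  become: true
  tasks:
    - name: Drop --port=0 from kube-scheduler and kube-controller-manager
      lineinfile:
        path: "/etc/kubernetes/manifests/{{ item }}"
        regexp: '^\s*- --port=0\s*$'
        state: absent
      loop:
        - kube-scheduler.yaml
        - kube-controller-manager.yaml
```

`lineinfile` with `state: absent` is equivalent to the sed commands above, and the kubelet restarts the static pods when the manifests change.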

Seems to be fixed in Kubernetes 1.16.14: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.16.md#changelog-since-v11613

Fixed a regression in kubeadm manifests for kube-scheduler and kube-controller-manager which caused continuous restarts because of failing health checks (#93208, @SataQiu) [SIG Cluster Lifecycle]

I will create a PR to use the fixed 1.16.14 version very soon.
Until then, everybody should also be able to fix the liveness probe itself instead of re-enabling the insecure liveness-check ports, e.g. with a basic playbook like:

- hosts: kube-master
  gather_facts: false
  tasks:
  - name: kube-controller-manager - Use secure port for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-controller-manager.yaml
      regexp: '10252'
      replace: '10257'
  - name: kube-controller-manager - Use HTTPS for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-controller-manager.yaml
      regexp: 'scheme: HTTP$'
      replace: 'scheme: HTTPS'
  - name: Wait a few seconds as too fast updates don't tear down the previous version correctly
    pause:
      seconds: 10
  - name: kube-scheduler - Use secure port for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-scheduler.yaml
      regexp: '10251'
      replace: '10259'
  - name: kube-scheduler - Use HTTPS for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-scheduler.yaml
      regexp: 'scheme: HTTP$'
      replace: 'scheme: HTTPS'

Thanks!
The workaround works for me.
