Kubespray: K8s system pods fail because liveness checks do not work

Created on 6 Aug 2020  ·  7 Comments  ·  Source: kubernetes-sigs/kubespray

What happened:
kube-scheduler and kube-controller-manager pods fail because their liveness checks do not work. The liveness checks fail because the insecure healthz endpoints for these pods were removed in Kubernetes 1.16.13 (http://127.0.0.1:10251/healthz for kube-scheduler and http://127.0.0.1:10252/healthz for kube-controller-manager).
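For reference, a quick way to see the failing endpoints from a control-plane node is to probe the two ports named above (a sketch; assumes curl is available locally):

```shell
# Probe the insecure healthz ports mentioned above. When the components run
# with --port=0, both connections are refused, so the liveness probes fail.
check() {
  if curl -s -o /dev/null --max-time 2 "http://127.0.0.1:$1/healthz"; then
    echo "port $1: healthz reachable"
  else
    echo "port $1: connection refused or unreachable"
  fi
}

SCHED_STATUS=$(check 10251)   # kube-scheduler insecure healthz port
CM_STATUS=$(check 10252)      # kube-controller-manager insecure healthz port
echo "$SCHED_STATUS"
echo "$CM_STATUS"
```

On an affected node both lines report the connection as refused, matching the `kubectl get componentstatuses` output later in this thread.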

What you expected to happen:
I expect the k8s pod manifests not to contain liveness checks if the containers don't expose endpoints for them.

How to reproduce it (as minimally and precisely as possible):
Deploy k8s using kubespray release-2.12 (https://github.com/kubernetes-sigs/kubespray/tree/release-2.12) with default k8s version.

Anything else we need to know?:

Environment:

  • Cloud provider or hardware configuration:
    AWS
  • OS cat /etc/os-release:
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Version of Ansible (ansible --version):
    ansible 2.7.16
    config file = None
    configured module search path = ['/home/centos/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    ansible python module location = /usr/local/lib/python3.6/site-packages/ansible
    executable location = /usr/local/bin/ansible
    python version = 3.6.8 (default, Apr 2 2020, 13:34:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

  • Version of Python (python --version):
    [centos@ip-172-31-15-227 ~]$ python --version
    Python 2.7.5

Kubespray version (commit) (git rev-parse --short HEAD):
2acc5a7

Network plugin used:
Tungsten Fabric, Calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
all:
  hosts:
    node1:
      ansible_host: 172.31.15.227
      ip: 172.31.15.227
      access_ip: 172.31.15.227
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node1:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Command used to invoke ansible:
ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root cluster.yml -e kube_pods_subnet=10.32.0.0/12 -e kube_service_addresses=10.96.0.0/12

kind/bug


All 7 comments

Additionally, here is the report about this bug in the k8s repo. They asked me to report it here instead: https://github.com/kubernetes/kubernetes/issues/93746

❯ kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused   
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused   
etcd-1               Healthy     {"health":"true"}                                                                           
etcd-0               Healthy     {"health":"true"}                                                                           
etcd-2               Healthy     {"health":"true"}         

I have the same issue. A "workaround" is to delete the port flag from the Kubernetes manifests, but I would be happy to have a better fix. It happened after I upgraded to Kubernetes 1.17.9 and release 2.13 a few days ago.

sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml
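After running the sed commands, the kubelet should pick up the edited static pod manifests automatically. A quick sanity check (a sketch, using the same paths as the sed commands above) is to grep the manifests for the flag and the probe ports:

```shell
# Confirm the --port=0 flag is gone and show what the liveness probes target.
# Paths are the same ones used in the sed commands above.
RESULT=$(
  for f in /etc/kubernetes/manifests/kube-scheduler.yaml \
           /etc/kubernetes/manifests/kube-controller-manager.yaml; do
    echo "== $f =="
    grep -nE -- '--port=0|healthz|port:' "$f" 2>/dev/null \
      || echo "no match or file missing (not a control-plane node?)"
  done
)
echo "$RESULT"
```

If `--port=0` still appears, the edit did not take; if the files are missing, you are not on a control-plane node.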

Same issue here after upgrading from v1.18.5 to v1.18.6.

Edit: Also reproduced on a clean install (v2.14.0),
Server Version: v1.18.8, on Debian 10.
Output:

$ kubectl get componentstatus
NAME                 STATUS      MESSAGE                                                                                     ERROR
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
etcd-0               Healthy     {"health":"true"}                                                 
etcd-1               Healthy     {"health":"true"}                                                 
etcd-2               Healthy     {"health":"true"}  

Cluster seems to work fine, though.

Hi, I'm having the same issue on the master; this worked for me:
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-controller-manager.yaml
sed -i '/- --port=0/d' /etc/kubernetes/manifests/kube-scheduler.yaml

But when running cluster.yml again, these changes are not persisted.
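Since cluster.yml regenerates the manifests, one way to make the workaround repeatable is a small helper playbook (my own sketch, not part of kubespray) that drops the flag idempotently, so it can be re-run after every cluster.yml run:

```yaml
# Hypothetical helper playbook: removes the --port=0 flag from the static
# control-plane manifests. Safe to re-run after cluster.yml overwrites them.
- hosts: kube-master
  gather_facts: false
  become: true
  tasks:
    - name: Drop --port=0 from kube-scheduler and kube-controller-manager
      lineinfile:
        path: "/etc/kubernetes/manifests/{{ item }}"
        regexp: '^\s*- --port=0\s*$'
        state: absent
      loop:
        - kube-scheduler.yaml
        - kube-controller-manager.yaml
```

`lineinfile` with `state: absent` is equivalent to the sed commands above, and the kubelet restarts the static pods when the manifests change.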

Seems to be fixed in Kubernetes 1.16.14: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.16.md#changelog-since-v11613

Fixed a regression in kubeadm manifests for kube-scheduler and kube-controller-manager which caused continuous restarts because of failing health checks (#93208, @SataQiu) [SIG Cluster Lifecycle]

I will create a PR to use the fixed 1.16.14 version very soon.
Until then, everybody should also be able to fix the liveness probe itself instead of re-enabling the insecure liveness-check ports, e.g. with a basic playbook like:

- hosts: kube-master
  gather_facts: false
  tasks:
  - name: kube-controller-manager - Use secure port for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-controller-manager.yaml
      regexp: '10252'
      replace: '10257'
  - name: kube-controller-manager - Use HTTPS for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-controller-manager.yaml
      regexp: 'scheme: HTTP$'
      replace: 'scheme: HTTPS'
  - name: Wait a few seconds as too fast updates don't tear down the previous version correctly
    pause:
      seconds: 10
  - name: kube-scheduler - Use secure port for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-scheduler.yaml
      regexp: '10251'
      replace: '10259'
  - name: kube-scheduler - Use HTTPS for liveness probe
    replace:
      path: /etc/kubernetes/manifests/kube-scheduler.yaml
      regexp: 'scheme: HTTP$'
      replace: 'scheme: HTTPS'

Thanks!
The workaround works for me.
