Hello
I found that after draining a worker node with:
kubectl drain sdbit-k8s-worker1 --ignore-daemonsets=true
the node is correctly drained.
After a reboot, if the node is uncordoned with:
kubectl uncordon sdbit-k8s-worker1
the node becomes NotReady and never becomes Ready again.
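For completeness, this is roughly how we checked the node state at each step (a minimal sketch using our node name):
# right after the drain the node should show Ready,SchedulingDisabled
kubectl get node sdbit-k8s-worker1
# after the reboot and uncordon, the conditions explain the NotReady state
kubectl describe node sdbit-k8s-worker1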
All the Docker containers are missing (docker ps returns no running containers) and kubelet doesn't start because it complains that it can't reach the API server:
kubelet_node_status.go:94] Unable to register node "sdbit-k8s-worker1" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: con
kubelet.go:2267] node "sdbit-k8s-worker1" not found
kubelet.go:2267] node "sdbit-k8s-worker1" not found
reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.RuntimeClass: Get https://localhost:6443/apis/node.k8s.io/v1beta1/runtime
kubelet.go:2267] node "sdbit-k8s-worker1" not found
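These messages come from the kubelet systemd unit (which is how Kubespray installs kubelet), so they can be followed on the worker with:
# stream kubelet logs to watch the registration attempts
sudo journalctl -u kubelet -f
# check whether the unit is up or crash-looping
sudo systemctl status kubelet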
The nginx pod's events show:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 62m (x12 over 64m) kubelet, sdbit-k8s-worker1 Failed create pod sandbox: open /run/systemd/resolve/resolv.conf: no such file or directory
Normal SandboxChanged 61m (x13 over 64m) kubelet, sdbit-k8s-worker1 Pod sandbox changed, it will be killed and re-created.
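The FailedCreatePodSandBox message points at a missing /run/systemd/resolve/resolv.conf. A quick sanity check on the worker (assuming a systemd-based distro such as Ubuntu 18.04):
# the sandbox creation fails because this file is gone
ls -l /run/systemd/resolve/resolv.conf
# systemd-resolved is the service that maintains it
systemctl status systemd-resolved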
It looks like kubelet cannot register because the nginx proxy is unable to forward its traffic to the API server.
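Kubespray deploys a local nginx proxy on workers so that kubelet reaches the API servers via localhost:6443, which matches the Post https://localhost:6443 errors above. Assuming the Docker runtime and the default container name, this can be checked on the worker with:
# the local API proxy should be running on the worker
docker ps -a | grep nginx-proxy
# if it is up, the API server should answer locally (may require credentials)
curl -k https://localhost:6443/healthz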
Environment:
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 4.15.0-88-generic x86_64
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Version of Ansible (ansible --version):
ansible 2.7.12
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/kubespray/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
executable location = /usr/local/bin/ansible
python version = 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
Version of Python (python --version):
Python 2.7.17
Kubespray version (commit) (git rev-parse --short HEAD):
34e883e6
Network plugin used:
Calico
Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
https://gist.github.com/irizzant/53f34f02e8b857f1209bab102a67c565
Command used to invoke ansible:
ansible-playbook -b -v -i inventory/sample/hosts.yaml upgrade-cluster.yml
Output of ansible run:
https://gist.github.com/irizzant/9bfa9aec42fb1f85c4003a42b50d7c11
Anything else we need to know:
This is caused by the systemd-resolved service, which was disabled by mistake on our nodes.
This service is needed for the /etc/resolv.conf file to exist, which in turn is needed by kubelet.
Kubelet refused to start on our nodes after the reboot, and its log showed that the resolv.conf file was missing.
Since kubelet didn't restart the required Docker containers for Calico, nginx and the like didn't start either, and the node could not register with the cluster.
After fixing the network configuration and re-enabling the systemd-resolved service, everything went back to normal.
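For anyone hitting the same issue, these are roughly the commands we used to recover the node (unit and file names as on Ubuntu 18.04):
$ sudo systemctl enable --now systemd-resolved
# the resolv.conf file maintained by systemd-resolved should now exist
$ ls -l /run/systemd/resolve/resolv.conf
$ sudo systemctl restart kubelet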
Please execute the following commands after rebooting, or if the node drops out of the cluster:
$ sudo kubeadm reset
$ sudo swapoff -a
$ sudo kubeadm join YourNodeIPAddress --token <token> --discovery-token-ca-cert-hash \
sha256:...
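If the original bootstrap token has expired, a fresh join command (token plus CA cert hash) can be printed on a control-plane node with:
$ sudo kubeadm token create --print-join-command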