Hello
I found that after draining a worker node with:
kubectl drain sdbit-k8s-worker1 --ignore-daemonsets=true
the node is correctly drained.
After a reboot, if the node is uncordoned with:
kubectl uncordon sdbit-k8s-worker1
the node becomes NotReady and never becomes Ready again.
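For completeness, this is roughly how we checked the node state at each step (a minimal sketch using our node name):
# right after the drain the node should show Ready,SchedulingDisabled
kubectl get node sdbit-k8s-worker1
# after the reboot and uncordon, the conditions explain the NotReady state
kubectl describe node sdbit-k8s-worker1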
All the Docker containers are missing (docker ps returns no running containers) and kubelet doesn't start because it complains that it can't reach the API server:
kubelet_node_status.go:94] Unable to register node "sdbit-k8s-worker1" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: con
kubelet.go:2267] node "sdbit-k8s-worker1" not found
kubelet.go:2267] node "sdbit-k8s-worker1" not found
reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.RuntimeClass: Get https://localhost:6443/apis/node.k8s.io/v1beta1/runtime
kubelet.go:2267] node "sdbit-k8s-worker1" not found
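These messages come from the kubelet systemd unit (which is how Kubespray installs kubelet), so they can be followed on the worker with:
# stream kubelet logs to watch the registration attempts
sudo journalctl -u kubelet -f
# check whether the unit is up or crash-looping
sudo systemctl status kubelet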
The nginx pod's events show:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 62m (x12 over 64m) kubelet, sdbit-k8s-worker1 Failed create pod sandbox: open /run/systemd/resolve/resolv.conf: no such file or directory
Normal SandboxChanged 61m (x13 over 64m) kubelet, sdbit-k8s-worker1 Pod sandbox changed, it will be killed and re-created.
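The FailedCreatePodSandBox message points at a missing /run/systemd/resolve/resolv.conf. A quick sanity check on the worker (assuming a systemd-based distro such as Ubuntu 18.04):
# the sandbox creation fails because this file is gone
ls -l /run/systemd/resolve/resolv.conf
# systemd-resolved is the service that maintains it
systemctl status systemd-resolved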
It looks like kubelet cannot register because the nginx proxy is unable to forward its traffic to the API server.
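Kubespray deploys a local nginx proxy on workers so that kubelet reaches the API servers via localhost:6443, which matches the Post https://localhost:6443 errors above. Assuming the Docker runtime and the default container name, this can be checked on the worker with:
# the local API proxy should be running on the worker
docker ps -a | grep nginx-proxy
# if it is up, the API server should answer locally (may require credentials)
curl -k https://localhost:6443/healthz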
Environment:
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 4.15.0-88-generic x86_64
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Version of Ansible (ansible --version):
ansible 2.7.12
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/kubespray/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
executable location = /usr/local/bin/ansible
python version = 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
Version of Python (python --version):
Python 2.7.17
Kubespray version (commit) (git rev-parse --short HEAD):
34e883e6
Network plugin used:
Calico
Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
https://gist.github.com/irizzant/53f34f02e8b857f1209bab102a67c565
Command used to invoke ansible:
ansible-playbook -b -v -i inventory/sample/hosts.yaml upgrade-cluster.yml
Output of ansible run:
https://gist.github.com/irizzant/9bfa9aec42fb1f85c4003a42b50d7c11
Anything else we need to know:
This is caused by the systemd-resolved service, which was disabled by mistake on our nodes.
This service is needed for the /etc/resolv.conf file to exist, which in turn is needed by kubelet.
Kubelet refused to start on our nodes after the reboot, and its log showed that the resolv.conf file was missing.
Since kubelet didn't restart the required Docker containers for Calico, nginx and the like didn't start either, and the node could not register with the cluster.
After fixing the network configuration and re-enabling the systemd-resolved service, everything went back to normal.
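For anyone hitting the same issue, these are roughly the commands we used to recover the node (unit and file names as on Ubuntu 18.04):
$ sudo systemctl enable --now systemd-resolved
# the resolv.conf file maintained by systemd-resolved should now exist
$ ls -l /run/systemd/resolve/resolv.conf
$ sudo systemctl restart kubelet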
Please execute the following commands after rebooting, or if the node drops out of the cluster:
$ sudo kubeadm reset
$ sudo swapoff -a
$ sudo kubeadm join YourNodeIPAddress --token <token> --discovery-token-ca-cert-hash \
sha256:...
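If the original bootstrap token has expired, a fresh join command (token plus CA cert hash) can be printed on a control-plane node with:
$ sudo kubeadm token create --print-join-command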