Kubespray: adding a worker node fails with kubespray v2.8.2

Created on 5 Sep 2019 · 10 comments · Source: kubernetes-sigs/kubespray

Environment:

  • Cloud provider or hardware configuration:
    Observed on GCP VMs as well as on bare-metal servers.
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    $ printf "$(uname -srm)\n$(cat /etc/os-release)\n"
    Linux 4.15.0-1040-gcp x86_64
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.3 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic

  • Version of Ansible (ansible --version):
    The Ansible version is the one pinned in kubespray/requirements.txt, installed with
    pip install -r kubespray/requirements.txt
    $ ansible --version
    ansible 2.8.3
    config file = None
    configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
    ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
    executable location = /usr/local/bin/ansible
    python version = 2.7.15+ (default, Nov 27 2018, 23:36:35) [GCC 7.3.0]

Kubespray version (commit) (git rev-parse --short HEAD):
git clone -b v2.8.2 https://github.com/kubernetes-sigs/kubespray.git kubespray
$ git rev-parse --short HEAD
4167807f

Network plugin used:
flannel

Copy of your inventory file:
k8s-host.ini
[all]
test-node-1 ansible_host=35.231.123.78 ip=10.142.15.197 ansible_ssh_user=ubuntu
test-node-2 ansible_host=35.211.242.98 ansible_ssh_user=ubuntu

[kube-master]
test-node-1

[kube-node]
test-node-1
test-node-2

[etcd]
test-node-1

[k8s-cluster:children]
kube-master
kube-node

extra-vars.yaml:
provider: gcp
kube_proxy_mode: iptables
docker_version: 18.09
kube_network_plugin: flannel
docker_iptables_enabled: "true"
cluster_name: cluster.local
kubelet_fail_swap_on: false
upstream_dns_servers:
- 8.8.8.8
kube_apiserver_enable_admission_plugins: ["NodeRestriction","PodNodeSelector"]

Command used to invoke ansible:
ansible-playbook -i inventory/sample/k8s-host.ini --become --flush-cache scale.yml --extra-vars=@extra-vars.yaml --private-key=./ssh-rsa

Output of ansible run:
Last few lines (no failures above these):
TASK [kubernetes/node : Cleanup kube-proxy leftovers from node] ***************************
Thursday 05 September 2019 09:35:12 +0000 (0:00:01.500) 0:05:48.478 ***

TASK [kubernetes/node : include_tasks] ************************************
Thursday 05 September 2019 09:35:13 +0000 (0:00:00.128) 0:05:48.606 ***

TASK [kubernetes/node : Write cacert file] **********************************
Thursday 05 September 2019 09:35:13 +0000 (0:00:00.139) 0:05:48.746 ***

TASK [kubernetes/node : Write cloud-config] **********************************
Thursday 05 September 2019 09:35:13 +0000 (0:00:00.135) 0:05:48.882 ***

RUNNING HANDLER [network_plugin/cilium : restart kubelet] *****************************
Thursday 05 September 2019 09:35:13 +0000 (0:00:00.085) 0:05:48.967 ***

TASK [kubernetes/node : Enable kubelet] ***********************************
Thursday 05 September 2019 09:35:13 +0000 (0:00:00.125) 0:05:49.093 ***
fatal: [test-node-2]: FAILED! => {"changed": false, "msg": "Could not find the requested service kubelet: host"}
ok: [test-node-1]

NO MORE HOSTS LEFT *******************************************

PLAY RECAP *********************************************
localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
test-node-1 : ok=240 changed=9 unreachable=0 failed=0 skipped=231 rescued=0 ignored=0
test-node-2 : ok=200 changed=24 unreachable=0 failed=1 skipped=151 rescued=0 ignored=0

Thursday 05 September 2019 09:35:15 +0000 (0:00:01.559) 0:05:50.652 ***

container-engine/docker : Docker | pause while Docker restarts ------------------------------------------------------------------------------- 10.12s
bootstrap-os : Gather nodes hostnames --------------------------------------------------------------------------------------------------------- 7.94s
bootstrap-os : Assign inventory name to unconfigured hostnames (non-CoreOS and Tumbleweed) ---------------------------------------------------- 7.88s
bootstrap-os : Remove require tty ------------------------------------------------------------------------------------------------------------- 7.50s
container-engine/docker : Write docker options systemd drop-in -------------------------------------------------------------------------------- 6.89s
bootstrap-os : Create remote_tmp for it is used by another module ----------------------------------------------------------------------------- 6.88s
container-engine/docker : Write docker dns systemd drop-in ------------------------------------------------------------------------------------ 6.56s
kubernetes/node : install | Write kubelet systemd init file ----------------------------------------------------------------------------------- 6.47s
kubernetes/preinstall : Create kubernetes directories ----------------------------------------------------------------------------------------- 6.46s
kubernetes/node : Write kubelet config file (kubeadm) ----------------------------------------------------------------------------------------- 6.27s
kubernetes/node : nginx-proxy | Write nginx-proxy configuration ------------------------------------------------------------------------------- 6.27s
kubernetes/node : nginx-proxy | Write static pod ---------------------------------------------------------------------------------------------- 6.24s
container-engine/docker : Ensure old versions of Docker are not installed. | Debian ----------------------------------------------------------- 5.90s
kubernetes/preinstall : Update package management cache (APT) --------------------------------------------------------------------------------- 5.52s
container-engine/docker : Set docker pin priority to apt_preferences on Debian family --------------------------------------------------------- 5.01s
container-engine/docker : Write docker.service systemd file ----------------------------------------------------------------------------------- 4.41s
kubernetes/node : Persist br_netfilter module ------------------------------------------------------------------------------------------------- 4.29s
container-engine/docker : ensure docker packages are installed -------------------------------------------------------------------------------- 3.79s
kubernetes/preinstall : Create cni directories ------------------------------------------------------------------------------------------------ 3.47s
kubernetes/node : Enable bridge-nf-call tables ------------------------------------------------------------------------------------------------ 3.47s

Anything else do we need to know:

  • I installed a multi-node cluster (1 schedulable master, 1 worker) using kubespray, and it worked fine.
  • After that I removed the worker node from the cluster and verified the cluster: it still worked. The node was removed successfully and the /etc/kubernetes/ directory was removed from the worker node.
  • Then I tried to add the same worker node back; that always fails with the logs above.
  • Checking the worker node, I found that the /etc/kubernetes/ssl/ directory was empty, i.e. the expected CA cert file was not there. The /etc/kubernetes/ directory was also missing some files that are expected to be present:
    Present:
    $ ls /etc/kubernetes/
    kubelet.env manifests ssl

Expected:
$ ls /etc/kubernetes/
bootstrap-kubelet.conf kubeadm-client.v1alpha3.conf kubelet.conf kubelet.conf.7338.2019-08-17@03:22:52~ kubelet.env manifests ssl

  • Adding a fresh node works, but adding a node that was previously removed from the cluster fails.
Labels: kind/bug, lifecycle/rotten

All 10 comments

Did you try to add the node with cluster.yml rather than scale.yml?
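
For reference, that suggestion amounts to rerunning the reported command with cluster.yml substituted for scale.yml, with the same inventory and extra-vars as above (nothing else changed):

$ ansible-playbook -i inventory/sample/k8s-host.ini --become --flush-cache cluster.yml --extra-vars=@extra-vars.yaml --private-key=./ssh-rsa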

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.


This issue is still observed with kubespray v2.12.0.
The newly added node has the kubelet.service unit file:
$ ls /etc/systemd/system/kubelet.service
/etc/systemd/system/kubelet.service

but the service is not detected:
$ systemctl status kubelet.service
Unit kubelet.service could not be found.

Kubespray 2.11.2, Ansible 2.9.6.
Same problem. I have to go to the node and run systemctl daemon-reload manually; that helps. The code seems correct at first sight.
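
That workaround suggests the unit file gets written but systemd never reloads its unit cache before the playbook tries to enable the service. A minimal sketch of the idea, using the daemon_reload parameter of Ansible's systemd module (this is not kubespray's actual task, just an illustration of the fix the workaround implies):

    # Hypothetical task sketch, not kubespray's real role code:
    # force a reload so systemd sees the freshly written unit file.
    - name: Enable kubelet
      systemd:
        name: kubelet
        daemon_reload: yes   # equivalent to running `systemctl daemon-reload`
        enabled: yes
        state: started

With daemon_reload: yes, systemd re-reads /etc/systemd/system before acting, so a just-written kubelet.service would be found instead of reported missing.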

Same problem here.
HEAD is 826a440f.

TASK [kubernetes/node : Enable kubelet] ****************************************************************************************************************************************************************************
fatal: [frd3kq-k8s01g]: FAILED! => {"changed": false, "msg": "Could not find the requested service kubelet: host"}

Same workaround: ssh to the new worker node and run systemctl daemon-reload:

[root@oshaemoo2ang4eic ~]# systemctl status kubelet
Unit kubelet.service could not be found.
[root@oshaemoo2ang4eic ~]# systemctl daemon-reload
[root@oshaemoo2ang4eic ~]# systemctl status kubelet
โ— kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2020-04-22 15:18:43 CEST; 22min ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 51525 (code=exited, status=0/SUCCESS)

Hello, I had this issue with kubespray v2.13.0 on CentOS Linux 7.7.1908.
To avoid unnecessary costs, two nodes were removed from the cluster and shut down in August; four weeks later I powered them on and tried to reinsert them into the cluster with scale.yml, without success.
As a workaround I had to create new VMs; I was unable to reuse the previous ones.

Yeah, same issue when trying to add a server (Debian buster) that was previously removed (kubespray 2.14). It seems that removing and then re-adding a node is an easy way to reproduce the issue; a sketch of that sequence follows.
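
For anyone reproducing: with the inventory from this report, the remove-then-re-add sequence would look roughly like this. The remove-node.yml invocation and the --limit flag are assumptions based on kubespray's standard playbooks, not commands taken from this thread:

$ # remove the worker from the cluster
$ ansible-playbook -i inventory/sample/k8s-host.ini --become remove-node.yml --extra-vars "node=test-node-2"
$ # then try to add it back; this is the step that fails
$ ansible-playbook -i inventory/sample/k8s-host.ini --become scale.yml --limit=test-node-2 --extra-vars=@extra-vars.yaml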

