Kubespray: Unable to Join Cluster

Created on 18 Apr 2019 · 13 comments · Source: kubernetes-sigs/kubespray

Environment:

  • Cloud provider or hardware configuration: Virtual Machine
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): Ubuntu 16.04

  • Version of Ansible (ansible --version): 2.7.10

Kubespray version (commit) (git rev-parse --short HEAD): 6f919e5

Network plugin used: flannel

Copy of your inventory file:
[all]
master ansible_host=161.X.X.X ip=161.X.X.X ansible_user=raman ansible_sudo=yes
worker ansible_host=161.X.X.X ip=161.X.X.X ansible_user=raman ansible_sudo=yes

[kube-master]
master

[etcd]
master

[kube-node]
worker

[k8s-cluster:children]
kube-node
kube-master

[calico-rr]

Command used to invoke ansible:
ansible-playbook --flush-cache -i inventory/mycluster/inventory.ini cluster.yml --ask-pass --become --ask-become-pass

Output of ansible run:
TASK [kubernetes/kubeadm : Join to cluster] ******************************************************
Thursday 18 April 2019 22:18:57 +0530 (0:00:00.090) 0:03:18.044
fatal: [worker]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}

TASK [kubernetes/kubeadm : Join to cluster with ignores] **************************************************
Thursday 18 April 2019 22:19:59 +0530 (0:01:01.563) 0:04:19.607
fatal: [worker]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}

TASK [kubernetes/kubeadm : Display kubeadm join stderr if any] ************************************************
Thursday 18 April 2019 22:21:00 +0530 (0:01:01.543) 0:05:21.151
fatal: [worker]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stderr_lines'\n\nThe error appears to have been in '/root/kubespray/roles/kubernetes/kubeadm/tasks/main.yml': line 101, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: Display kubeadm join stderr if any\n ^ here\n"}

Anything else we need to know:
On the worker node, kubeadm is not able to join the cluster. I debugged this and suspect the issue is the cgroup driver:
detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd".

I ran the same command on the worker node that Kubespray runs in the mentioned task:
/opt/bin/kubeadm join --config /etc/kubernetes/kubeadm-client.conf --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests-errors
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty
[ERROR KubeletVersion]: couldn't get kubelet version: cannot execute 'kubelet --version': executable file not found in $PATH
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
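One detail worth noting in the output above: the flag passed was --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests-errors, while kubeadm reports the check as DirAvailable--etc-kubernetes-manifests, so the stray -errors suffix means the ignore entry never matches the check. A small sketch of how a comma-separated ignore list is matched against a reported check name (ignores_check is a hypothetical helper written for illustration, not part of kubeadm):

```shell
# Hypothetical helper: does a comma-separated --ignore-preflight-errors list
# cover a given check name? Illustrates why "...-manifests-errors" does not
# suppress the "DirAvailable--etc-kubernetes-manifests" error above.
ignores_check() {
  # $1: comma-separated ignore list, $2: check name from the [ERROR ...] line
  case ",$1," in
    *",$2,"*) return 0 ;;  # exact entry found in the list
    *)        return 1 ;;  # no match: the check stays fatal
  esac
}

# The flag value from the log does not match the reported check name:
ignores_check "DirAvailable--etc-kubernetes-manifests-errors" \
              "DirAvailable--etc-kubernetes-manifests" \
  && echo "ignored" || echo "still fatal"   # prints "still fatal"
```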

I tried to change the cgroup driver, but it is still not working.
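For anyone stuck at the same point, the usual way to switch Docker to the systemd cgroup driver is a daemon config change (a sketch, assuming a stock Docker install that reads /etc/docker/daemon.json; adjust paths for your setup):

```shell
# Sketch: switch Docker to the systemd cgroup driver. Assumes a stock Docker
# install that reads its daemon config from /etc/docker/daemon.json.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker

# Verify which driver is now active:
docker info 2>/dev/null | grep -i 'cgroup driver'
```

Note that the kubelet's configured driver must match whatever Docker ends up using, so changing Docker alone only helps if the kubelet side expects systemd.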

Please suggest a possible solution for this.

Label: kind/bug

Most helpful comment

Also faced with this error. With Ubuntu 18.04.3, kubernetes 1.15.3

ansible-playbook -i inventory/mycluster/hosts.yml scale.yml -b -v --flush-cache

fatal: [kube-node04]: FAILED! => {"changed": true, "cmd": ["timeout", "-k", "120s", "120s", "/usr/local/bin/kubeadm", "join", "--config", "/etc/kubernetes/kubeadm-client.conf", "--ignore-preflight-errors=all"], "delta": "0:02:00.006779", "end": "2019-09-03 15:18:34.720298", "msg": "non-zero return code", "rc": 124, "start": "2019-09-03 15:16:34.713519", "stderr": "\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty\n\t[WARNING FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists\n\t[WARNING FileAvailable--etc-kubernetes-bootstrap-kubelet.conf]: /etc/kubernetes/bootstrap-kubelet.conf already exists\n\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\n\t[WARNING FileAvailable--etc-kubernetes-ssl-ca.crt]: /etc/kubernetes/ssl/ca.crt already exists", "stderr_lines": ["\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty", "\t[WARNING FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists", "\t[WARNING FileAvailable--etc-kubernetes-bootstrap-kubelet.conf]: /etc/kubernetes/bootstrap-kubelet.conf already exists", "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". 
Please follow the guide at https://kubernetes.io/docs/setup/cri/", "\t[WARNING FileAvailable--etc-kubernetes-ssl-ca.crt]: /etc/kubernetes/ssl/ca.crt already exists"], "stdout": "[preflight] Running pre-flight checks\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'\n[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.15\" ConfigMap in the kube-system namespace\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Activating the kubelet service\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n[kubelet-check] Initial timeout of 40s passed.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or 
healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "stdout_lines": ["[preflight] Running pre-flight checks", "[preflight] Reading configuration from the cluster...", "[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'", "[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.15\" ConfigMap in the kube-system namespace", "[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"", "[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"", "[kubelet-start] Activating the kubelet service", "[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...", "[kubelet-check] Initial timeout of 40s passed.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection 
refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused."]}

Please reopen the issue.

All 13 comments

@ssbarnea please help me here :(

I had been pulling the differences into my local copy with git pull, but after I did a clean clone and just swapped in my hosts.ini/inventory.ini file, it works fine again.


+1. I have the same issue here. Please help; it had been working great until today.

fatal: [cvc-prod01]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
fatal: [cvc-prod02]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
fatal: [cvc-prod01]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stderr_lines'\n\nThe error appears to have been in '/Users/myname/code/kubespray/roles/kubernetes/kubeadm/tasks/main.yml': line 95, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: Display kubeadm join stderr if any\n ^ here\n"}
fatal: [cvc-prod02]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stderr_lines'\n\nThe error appears to have been in '/Users/myname/code/kubespray/roles/kubernetes/kubeadm/tasks/main.yml': line 95, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: Display kubeadm join stderr if any\n ^ here\n"}

This error got resolved by creating kubelet.conf in /etc/kubernetes on the worker node with the worker node's respective values, so kubeadm join succeeds, but the worker node does not show up in kubectl get nodes.

Any idea how to resolve this?

Also, the kubeadm-config.yaml file has a KubeProxyConfiguration, but kube-proxy is not getting created.

Fixed with #4600

Can you please help? I am getting the same error, @RamanPndy.
ansible run:
TASK [kubernetes/kubeadm : Join to cluster]
Wednesday 24 April 2019 12:22:20 +0000 (0:00:00.898) 0:04:40.811
fatal: [worker-1]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
fatal: [worker-3]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
fatal: [worker-2]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}

@nadimm90 this error means the kubelet service is not running on some of your worker nodes. Check whether the kubelet.conf file exists in /etc/kubernetes on each worker node; if it is missing, copy it over from the master node into /etc/kubernetes, restart the kubelet service on all worker nodes, and try again. Hopefully it will work then.
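The per-node checks described above can be sketched as shell commands (check_kubelet_conf is a hypothetical helper written for illustration; on a real node you would point it at /etc/kubernetes and follow up with systemctl):

```shell
# Hypothetical helper for the check described above: is kubelet.conf present
# in a given directory (normally /etc/kubernetes)?
check_kubelet_conf() {
  if [ -f "$1/kubelet.conf" ]; then
    echo "kubelet.conf present"
  else
    echo "kubelet.conf MISSING"
  fi
}

# On a real worker node the full check would look like:
#   check_kubelet_conf /etc/kubernetes
#   sudo systemctl status kubelet
#   sudo journalctl -u kubelet -n 50 --no-pager   # why isn't it running?
#   sudo systemctl restart kubelet
```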

@RamanPndy is there any solution to this problem? I also encountered this error.

Please check that the kubelet service is running on the worker nodes; if it is not, try to debug and fix it. The kubelet service must be running on each worker node for the node to join the cluster.

Still happening on Ubuntu 18.04. When adding a node, the "detected cgroupfs" error occurs and the kubelet service is not started. "Try to debug and fix kubelet" is not a solution. Please reopen this issue.


I can also confirm on Ubuntu 18.04.3 LTS that this occurs when running the scale.yml playbook on a brand new Kubernetes cluster.

Checking kubelet logs on the failed scaled node, I see:
Sep 15 18:58:19 mp25 kubelet[28472]: F0915 18:58:19.179238 28472 server.go:273] failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"

This is quite unusual since I have not overridden any drivers. When running the cluster.yml playbook (when creating a new cluster) with the same inventory file along with my overrides file I do not encounter this issue.

Poking around the installation on the failed scaled node, I found that /etc/kubernetes/kubelet-config.yaml contained the line cgroupDriver: systemd. Checking the same file on a node that was successfully installed into the cluster revealed the same contents (a strange find).

The culprit seems to be the template file kubespray/roles/kubernetes/node/templates/kubelet-config.v1beta1.yaml.j2. Adding this to my inventory.ini file:

[all:vars]
ansible_user=ubuntu
ansible_python_interpreter=/usr/bin/python3
kubelet_cgroup_driver=cgroupfs

solved my problem.

I was using Kubespray from the master branch (commit: 6fe2248314fb319563a60ae023b552371e34e148)

I think this is a bug in Kubespray's Docker cgroup driver lookup.
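A quick way to spot the mismatch described above is to compare what Docker reports with what the kubelet config file says. A sketch (parse_cgroup_driver is a small helper written for illustration; the kubelet-config.yaml path assumes a Kubespray-managed node):

```shell
# parse_cgroup_driver extracts the "Cgroup Driver" value from `docker info`
# output supplied on stdin.
parse_cgroup_driver() {
  sed -n 's/^ *Cgroup Driver: *//p'
}

# On a real node (paths assume a Kubespray layout):
#   docker_driver=$(docker info 2>/dev/null | parse_cgroup_driver)
#   kubelet_driver=$(sed -n 's/^cgroupDriver: *//p' /etc/kubernetes/kubelet-config.yaml)
#   [ "$docker_driver" = "$kubelet_driver" ] || echo "MISMATCH: kubelet will not start"
```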

Same problem here. New installation, Ubuntu 18.04, scaling from 3 to 4 nodes ends up with failure.
But, as @greenstatic suggested, adding kubelet_cgroup_driver=cgroupfs to vars solves the problem.

Same here.
OS: Centos 7.6
Kubespray release-2.11
Cmd: ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root --flush-cache cluster.yml
