Kubespray: Unable to Join Cluster

Created on 18 Apr 2019 · 13 comments · Source: kubernetes-sigs/kubespray

Environment:

  • Cloud provider or hardware configuration: Virtual Machine
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): Ubuntu 16.04

  • Version of Ansible (ansible --version): 2.7.10

Kubespray version (commit) (git rev-parse --short HEAD): 6f919e5

Network plugin used: flannel

Copy of your inventory file:
[all]
master ansible_host=161.X.X.X ip=161.X.X.X ansible_user=raman ansible_sudo=yes
worker ansible_host=161.X.X.X ip=161.X.X.X ansible_user=raman ansible_sudo=yes

[kube-master]
master

[etcd]
master

[kube-node]
worker

[k8s-cluster:children]
kube-node
kube-master

[calico-rr]

Command used to invoke ansible:
ansible-playbook --flush-cache -i inventory/mycluster/inventory.ini cluster.yml --ask-pass --become --ask-become-pass

Output of ansible run:
TASK [kubernetes/kubeadm : Join to cluster] ******************************************************
Thursday 18 April 2019 22:18:57 +0530 (0:00:00.090) 0:03:18.044
fatal: [worker]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}

TASK [kubernetes/kubeadm : Join to cluster with ignores] **************************************************
Thursday 18 April 2019 22:19:59 +0530 (0:01:01.563) 0:04:19.607
fatal: [worker]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}

TASK [kubernetes/kubeadm : Display kubeadm join stderr if any] ************************************************
Thursday 18 April 2019 22:21:00 +0530 (0:01:01.543) 0:05:21.151
fatal: [worker]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stderr_lines'\n\nThe error appears to have been in '/root/kubespray/roles/kubernetes/kubeadm/tasks/main.yml': line 101, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: Display kubeadm join stderr if any\n ^ here\n"}

Anything else we need to know:
On the worker node, kubeadm is not able to join the cluster. I debugged this and suspect the issue is the cgroup driver:
detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd".

I ran the same command on the worker node that Kubespray runs in the mentioned task:
/opt/bin/kubeadm join --config /etc/kubernetes/kubeadm-client.conf --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests-errors
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty
[ERROR KubeletVersion]: couldn't get kubelet version: cannot execute 'kubelet --version': executable file not found in $PATH
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
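One detail worth noting in the output above: the flag passed was --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests-errors, while kubeadm reports the check as DirAvailable--etc-kubernetes-manifests, so the stray -errors suffix means the ignore entry never matches the check. A small sketch of how a comma-separated ignore list is matched against a reported check name (ignores_check is a hypothetical helper written for illustration, not part of kubeadm):

```shell
# Hypothetical helper: does a comma-separated --ignore-preflight-errors list
# cover a given check name? Illustrates why "...-manifests-errors" does not
# suppress the "DirAvailable--etc-kubernetes-manifests" error above.
ignores_check() {
  # $1: comma-separated ignore list, $2: check name from the [ERROR ...] line
  case ",$1," in
    *",$2,"*) return 0 ;;  # exact entry found in the list
    *)        return 1 ;;  # no match: the check stays fatal
  esac
}

# The flag value from the log does not match the reported check name:
ignores_check "DirAvailable--etc-kubernetes-manifests-errors" \
              "DirAvailable--etc-kubernetes-manifests" \
  && echo "ignored" || echo "still fatal"   # prints "still fatal"
```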

I tried to change the cgroup driver, but it is still not working.
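For anyone stuck at the same point, the usual way to switch Docker to the systemd cgroup driver is a daemon config change (a sketch, assuming a stock Docker install that reads /etc/docker/daemon.json; adjust paths for your setup):

```shell
# Sketch: switch Docker to the systemd cgroup driver. Assumes a stock Docker
# install that reads its daemon config from /etc/docker/daemon.json.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker

# Verify which driver is now active:
docker info 2>/dev/null | grep -i 'cgroup driver'
```

Note that the kubelet's configured driver must match whatever Docker ends up using, so changing Docker alone only helps if the kubelet side expects systemd.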

Please suggest a possible solution for this.

Label: kind/bug

Most helpful comment

Also faced with this error. With Ubuntu 18.04.3, kubernetes 1.15.3

ansible-playbook -i inventory/mycluster/hosts.yml scale.yml -b -v --flush-cache

fatal: [kube-node04]: FAILED! => {"changed": true, "cmd": ["timeout", "-k", "120s", "120s", "/usr/local/bin/kubeadm", "join", "--config", "/etc/kubernetes/kubeadm-client.conf", "--ignore-preflight-errors=all"], "delta": "0:02:00.006779", "end": "2019-09-03 15:18:34.720298", "msg": "non-zero return code", "rc": 124, "start": "2019-09-03 15:16:34.713519", "stderr": "\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty\n\t[WARNING FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists\n\t[WARNING FileAvailable--etc-kubernetes-bootstrap-kubelet.conf]: /etc/kubernetes/bootstrap-kubelet.conf already exists\n\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\n\t[WARNING FileAvailable--etc-kubernetes-ssl-ca.crt]: /etc/kubernetes/ssl/ca.crt already exists", "stderr_lines": ["\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty", "\t[WARNING FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists", "\t[WARNING FileAvailable--etc-kubernetes-bootstrap-kubelet.conf]: /etc/kubernetes/bootstrap-kubelet.conf already exists", "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". 
Please follow the guide at https://kubernetes.io/docs/setup/cri/", "\t[WARNING FileAvailable--etc-kubernetes-ssl-ca.crt]: /etc/kubernetes/ssl/ca.crt already exists"], "stdout": "[preflight] Running pre-flight checks\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'\n[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.15\" ConfigMap in the kube-system namespace\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Activating the kubelet service\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n[kubelet-check] Initial timeout of 40s passed.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or 
healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "stdout_lines": ["[preflight] Running pre-flight checks", "[preflight] Reading configuration from the cluster...", "[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'", "[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.15\" ConfigMap in the kube-system namespace", "[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"", "[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"", "[kubelet-start] Activating the kubelet service", "[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...", "[kubelet-check] Initial timeout of 40s passed.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection 
refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused."]}

Please reopen the issue.

All 13 comments

@ssbarnea please help me here :(

I had been pulling the differences into my local copy with git pull, but after I did a clean clone and just swapped in my hosts.ini/inventory.ini file, it works fine again.


+1. I have the same issue here. Please help; it had been working great until today.

fatal: [cvc-prod01]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
fatal: [cvc-prod02]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
fatal: [cvc-prod01]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stderr_lines'\n\nThe error appears to have been in '/Users/myname/code/kubespray/roles/kubernetes/kubeadm/tasks/main.yml': line 95, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: Display kubeadm join stderr if any\n ^ here\n"}
fatal: [cvc-prod02]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stderr_lines'\n\nThe error appears to have been in '/Users/myname/code/kubespray/roles/kubernetes/kubeadm/tasks/main.yml': line 95, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: Display kubeadm join stderr if any\n ^ here\n"}

This error got resolved by creating kubelet.conf in /etc/kubernetes on the worker node with the worker node's respective values, so kubeadm join succeeds, but the worker node does not show up in kubectl get nodes.

Any idea how to resolve this?

Also, the kubeadm-config.yaml file has a KubeProxyConfiguration, but kube-proxy is not getting created.

Fixed with #4600

Can you please help? I am getting the same error, @RamanPndy.
ansible run:
TASK [kubernetes/kubeadm : Join to cluster]
Wednesday 24 April 2019 12:22:20 +0000 (0:00:00.898) 0:04:40.811
fatal: [worker-1]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
fatal: [worker-3]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}
fatal: [worker-2]: FAILED! => {"changed": false, "msg": "async task did not complete within the requested time"}

@nadimm90 this error means the kubelet service is not running on some of your worker nodes. Check whether the kubelet.conf file exists in /etc/kubernetes on each worker node; if it is missing, copy it over from the master node into /etc/kubernetes, restart the kubelet service on all worker nodes, and try again. Hopefully it will work then.
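The per-node checks described above can be sketched as shell commands (check_kubelet_conf is a hypothetical helper written for illustration; on a real node you would point it at /etc/kubernetes and follow up with systemctl):

```shell
# Hypothetical helper for the check described above: is kubelet.conf present
# in a given directory (normally /etc/kubernetes)?
check_kubelet_conf() {
  if [ -f "$1/kubelet.conf" ]; then
    echo "kubelet.conf present"
  else
    echo "kubelet.conf MISSING"
  fi
}

# On a real worker node the full check would look like:
#   check_kubelet_conf /etc/kubernetes
#   sudo systemctl status kubelet
#   sudo journalctl -u kubelet -n 50 --no-pager   # why isn't it running?
#   sudo systemctl restart kubelet
```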

@RamanPndy is there any solution to this problem? I also encountered this error.

Please check that the kubelet service is running on the worker nodes; if it is not, try to debug and fix it. The kubelet service must be running on each worker node for the node to join the cluster.

Still happening on Ubuntu 18.04. When adding a node, the "detected cgroupfs" error occurs and the kubelet service is not started. "Try to debug and fix kubelet" is not a solution. Please reopen this issue.


I can also confirm on Ubuntu 18.04.3 LTS that this occurs when running the scale.yml playbook on a brand new Kubernetes cluster.

Checking kubelet logs on the failed scaled node, I see:
Sep 15 18:58:19 mp25 kubelet[28472]: F0915 18:58:19.179238 28472 server.go:273] failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"

This is quite unusual since I have not overridden any drivers. When running the cluster.yml playbook (when creating a new cluster) with the same inventory file along with my overrides file I do not encounter this issue.

Poking around the installation on the failed scaled node, I found that /etc/kubernetes/kubelet-config.yaml contained the line cgroupDriver: systemd. Checking the same file on a node that was successfully installed into the cluster revealed the same contents (a strange find).

The culprit seems to be the template file kubespray/roles/kubernetes/node/templates/kubelet-config.v1beta1.yaml.j2. Adding this to my inventory.ini file:

[all:vars]
ansible_user=ubuntu
ansible_python_interpreter=/usr/bin/python3
kubelet_cgroup_driver=cgroupfs

solved my problem.

I was using Kubespray from the master branch (commit: 6fe2248314fb319563a60ae023b552371e34e148)

I think this is a bug in Kubespray's Docker cgroup driver lookup.
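A quick way to spot the mismatch described above is to compare what Docker reports with what the kubelet config file says. A sketch (parse_cgroup_driver is a small helper written for illustration; the kubelet-config.yaml path assumes a Kubespray-managed node):

```shell
# parse_cgroup_driver extracts the "Cgroup Driver" value from `docker info`
# output supplied on stdin.
parse_cgroup_driver() {
  sed -n 's/^ *Cgroup Driver: *//p'
}

# On a real node (paths assume a Kubespray layout):
#   docker_driver=$(docker info 2>/dev/null | parse_cgroup_driver)
#   kubelet_driver=$(sed -n 's/^cgroupDriver: *//p' /etc/kubernetes/kubelet-config.yaml)
#   [ "$docker_driver" = "$kubelet_driver" ] || echo "MISMATCH: kubelet will not start"
```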

Same problem here. New installation, Ubuntu 18.04, scaling from 3 to 4 nodes ends up with failure.
But, as @greenstatic suggested, adding kubelet_cgroup_driver=cgroupfs to vars solves the problem.

Same here.
OS: Centos 7.6
Kubespray release-2.11
Cmd: ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root --flush-cache cluster.yml
