Kubespray: kubelet start error

Created on 4 Sep 2019 · 12 comments · Source: kubernetes-sigs/kubespray

Environment:

  • Cloud provider or hardware configuration:
    Baremetal
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 4.18.0-25-generic x86_64
    NAME="Ubuntu"
    VERSION="18.10 (Cosmic Cuttlefish)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.10"
    VERSION_ID="18.10"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=cosmic
    UBUNTU_CODENAME=cosmic

  • Version of Ansible (ansible --version):
    ansible 2.7.8
    config file = None
    configured module search path = ['/home/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
    executable location = /usr/local/bin/ansible
    python version = 3.6.8 (default, Apr 9 2019, 04:59:38) [GCC 8.3.0]

Kubespray version (commit) (git rev-parse --short HEAD):
56523812

Network plugin used:
default

Copy of your inventory file:

Command used to invoke ansible:
ansible-playbook -i inventory/mycluster/hosts.yml scale.yml -b -v --flush-cache

Output of ansible run:
fatal: [kube-node04]: FAILED! => {"changed": true, "cmd": ["timeout", "-k", "120s", "120s", "/usr/local/bin/kubeadm", "join", "--config", "/etc/kubernetes/kubeadm-client.conf", "--ignore-preflight-errors=all"], "delta": "0:02:00.006779", "end": "2019-09-03 15:18:34.720298", "msg": "non-zero return code", "rc": 124, "start": "2019-09-03 15:16:34.713519", "stderr": "\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty\n\t[WARNING FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists\n\t[WARNING FileAvailable--etc-kubernetes-bootstrap-kubelet.conf]: /etc/kubernetes/bootstrap-kubelet.conf already exists\n\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\n\t[WARNING FileAvailable--etc-kubernetes-ssl-ca.crt]: /etc/kubernetes/ssl/ca.crt already exists", "stderr_lines": ["\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty", "\t[WARNING FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists", "\t[WARNING FileAvailable--etc-kubernetes-bootstrap-kubelet.conf]: /etc/kubernetes/bootstrap-kubelet.conf already exists", "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". 
Please follow the guide at https://kubernetes.io/docs/setup/cri/", "\t[WARNING FileAvailable--etc-kubernetes-ssl-ca.crt]: /etc/kubernetes/ssl/ca.crt already exists"], "stdout": "[preflight] Running pre-flight checks\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'\n[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.15\" ConfigMap in the kube-system namespace\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Activating the kubelet service\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n[kubelet-check] Initial timeout of 40s passed.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or 
healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "stdout_lines": ["[preflight] Running pre-flight checks", "[preflight] Reading configuration from the cluster...", "[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'", "[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.15\" ConfigMap in the kube-system namespace", "[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"", "[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"", "[kubelet-start] Activating the kubelet service", "[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...", "[kubelet-check] Initial timeout of 40s passed.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection 
refused.", "[kubelet-check] It seems like the kubelet isn't running or healthy.", "[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused."]}

Anything else we need to know:
A day ago I successfully set up a Kubernetes cluster of 6 nodes (3 masters & 3 workers) with
ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root --flush-cache cluster.yml

Labels: kind/bug, lifecycle/rotten

All 12 comments

With ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root --flush-cache cluster.yml the host is successfully added to the cluster.

Hi @vilyansky, I got the same issue with the scale.yml playbook when trying Kubespray a day ago.
The target nodes are Debian Stretch on AWS, provisioned with the terraform contrib repository.

Digging a little, I came to the following conclusion:
cgroupDriver: {{ kubelet_cgroup_driver|default(kubelet_cgroup_driver_detected) }} in roles/kubernetes/node/templates/kubelet-config.v1beta1.yaml.j2 resolves to cgroupDriver: cgroupfs (correct) with cluster.yml but to cgroupDriver: systemd (wrong) with scale.yml.

This happens because kubelet_cgroup_driver is set to systemd in roles/container-engine/containerd/defaults/main.yml.
That file is included as a conditional dependency of roles/container-engine/meta/main.yml, but it looks like its defaults are included regardless of the when: container_manager == 'containerd' statement.

The variable kubelet_cgroup_driver seems to be scoped to the play.
In cluster.yml, the roles container-engine and kubernetes/node are in distinct plays. In scale.yml, they are in the same play.
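As a minimal sketch of the scoping behavior described above (role and host names are hypothetical, and the Kubespray role layout is simplified): a role pulled in as a meta dependency with a false `when` condition is skipped at task level, but its defaults still appear to be loaded for the whole play, so a later role in the same play sees them.

```yaml
# Hypothetical reproduction, NOT the actual Kubespray files.
#
# roles/container-engine/meta/main.yml contains something like:
#   dependencies:
#     - role: containerd          # defaults/main.yml sets kubelet_cgroup_driver: systemd
#       when: container_manager == 'containerd'
#
# With container_manager == 'docker', the dependency's tasks are skipped,
# but in a single play its default is still visible to later roles:
- hosts: kube-node
  roles:
    - container-engine   # dependency skipped, defaults still loaded
    - kubernetes/node    # template now sees kubelet_cgroup_driver == 'systemd'
```

Running the two roles in separate plays (as cluster.yml does) avoids this, because role defaults do not carry over between plays.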

So I'm wondering 2 things:

  • to Ansible experts: is it an Ansible bug or expected behavior to include the defaults of a meta dependency when its when condition is not matched? (the docs are not so clear about it)
  • do we still need kubelet_cgroup_driver as a default variable for the container-engine/containerd subrole, since:

    • autodetection should work via kubelet_cgroup_driver_detected if it is not specified

    • it appears to be used only by the template roles/kubernetes/node/templates/kubelet-config.v1beta1.yaml.j2

    • kubelet_cgroup_driver will still work as expected if the variable is customized by the user

Thanks @vilyansky for spotting it nearly at the same time, I feel not alone :wink:

For now, my (hacky) workaround is to split scale.yml into 2 distinct plays:

diff --git a/scale.yml b/scale.yml
index 094be2a1..14d82a0a 100644
--- a/scale.yml
+++ b/scale.yml
@@ -43,6 +43,13 @@
     - { role: container-engine, tags: "container-engine", when: deploy_container_engine|default(true) }
     - { role: download, tags: download, when: "not skip_downloads" }
     - { role: etcd, tags: etcd, etcd_cluster_setup: false }
+  environment: "{{ proxy_env }}"
+
+- name: Target only workers to get kubelet installed and checking in on any new nodes end
+  hosts: kube-node
+  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
+  roles:
+    - { role: kubespray-defaults}
     - { role: kubernetes/node, tags: node }
     - { role: kubernetes/kubeadm, tags: kubeadm }
     - { role: network_plugin, tags: network }

Checking kubelet logs on the node added with scale.yml:

Sep 06 12:53:34 node4 kubelet[19219]: F0906 12:53:34.030259   19219 server.go:273] failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgoup driver: "cgroupfs"
Sep 06 12:53:34 node4 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Sep 06 12:53:34 node4 systemd[1]: Unit kubelet.service entered failed state.
Sep 06 12:53:34 node4 systemd[1]: kubelet.service failed.
Sep 06 12:53:44 node4 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Sep 06 12:53:44 node4 systemd[1]: Stopped Kubernetes Kubelet Server.
Sep 06 12:53:44 node4 systemd[1]: Started Kubernetes Kubelet Server.
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417555   19260 flags.go:33] FLAG: --address="0.0.0.0"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417668   19260 flags.go:33] FLAG: --allowed-unsafe-sysctls="[]"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417678   19260 flags.go:33] FLAG: --alsologtostderr="false"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417686   19260 flags.go:33] FLAG: --anonymous-auth="true"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417693   19260 flags.go:33] FLAG: --application-metrics-count-limit="100"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417699   19260 flags.go:33] FLAG: --authentication-token-webhook="false"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417704   19260 flags.go:33] FLAG: --authentication-token-webhook-cache-ttl="2m0s"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417714   19260 flags.go:33] FLAG: --authorization-mode="AlwaysAllow"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417722   19260 flags.go:33] FLAG: --authorization-webhook-cache-authorized-ttl="5m0s"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417729   19260 flags.go:33] FLAG: --authorization-webhook-cache-unauthorized-ttl="30s"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417734   19260 flags.go:33] FLAG: --azure-container-registry-config=""
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417742   19260 flags.go:33] FLAG: --boot-id-file="/proc/sys/kernel/random/boot_id"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417748   19260 flags.go:33] FLAG: --bootstrap-checkpoint-path=""
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417754   19260 flags.go:33] FLAG: --bootstrap-kubeconfig="/etc/kubernetes/bootstrap-kubelet.conf"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417760   19260 flags.go:33] FLAG: --cert-dir="/var/lib/kubelet/pki"
Sep 06 12:53:44 node4 kubelet[19260]: I0906 12:53:44.417766   19260 flags.go:33] FLAG: --cgroup-driver="cgroupfs"

So the kubelet flag says it's cgroupfs, but it still can't start?

Hi @MikeMichel, thanks for your feedback.
Can you check the content of the files /etc/kubernetes/kubelet-config.yaml and /var/lib/kubelet/config.yaml (the cgroupDriver attribute) on your node?
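For reference, a quick way to do that check on the failing node is to grep both kubelet config files and compare with what Docker reports (paths as used in this thread; the exact output depends on your node):

```shell
# Show the cgroup driver each kubelet config file declares.
grep -H cgroupDriver /etc/kubernetes/kubelet-config.yaml /var/lib/kubelet/config.yaml
# Show the driver Docker itself is using; all three should agree.
docker info --format '{{.CgroupDriver}}'
```

If the files disagree (e.g. one says systemd and the other cgroupfs), you are hitting the mismatch described in this issue.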

It did not work: the playbook fails with the same message as for vilyansky. Thanks for the hint about /etc/kubernetes/kubelet-config.yaml; it was indeed set to systemd, while many other configs such as /var/lib/kubelet/config.yaml are set to cgroupfs.

For people wanting to fix this for the scale.yml playbook without many changes, setting the following variable in the inventory fixed the issue for me:

...
[all:vars]
...
kubelet_cgroup_driver="cgroupfs"
...
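Alternatively (an assumption on my part about the standard Kubespray inventory layout, not something tested in this thread), the same override should also work from the cluster group vars file rather than [all:vars]:

```yaml
# inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml (assumed path)
# Pin the kubelet cgroup driver to match Docker's, as in this thread.
kubelet_cgroup_driver: cgroupfs
```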

Thanks @clook for the pointer.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Hi @clook @lucasslima,
I am a newbie trying to deploy a k8s cluster with Kubespray (version: 2.11.0), Terraform (version: 0.11.13), Python (version: 2.7.17) and Ansible (version: 2.7.8). I was able to create the server infrastructure but, while running the ansible command, I am facing the same issue as reported by @vilyansky. The error message is given below:
TASK [kubernetes/kubeadm : Join to cluster] *************************************************************************************************************************************************************************************
Sunday 09 February 2020 11:18:32 +0000 (0:00:01.138) 0:11:40.322 *******
fatal: [kubernetes-fancy-dev-worker1]: FAILED! => {"changed": true, "cmd": ["timeout", "-k", "120s", "120s", "/opt/bin/kubeadm", "join", "--config", "/etc/kubernetes/kubeadm-client.conf", "--ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests"], "delta": "0:02:00.004830", "end": "2020-02-09 11:20:33.302557", "msg": "non-zero return code", "rc": 124, "start": "2020-02-09 11:18:33.297727", "stderr": "\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty\n\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/", "stderr_lines": ["\t[WARNING DirAvailable--etc-kubernetes-manifests]: /etc/kubernetes/manifests is not empty", "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/"], "stdout": "[preflight] Running pre-flight checks", "stdout_lines": ["[preflight] Running pre-flight checks"]}

@lucasslima, I am using cluster.yml and, as per your suggestion, I did try with
[all:vars] kubelet_cgroup_driver="cgroupfs"
in my inventory/local/host.yml file, but still no luck.

@clook, I also tried setting cgroupDriver: "cgroupfs" in cluster.yml, but I am still receiving the same error.

@clook, one more thing: I am getting an error with respect to kube-proxy, given below:
fatal: [kubernetes-fancy-dev-master0]: FAILED! => {"changed": true, "cmd": "/opt/bin/kubectl --kubeconfig /etc/kubernetes/admin.conf get ds kube-proxy --namespace=kube-system -o jsonpath='{.spec.template.spec.nodeSelector.beta.kubernetes.io/os}'", "delta": "0:00:00.099808", "end": "2020-02-09 11:23:07.468668", "msg": "non-zero return code", "rc": 1, "start": "2020-02-09 11:23:07.368860", "stderr": "Error from server (NotFound): daemonsets.extensions \"kube-proxy\" not found", "stderr_lines": ["Error from server (NotFound): daemonsets.extensions \"kube-proxy\" not found"], "stdout": "", "stdout_lines": []}

To resolve the kube-proxy issue, I referred to this comment and tried to apply the DaemonSet manually, but the master wouldn't allow me to do that, throwing this error:
error: unable to recognize "kube-proxy-daemonset.yaml": Get http://localhost:8080/api?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused

Can anyone help me out here? An early response is appreciated.

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
