rke starting kubelet fails native docker daemon with CentOS 7.9 nodes

Created on 6 Jan 2021 · 9 Comments · Source: rancher/rke

RKE version:
rke v1.1.12, v1.2.3 and v1.2.4-rc9

Docker version: (docker version,docker info preferred)
docker-1.13.1-203.git0be3e21.el7

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
CentOS 7.9 with Linux kernel 3.10.0-1160.11.1.el7.x86_64
SELinux enabled.

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
VMware VMs

cluster.yml:
default/stock rke generated cluster.yml for kubernetes v1.18.6 / v1.18.12 / v1.18.14.

Steps to Reproduce:

  • Use rke to generate a cluster.yml for a Kubernetes v1.18 cluster with the default options.
  • Verify that docker works on the nodes: run "docker ps", start some containers, etc.
  • Use rke to provision the Kubernetes cluster.
  • The docker daemon on the nodes crashes or stops responding when rke starts the kubelet container.
  • After that, even "docker ps" on the nodes hangs indefinitely with no output.
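Since "docker ps" hangs rather than failing, a check with a time limit is easier to script than waiting on it by hand. The following is only a minimal sketch: the helper name `responds_within` and the 10 second limit are my own assumptions, not something rke or docker provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if a command finishes within a time limit.
# Usage: responds_within SECONDS COMMAND [ARGS...]
responds_within() {
  local limit="$1"
  shift
  # timeout(1) from coreutils kills the command and exits non-zero
  # (status 124) when the limit is reached.
  timeout "$limit" "$@" >/dev/null 2>&1
}

# Run on a node before and after rke starts the kubelet container.
# The 10 second limit is an arbitrary choice for illustration.
if responds_within 10 docker ps; then
  echo "docker daemon is responding"
else
  echo "docker ps failed or did not answer within 10s"
fi
```

Running it once before provisioning and again while rke is retrying the kubelet start should show the point at which the daemon stops answering.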

Results:

rke works OK if docker on the nodes is downgraded to the CentOS 7.8 version (docker-1.13.1-162.git64e9980.el7) before running rke. The problem seems to happen only with the latest native docker rpm (docker-1.13.1-203.git0be3e21.el7) on the nodes.

Otherwise docker on the nodes seems to work, up to the moment rke starts kubelet. At that point the docker daemon seems to crash.

rke output when nodes are running "docker-1.13.1-203.git0be3e21":

INFO[0409] Starting container [kubelet] on host [10.10.10.12], try #1
INFO[0409] Starting container [kubelet] on host [10.10.10.13], try #1
INFO[0409] Starting container [kubelet] on host [10.10.10.11], try #1
DEBU[0459] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0459] Can't start Docker container [kubelet] on host [10.10.10.12]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0459] Starting container [kubelet] on host [10.10.10.12], try #2
DEBU[0459] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0459] Can't start Docker container [kubelet] on host [10.10.10.13]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0459] Starting container [kubelet] on host [10.10.10.13], try #2
DEBU[0459] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0459] Can't start Docker container [kubelet] on host [10.10.10.11]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0459] Starting container [kubelet] on host [10.10.10.11], try #2
DEBU[0510] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0510] Can't start Docker container [kubelet] on host [10.10.10.12]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0510] Starting container [kubelet] on host [10.10.10.12], try #3
DEBU[0510] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0510] Can't start Docker container [kubelet] on host [10.10.10.11]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0510] Starting container [kubelet] on host [10.10.10.11], try #3
DEBU[0510] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0510] Can't start Docker container [kubelet] on host [10.10.10.13]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0510] Starting container [kubelet] on host [10.10.10.13], try #3
DEBU[0510] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0510] Can't start Docker container [kubelet] on host [10.10.10.11]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0510] Starting container [kubelet] on host [10.10.10.11], try #3
DEBU[0510] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0561] Can't start Docker container [kubelet] on host [10.10.10.12]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
DEBU[0561] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0561] Can't start Docker container [kubelet] on host [10.10.10.13]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
DEBU[0561] FIXME: Got an status-code for which error does not match any expected type!!!: -1  module=api status_code=-1
WARN[0561] Can't start Docker container [kubelet] on host [10.10.10.11]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

gz#15619

internal kinbug

All 9 comments

I've also run into this and have not been able to find a workaround or fix.

I've also hit this and haven't found a workaround.

Well, the workaround I mentioned above works, i.e. downgrading to docker-1.13.1-162.git64e9980.el7.

It would be nice to figure out, though, why rke/kubelet does not work with the latest el7.9 docker version.

Can confirm: downgrading docker to 1.13.1-162 fixes this.

I can also confirm.

Installing docker-1.13.1-162.git64e9980.el7_8 on RHEL 7.9 fixes this.
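For anyone applying the downgrade, here is a rough sketch that only prints the yum command for review instead of running it. Two things in it are assumptions on my part: that docker-client and docker-common have to be downgraded together with docker, and the exact version-release string, which differs between CentOS and RHEL and should be taken from `yum --showduplicates list docker` on the node. The `TARGET` variable and the `downgrade_cmd` helper are hypothetical names.

```shell
#!/usr/bin/env bash
# Assumption: the exact version-release string. Check the real one with
#   yum --showduplicates list docker
TARGET="${TARGET:-1.13.1-162.git64e9980.el7}"

# Hypothetical helper: build the downgrade command for a version-release.
# Assumption: docker-client and docker-common are versioned with docker.
downgrade_cmd() {
  local v="$1"
  echo "yum downgrade -y docker-$v docker-client-$v docker-common-$v"
}

# Print the command instead of running it, so it can be reviewed first.
downgrade_cmd "$TARGET"
```

A later `yum update` would pull the newer build back in, so pinning the package (for example with the yum versionlock plugin) is probably worth considering after the downgrade.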

I'm not sure if this is a Docker or an RKE issue.

I was able to recreate this on DigitalOcean:

yum update -y # to get to 7.9, looks like 7.6 is shipped
yum install docker -y
systemctl enable docker
reboot # for kernel update

Follow the docs for dockerroot group ownership.

rke up reveals the following:

INFO[0102] Starting container [kubelet] on host [68.183.116.42], try #1
WARN[0152] Can't start Docker container [kubelet] on host [68.183.116.42]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0152] Starting container [kubelet] on host [68.183.116.42], try #2
WARN[0203] Can't start Docker container [kubelet] on host [68.183.116.42]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
INFO[0203] Starting container [kubelet] on host [68.183.116.42], try #3
WARN[0253] Can't start Docker container [kubelet] on host [68.183.116.42]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
FATA[0253] [workerPlane] Failed to bring up Worker Plane: [Failed to start [kubelet] container on host [68.183.116.42]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?]

Docker commands also seem to go unresponsive.

To add to @bentastic27's comment: I followed the same process and also disabled SELinux for the engine, and I got the same result.

It seems like the same kubelet / docker crash problem still happens with the latest native el7 docker version: docker-1.13.1-204.git0be3e21.el7.x86_64
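To check a node against the builds named in this thread, something like the sketch below could be used. It only knows about the two builds reported here (-203 and -204) and says nothing about other builds. The helper name `is_reported_affected` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when a name-version-release string matches
# one of the builds this thread reports as affected (-203 and -204).
is_reported_affected() {
  case "$1" in
    docker-1.13.1-203.*|docker-1.13.1-204.*) return 0 ;;
    *) return 1 ;;
  esac
}

# rpm -q prints name-version-release.arch for an installed package.
installed="$(rpm -q docker 2>/dev/null || true)"
if is_reported_affected "$installed"; then
  echo "affected build installed: $installed"
else
  echo "not one of the builds reported in this thread: ${installed:-unknown}"
fi
```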

I believe I reproduced the problem in Red Hat Bugzilla 1943700 and found it likely to be the issue described in Red Hat Bugzilla 1896883, so I closed the former in favor of the latter. You can follow along there. Thanks!
