Kind: After restarting Docker, the kind cluster is unreachable

Created on 23 Jun 2020  ·  12 Comments  ·  Source: kubernetes-sigs/kind

I created a kind cluster with the following YAML:

# a cluster with 3 control-plane nodes and 3 workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
tsunomur@VM:~$ kind create cluster --config kind-example-config.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.18.2) 🖼
 ✓ Preparing nodes 📦 📦 📦 📦 📦 📦
 ✓ Configuring the external load balancer ⚖️
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining more control-plane nodes 🎮
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Thanks for using kind! 😊
tsunomur@VM:~$ kubectl cluster-info --context kind-kind
Kubernetes master is running at https://127.0.0.1:43185
KubeDNS is running at https://127.0.0.1:43185/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
tsunomur@VM:~$ docker ps -a
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS                       NAMES
03dad9ed89f2        kindest/node:v1.18.2           "/usr/local/bin/entr…"   8 minutes ago       Up 6 minutes                                    kind-worker2
cbd3f2c279a8        kindest/node:v1.18.2           "/usr/local/bin/entr…"   8 minutes ago       Up 6 minutes        127.0.0.1:44681->6443/tcp   kind-control-plane3
1531621e9806        kindest/node:v1.18.2           "/usr/local/bin/entr…"   8 minutes ago       Up 6 minutes                                    kind-worker3
1ceaa76b5149        kindest/haproxy:2.1.1-alpine   "/docker-entrypoint.…"   8 minutes ago       Up 8 minutes        127.0.0.1:43185->6443/tcp   kind-external-load-balancer
a8b8cc91893e        kindest/node:v1.18.2           "/usr/local/bin/entr…"   8 minutes ago       Up 6 minutes        127.0.0.1:43397->6443/tcp   kind-control-plane
5076541a963d        kindest/node:v1.18.2           "/usr/local/bin/entr…"   8 minutes ago       Up 6 minutes                                    kind-worker
e64b81636f9a        kindest/node:v1.18.2           "/usr/local/bin/entr…"   8 minutes ago       Up 6 minutes        127.0.0.1:33069->6443/tcp   kind-control-plane2

And then I restarted Docker (the same as rebooting the machine):

$ sudo systemctl stop docker

Result: kind-external-load-balancer disappears, and even if I forcibly rewrite the cluster URL to a control-plane's IP address, Pod deployments stay pending forever.

tsunomur@VM:~$ docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                       NAMES
03dad9ed89f2        kindest/node:v1.18.2   "/usr/local/bin/entr…"   10 minutes ago      Up 5 seconds                                    kind-worker2
cbd3f2c279a8        kindest/node:v1.18.2   "/usr/local/bin/entr…"   10 minutes ago      Up 5 seconds        127.0.0.1:44681->6443/tcp   kind-control-plane3
1531621e9806        kindest/node:v1.18.2   "/usr/local/bin/entr…"   10 minutes ago      Up 5 seconds                                    kind-worker3
a8b8cc91893e        kindest/node:v1.18.2   "/usr/local/bin/entr…"   10 minutes ago      Up 4 seconds        127.0.0.1:43397->6443/tcp   kind-control-plane
5076541a963d        kindest/node:v1.18.2   "/usr/local/bin/entr…"   10 minutes ago      Up 4 seconds                                    kind-worker
e64b81636f9a        kindest/node:v1.18.2   "/usr/local/bin/entr…"   10 minutes ago      Up 5 seconds        127.0.0.1:33069->6443/tcp   kind-control-plane2
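
A minimal diagnostic sketch for the missing load balancer, assuming its container merely exited and was not removed (plain Docker CLI commands):

# check whether the load balancer container exited rather than being removed
docker ps -a --filter name=kind-external-load-balancer

# inspect its logs and try starting it again by hand
docker logs kind-external-load-balancer
docker start kind-external-load-balancer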

Does kind not support restarting the machine?


All 12 comments

we need to know more details, like what version you're using.
kind does restart clusters on the latest version.

it would also be helpful to know if this happens with a simple kind create cluster (no config, no flags) and if so more about what your host environment is like
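
For reference, commands along these lines gather the requested details (a sketch; kind export logs bundles the node logs into a temporary directory):

# report the kind and docker versions plus basic host info
kind --version
docker version
docker info

# collect node logs for debugging (optional but often useful)
kind export logs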

Thank you for your quick reply.

I use 0.8.1:

tsunomur@VM:~$ kind --version
kind version 0.8.1

When I created a simple cluster, the situation was not the same, but some Pods ended up in an Error state.

create cluster and check health

tsunomur@VM:~$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.18.2) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋
tsunomur@VM:~$ k run nginx --image nginx --restart=Never
pod/nginx created
tsunomur@VM:~$ k get po
NAME    READY   STATUS              RESTARTS   AGE
nginx   0/1     ContainerCreating   0          2s
tsunomur@VM:~$ k get po -w
NAME    READY   STATUS              RESTARTS   AGE
nginx   0/1     ContainerCreating   0          3s
nginx   1/1     Running             0          17s
^Ctsunomur@VM:~$ k cluster-info
Kubernetes master is running at https://127.0.0.1:38413
KubeDNS is running at https://127.0.0.1:38413/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
tsunomur@VM:~$ k get componentstatuses
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true"}
tsunomur@VM:~$
tsunomur@VM:~$ docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                       NAMES
be19fb44893d        kindest/node:v1.18.2   "/usr/local/bin/entr…"   2 minutes ago       Up About a minute   127.0.0.1:38413->6443/tcp   kind-control-plane
tsunomur@VM:~$
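
A couple of extra health checks that fit at this point (a sketch; both are standard kubectl commands):

kubectl get nodes -o wide
kubectl get pods --namespace kube-system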

Restart docker and check health

tsunomur@VM:~$ sudo systemctl stop docker
tsunomur@VM:~$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2020-06-23 17:54:16 UTC; 7s ago
     Docs: https://docs.docker.com
  Process: 31153 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 31153 (code=exited, status=0/SUCCESS)

Jun 23 17:15:16 VM dockerd[31153]: time="2020-06-23T17:15:16.225369221Z" level=info msg="API listen on /var/run/docker.sock"
Jun 23 17:15:16 VM systemd[1]: Started Docker Application Container Engine.
Jun 23 17:16:43 VM dockerd[31153]: time="2020-06-23T17:16:43.583494164Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 17:54:02 VM systemd[1]: Stopping Docker Application Container Engine...
Jun 23 17:54:02 VM dockerd[31153]: time="2020-06-23T17:54:02.941309822Z" level=info msg="Processing signal 'terminated'"
Jun 23 17:54:12 VM dockerd[31153]: time="2020-06-23T17:54:12.957486326Z" level=info msg="Container be19fb44893d46e0e7800cd8af414b80fc5d4bccd0d050ce282a685dd93d3735 failed to exit within
Jun 23 17:54:15 VM dockerd[31153]: time="2020-06-23T17:54:15.089736440Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 17:54:16 VM dockerd[31153]: time="2020-06-23T17:54:16.254422134Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=m
Jun 23 17:54:16 VM dockerd[31153]: time="2020-06-23T17:54:16.254886136Z" level=info msg="Daemon shutdown complete"
Jun 23 17:54:16 VM systemd[1]: Stopped Docker Application Container Engine.
tsunomur@VM:~$ sudo systemctl start docker
tsunomur@VM:~$ k cluster-info
Kubernetes master is running at https://127.0.0.1:38413
KubeDNS is running at https://127.0.0.1:38413/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
tsunomur@VM:~$ k get componentstatuses
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true"}
tsunomur@VM:~$ k run nginx-after-restart --image nginx --restart=Never
pod/nginx-after-restart created
tsunomur@VM:~$ k get po
NAME                  READY   STATUS              RESTARTS   AGE
nginx                 0/1     Unknown             0          2m
nginx-after-restart   0/1     ContainerCreating   0          2s

But if only a Pod that isn't managed by a Deployment ends up in Error status, I can just recreate it.

I won't use a multi-node cluster yet.
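
Recreating the errored bare pod from above might look like this (a sketch reusing the commands already shown in this thread):

# delete the bare pod that ended up in Error/Unknown and create it again
kubectl delete pod nginx
kubectl run nginx --image nginx --restart=Never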

yeah, some errored pods are expected; not all things handle the IP switch well, etc.

The cluster not coming back up with multi-node is not expected, though.

what happens if you use:

# a cluster with 3 control-plane nodes and 3 workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker

It's possible we have a bug in the "HA" mode; it's not well tested or used for much currently.
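
One way to exercise that config across a Docker restart (a sketch; the filename kind-1cp-2w.yaml is made up here):

# create the cluster from the config above, bounce Docker, then re-check
kind create cluster --config kind-1cp-2w.yaml
kubectl get nodes
sudo systemctl restart docker
kubectl get nodes
kubectl get pods --all-namespaces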

I tried a cluster with only one control-plane and multiple workers, then restarted dockerd, and it seems to be in good condition.
I'll create only one control-plane from now on.

Thank you.

I think this issue should be re-opened. The problem occurs when more than one control-plane is used. I could reproduce it easily using this config (kind v0.8.1, docker 19.03.11-ce):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
$ docker ps -a
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS                     PORTS                       NAMES
9086b0999d6a        kindest/haproxy:2.1.1-alpine   "/docker-entrypoint.…"   5 minutes ago       Exited (0) 2 minutes ago                               kind-external-load-balancer
938b62548187        kindest/node:v1.18.2           "/usr/local/bin/entr…"   5 minutes ago       Up About a minute          127.0.0.1:39575->6443/tcp   kind-control-plane
d665bd9e5fe3        kindest/node:v1.18.2           "/usr/local/bin/entr…"   5 minutes ago       Up About a minute          127.0.0.1:34927->6443/tcp   kind-control-plane2
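
The reproduction is just the cluster creation followed by a Docker restart (a sketch; two-cp.yaml is a made-up name for the config above):

kind create cluster --config two-cp.yaml
sudo systemctl restart docker
docker ps -a                               # kind-external-load-balancer stays Exited
kubectl cluster-info --context kind-kind   # likely fails, since the load balancer's published port is gone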

I don't think 2 control planes is valid in kubeadm @Rolinh, only 3? I thought we validated this but we must not.

That said, it does seem we have a bug here with multiple control planes.

I'm going to interject a brief note: I _highly_ recommend testing with a single node cluster unless you have strong evidence that multi-node is relevant, doubly so for multi-control plane.

@BenTheElder fwiw, the issue is the same with 3 control planes.

I'm going to interject a brief note: I highly recommend testing with a single node cluster unless you have strong evidence that multi-node is relevant, doubly so for multi-control plane.

Would you mind expanding on this? Why is this a problem? I've been testing with clusters of up to 50 nodes without issues so far, except upon a docker service restart (or machine reboot). As a single control-plane is sufficient, I'll stick to that, but I do need to test things in multi-node clusters.
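
For what it's worth, a config that large can be generated rather than written by hand; a sketch (the file name is made up):

# emit one control-plane entry and 49 worker entries, then create the cluster
{
  echo 'kind: Cluster'
  echo 'apiVersion: kind.x-k8s.io/v1alpha4'
  echo 'nodes:'
  echo '- role: control-plane'
  for i in $(seq 1 49); do echo '- role: worker'; done
} > kind-50-node.yaml

kind create cluster --config kind-50-node.yaml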

50 nodes? Cool! That's actually the largest single kind cluster I've heard of so far :-)

Many (most?) apps are unlikely to gain anything testing-wise from multiple nodes, but running multi-node kind clusters overcommits the hardware (each node reports having the full host resources) while adding more overhead.

The "HA" mode is not actually HA due to etcd and due to running on top of one physical host ... it is somewhat useful for certain things where multiple api-servers matters.

Similarly, multi-node is used for testing where multi-node rolling behavior matters (we test kubernetes itself with 1 control plane and 2 workers, typically); outside of that it's just extra complexity and overhead.

50 nodes? Cool! That's actually the largest single kind cluster I've heard of so far :-)

I've tried to push it further just out of curiosity but a 100 nodes cluster attempt brought my machine down to its knees with a ridiculous 2500+ load average at some point :grin:

I work on Cilium (so I use kind with the Cilium CNI), at the moment more specifically on Hubble Relay for cluster-wide observability, and being able to test things in a local multi-node cluster is just amazing. I used to have to run multiple VMs, which was a much heavier process. We're also able to test things like cluster mesh with kind. We also recently introduced kind as part of our CI to run smoke tests.

cool, that's definitely one of those apps that will benefit from multi-node :-)
we see a lot of people going a bit nuts with nodes to run web-app-like services that don't benefit from this 😅

tracking the HA restart issue with a bug here https://github.com/kubernetes-sigs/kind/issues/1689
closing this one, but will continue responding to comments 😅
