Kind: docker readiness timeout may be too low on overloaded machines

Created on 26 Mar 2019  ·  18 Comments  ·  Source: kubernetes-sigs/kind

Sometimes there is a panic when using kind to create an HA cluster.

[zhang@localhost kind]$ kind create cluster  --config kind-config-ha.yaml           
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.13.4) 🖼
 ✗ Preparing nodes 📦📦📦📦📦
ERRO[17:36:55] timed out waiting for docker to be ready on node kind-control-plane                       
panic: send on closed channel

goroutine 11 [running]:
sigs.k8s.io/kind/pkg/cluster/internal/create.createNodeContainers.func1(0xc00008a300, 0xc000397300, 0x1d,
0xc0002c0000, 0xc000117bc0)
        /home/zhang/go/src/sigs.k8s.io/kind/pkg/cluster/internal/create/nodes.go:114 +0x9c             
created by sigs.k8s.io/kind/pkg/cluster/internal/create.createNodeContainers                             
        /home/zhang/go/src/sigs.k8s.io/kind/pkg/cluster/internal/create/nodes.go:105 +0x2e0

Maybe the channel has already been closed by the time the goroutine sends on it.

https://github.com/kubernetes-sigs/kind/blob/master/pkg/cluster/internal/create/nodes.go#L96-L135
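For context, Go panics with "send on closed channel" whenever any goroutine sends after the channel has been closed. A minimal sketch of the safe ordering, closing only after every sender has finished (illustrative only, not kind's actual code):

package main

import (
	"fmt"
	"sync"
)

func main() {
	errCh := make(chan error)
	var wg sync.WaitGroup

	// Each "node" goroutine reports an error on errCh.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			errCh <- fmt.Errorf("node %d timed out", id)
		}(i)
	}

	// Close errCh strictly after all senders are done; closing it earlier
	// (e.g. as soon as the first error arrives) while another worker still
	// wants to report is exactly what produces this panic.
	go func() {
		wg.Wait()
		close(errCh)
	}()

	for err := range errCh {
		fmt.Println(err)
	}
}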

/cc @BenTheElder

kind/bug priority/important-soon

Most helpful comment

perhaps let's bump it to 60s for now, add a TODO, and come back to it then 🤔

All 18 comments

@tao12345666333
can you please share your config file and the system specification?

/kind bug
/priority important-soon

I hit the same issue several times, mainly when using a large number of nodes.

ERRO[17:36:55] timed out waiting for docker to be ready on node kind-control-plane

The problem is that the current timeout is set to 30 seconds:
https://github.com/kubernetes-sigs/kind/blob/f5fe35507a94031d8bf5221da61c179da98a32e0/pkg/cluster/internal/create/nodes.go#L161
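The shape of that logic is essentially a readiness probe raced against a fixed deadline. A minimal self-contained sketch of the pattern (the probe command and helper names here are my own assumptions for illustration, not kind's actual code):

package main

import (
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// waitUntilReady polls probe once per second until it succeeds or the
// deadline passes.
func waitUntilReady(probe func() bool, timeout time.Duration) error {
	deadline := time.After(timeout)
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case <-deadline:
			return errors.New("timed out waiting for docker to be ready")
		case <-tick.C:
			if probe() {
				return nil
			}
		}
	}
}

func main() {
	// Hypothetical probe: `docker info` run inside the node container
	// succeeds once the inner docker daemon is up.
	probe := func() bool {
		return exec.Command("docker", "exec", "kind-control-plane", "docker", "info").Run() == nil
	}
	if err := waitUntilReady(probe, 30*time.Second); err != nil {
		fmt.Println(err)
	}
}

Whatever the constant is, it only moves the deadline; on an overloaded machine the probe can still lose the race.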

Bumping the value solved the problem for me, but I assumed it was just slowness in my local environment.

Should we bump the timeout value?

Should we bump the timeout value?

That might be it; it's fairly slow on my machine as well.
Let's wait for @BenTheElder to comment.

can you please share your config file and the system specification?

The config file:


kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker

The system info:

[zhang@localhost kind]$ uname -a
Linux localhost 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[zhang@localhost kind]$ cat /etc/redhat-release 
CentOS Linux release 7.5.1804 (Core)
[zhang@localhost kind]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.5G        621M        1.9G        121M        5.0G        6.3G
Swap:          7.7G        153M        7.6G
[zhang@localhost kind]$ uptime 
 18:28:02 up 34 days,  4:18,  1 user,  load average: 0.00, 0.05, 0.15
[zhang@localhost kind]$ docker version
Client:
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.6
 Git commit:        6247962
 Built:             Sun Feb 10 04:13:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:47:25 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Do you experience this with single-node clusters? I think the issue is partly that 5 nodes probably need something like 10GB of RAM, so running that many nodes on this machine is swapping like crazy.

We might need an arbitrarily large timeout for this, but we do want some sort of timeout to prevent CI from hanging 🤔

What did you all need to bump it to? I suspect the required value will grow as your containers become increasingly swap-backed (i.e. if you add more nodes to an already overloaded machine).

Why don't we implement a simple backoff algorithm, with the maximum number of retries equal to the number of nodes divided by a constant (2, for example)?

Why don't we implement a simple backoff algorithm, with the maximum number of retries equal to the number of nodes divided by a constant (2, for example)?

Possibly we should not do that, because it's not actually related to the number of nodes directly; it's related to how overloaded your machine is. There's probably a better option 🤔

we also don't want actual failures to take excessively long.
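To make the backoff suggestion concrete, here is a sketch of a capped exponential backoff (illustrative only, not kind's implementation); capping both the per-attempt delay and the attempt count keeps genuine failures from taking excessively long:

package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff retries op with exponentially growing waits, capped at
// maxDelay, and gives up after maxAttempts.
func retryWithBackoff(op func() error, maxAttempts int, baseDelay, maxDelay time.Duration) error {
	delay := baseDelay
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("still failing after %d attempts: %v", maxAttempts, err)
}

func main() {
	attempts := 0
	err := retryWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("docker not ready yet")
		}
		return nil
	}, 5, time.Second, 10*time.Second)
	fmt.Println("attempts:", attempts, "err:", err)
}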

We have more similar panics; I split those out to #406 and will use this issue to track the docker timeout.

we may actually just have to do the backoff algorithm, but I suspect we have a more general problem with the behavior when users create more nodes than will reliably fit on their machine (which also may be helped by trying to lighten the load)

Should not panic anymore at least. Timeout still needs thought / changes.

What values are working for your usage?

do you experience this with single node clusters?

[zhang@localhost ~]$ time kind create cluster --name moelove
Creating cluster "moelove" ...
 ✓ Ensuring node image (kindest/node:v1.13.4) 🖼
 ✓ Preparing nodes 📦
 ✓ Creating kubeadm config 📜
 ✓ Starting control-plane 🕹️
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="moelove")"
kubectl cluster-info

real    0m52.172s
user    0m0.729s
sys     0m0.564s

In fact, my intention in raising the issue was the panic, not the timeout, although that is also a problem. :smile_cat:

Should not panic anymore at least.

+1

What values are working for your usage?

I changed it to 50s and it works fine for me.
(But with the 30s timeout, whether it actually times out is random.)

I've been working with a 60-second timeout without problems since last week.

perhaps let's bump it to 60s for now, add a TODO, and come back to it then 🤔

or maybe we could add a flag/config to change it?

-1 to more flags! :P

Setting this in either a flag or config is going to be brittle, since the value is not portable. The only reason we have a bound at all is to avoid an indefinite hang; at some point this value becomes quite unreasonable :sweat_smile: (e.g. 1 hour would be pretty ridiculous).

Hah, I agree with you.
Today I may test the timing on differently configured machines; I hope to provide some suggestions.
