Sometimes there is a panic when using kind to create an HA cluster.
[zhang@localhost kind]$ kind create cluster --config kind-config-ha.yaml
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.13.4) 🖼
✗ Preparing nodes 📦📦📦📦📦📦
ERRO[17:36:55] timed out waiting for docker to be ready on node kind-control-plane
panic: send on closed channel
goroutine 11 [running]:
sigs.k8s.io/kind/pkg/cluster/internal/create.createNodeContainers.func1(0xc00008a300, 0xc000397300, 0x1d,
0xc0002c0000, 0xc000117bc0)
/home/zhang/go/src/sigs.k8s.io/kind/pkg/cluster/internal/create/nodes.go:114 +0x9c
created by sigs.k8s.io/kind/pkg/cluster/internal/create.createNodeContainers
/home/zhang/go/src/sigs.k8s.io/kind/pkg/cluster/internal/create/nodes.go:105 +0x2e0
Maybe the channel has already been closed.
https://github.com/kubernetes-sigs/kind/blob/master/pkg/cluster/internal/create/nodes.go#L96-L135
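For context, here is a minimal, self-contained Go sketch (not kind's actual code) of how this kind of panic can happen: worker goroutines are still trying to send their results when a timeout path closes the channel they write to.

package main

import (
	"errors"
	"time"
)

func main() {
	results := make(chan error)

	// One worker per node being provisioned (illustrative only).
	for i := 0; i < 3; i++ {
		go func() {
			time.Sleep(2 * time.Second) // simulates a slow "wait for docker"
			results <- errors.New("timed out waiting for docker")
		}()
	}

	// A timeout path that closes the channel while workers are still running.
	time.Sleep(1 * time.Second)
	close(results)

	// Any worker that finishes after close() panics with
	// "panic: send on closed channel".
	time.Sleep(2 * time.Second)
}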
/cc @BenTheElder
@tao12345666333
can you please share your config file and the system specification?
/kind bug
/priority important-soon
I hit the same issue several times, mainly when using a large number of nodes
ERRO[17:36:55] timed out waiting for docker to be ready on node kind-control-plane
The problem is that the current timeout is set to 30 seconds
https://github.com/kubernetes-sigs/kind/blob/f5fe35507a94031d8bf5221da61c179da98a32e0/pkg/cluster/internal/create/nodes.go#L161
Bumping the value solved the problem for me, but I assumed it was just slowness in my local environment.
Should we bump the timeout value?
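As a rough illustration (not kind's actual code), a bounded readiness wait like the one linked above might look something like the sketch below; waitForDockerReady and dockerIsActive are hypothetical names used only for this example.

package main

import (
	"fmt"
	"time"
)

// dockerIsActive is a stand-in for however readiness is actually checked,
// e.g. running `systemctl is-active docker` inside the node container.
func dockerIsActive(node string) bool {
	return false // placeholder for the real check
}

// waitForDockerReady polls until docker is ready or the deadline passes.
func waitForDockerReady(node string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if dockerIsActive(node) {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for docker to be ready on node %s", node)
}

func main() {
	// Bumping the bound from 30s to 60s only changes this argument.
	if err := waitForDockerReady("kind-control-plane", 60*time.Second); err != nil {
		fmt.Println(err)
	}
}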
Should we bump the timeout value?
that might be it; it's fairly slow on my machine as well.
let's wait on @BenTheElder to comment.
can you please share your config file and the system specification?
The config file:
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
The system info:
[zhang@localhost kind]$ uname -a
Linux localhost 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[zhang@localhost kind]$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[zhang@localhost kind]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.5G        621M        1.9G        121M        5.0G        6.3G
Swap:          7.7G        153M        7.6G
[zhang@localhost kind]$ uptime
18:28:02 up 34 days, 4:18, 1 user, load average: 0.00, 0.05, 0.15
[zhang@localhost kind]$ docker version
Client:
Version: 18.09.2
API version: 1.39
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 04:13:27 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 03:47:25 2019
OS/Arch: linux/amd64
Experimental: false
do you experience this with single-node clusters? I think the issue is partly that 5 nodes need something like 10GB of RAM, so running that many nodes is swapping like crazy.
we might need an arbitrarily large timeout for this, but we do want some sort of timeout to prevent CI hanging 🤔
what did you all need to bump it to? I suspect it will grow as your containers are increasingly swap-backed (i.e. if you add more nodes to an overloaded machine).
why don't we implement a simple backoff algorithm with the max number of retries equal to the number of nodes / a constant (2, for example)?
why don't we implement a simple backoff algorithm with the max number of retries equal to the number of nodes / a constant (2, for example)?
possibly we should not do that, because it's not actually related to the number of nodes directly; it's related to how overloaded your machine is. there's probably a better option 🤔
we also don't want actual failures to take excessively long.
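For reference, a minimal sketch of the kind of exponential backoff proposed above; the retry count and delays are illustrative only, not anything kind actually uses.

package main

import (
	"fmt"
	"time"
)

// waitWithBackoff retries check() up to maxRetries times, doubling the
// delay between attempts. It returns true if check() ever succeeds.
func waitWithBackoff(maxRetries int, baseDelay time.Duration, check func() bool) bool {
	delay := baseDelay
	for i := 0; i < maxRetries; i++ {
		if check() {
			return true
		}
		time.Sleep(delay)
		delay *= 2
	}
	return false
}

func main() {
	start := time.Now()
	ok := waitWithBackoff(5, time.Second, func() bool {
		// Stand-in for "is docker ready on this node?"
		return time.Since(start) > 10*time.Second
	})
	fmt.Println("ready:", ok)
}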
we have more similar panics; I split those out to #406 and am using this issue to track the docker timeout.
we may actually just have to do the backoff algorithm, but I suspect we have a more general problem with the behavior when users create more nodes than will reliably fit on their machine (which also may be helped by trying to lighten the load)
Should not panic anymore at least. Timeout still needs thought / changes.
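As an illustration of one common way to avoid this class of panic (an assumption about the general pattern, not a claim about kind's actual fix): workers select on a cancellation channel, so they can drop their result instead of sending on a channel the timeout path has abandoned.

package main

import (
	"fmt"
	"time"
)

func main() {
	results := make(chan error)
	done := make(chan struct{})

	for i := 0; i < 3; i++ {
		go func(id int) {
			time.Sleep(2 * time.Second) // slow node provisioning
			select {
			case results <- fmt.Errorf("node %d failed", id):
			case <-done: // creation already gave up; discard the result
			}
		}(i)
	}

	select {
	case err := <-results:
		fmt.Println("first error:", err)
	case <-time.After(1 * time.Second):
		fmt.Println("timed out waiting for nodes")
	}
	close(done) // lets any remaining workers exit without panicking
	time.Sleep(3 * time.Second)
}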
What values are working for your usage?
do you experience this with single node clusters?
[zhang@localhost ~]$ time kind create cluster --name moelove
Creating cluster "moelove" ...
✓ Ensuring node image (kindest/node:v1.13.4) 🖼
✓ Preparing nodes 📦
✓ Creating kubeadm config 📜
✓ Starting control-plane 🕹️
Cluster creation complete. You can now use the cluster with:
export KUBECONFIG="$(kind get kubeconfig-path --name="moelove")"
kubectl cluster-info
real 0m52.172s
user 0m0.729s
sys 0m0.564s
In fact, my intention in opening the issue was the panic, not the timeout, although the timeout is also a problem. :smile_cat:
Should not panic anymore at least.
+1
What values are working for your usage?
I changed it to 50s and it works fine for me.
(But whether it times out is random when the timeout is 30s.)
I've been running with a 60-second timeout without problems since last week.
perhaps let's bump it to 60s for now, add a TODO, and come back to it then 🤔
or maybe we could add a flag/config to change it?
-1 to more flags! :P
setting this in either is going to be brittle since the value is not portable. the only reason we have a bound at all is to avoid an indefinite hang; at some point this value becomes quite unreasonable :sweat_smile: (e.g. 1 hour would be pretty ridiculous)
hah I agree with you.
Today I may test how long creation takes on differently configured machines; I hope to provide some suggestions from that.