What happened:
Running kind to create an HA cluster like the one found here (except with 2 control-planes instead of 3)
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
What you expected to happen:
Cluster to get created and come up
How to reproduce it (as minimally and precisely as possible):
kind create cluster --retain --loglevel trace --config "./kind-cluster.yaml" --wait 5m;
Anything else we need to know?:
Creating a single control-plane cluster works fine on this machine. Deleted and recreated several times to verify.
Debug logging output:
[addons] Applied essential addon: kube-proxy
I0719 17:37:45.225992 142 loader.go:359] Config loaded from file: /etc/kubernetes/admin.conf
I0719 17:37:45.226805 142 loader.go:359] Config loaded from file: /etc/kubernetes/admin.conf
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
You can now join any number of control-plane nodes by copying certificate authorities
and service account keys on each node and then running the following as root:
kubeadm join 172.17.0.3:6443 --token <value withheld> \
--discovery-token-ca-cert-hash sha256:e8f007ca6d45412c838744e330cb1516774f0dac8593f1588b90b33d3a248a57 \
--experimental-control-plane
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 172.17.0.3:6443 --token <value withheld> \
--discovery-token-ca-cert-hash sha256:e8f007ca6d45412c838744e330cb1516774f0dac8593f1588b90b33d3a248a57
DEBU[12:37:45] Running: /snap/bin/docker [docker inspect -f {{(index (index .NetworkSettings.Ports "6443/tcp") 0).HostPort}} kind-external-load-balancer]
DEBU[12:37:45] Running: /snap/bin/docker [docker exec --privileged kind-control-plane cat /etc/kubernetes/admin.conf]
✓ Starting control-plane 🕹️
DEBU[12:37:45] Running: /snap/bin/docker [docker exec --privileged kind-control-plane cat /kind/manifests/default-cni.yaml]
DEBU[12:37:45] Running: /snap/bin/docker [docker exec --privileged -i kind-control-plane kubectl create --kubeconfig=/etc/kubernetes/admin.conf -f -]
✓ Installing CNI 🔌
DEBU[12:37:46] Running: /snap/bin/docker [docker exec --privileged -i kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf apply -f -]
✓ Installing StorageClass 💾
DEBU[12:37:46] Running: /snap/bin/docker [docker exec --privileged kind-control-plane2 mkdir -p /etc/kubernetes/pki/etcd]
DEBU[12:37:46] Running: /snap/bin/docker [docker cp kind-control-plane:/etc/kubernetes/pki/ca.crt /tmp/864842991/ca.crt]
✗ Joining more control-plane nodes 🎮
Error: failed to create cluster: failed to copy certificate ca.crt: exit status 1
Environment:
kind version: v0.4.0
kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-21T13:09:06Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
docker info:
Containers: 5
Running: 5
Paused: 0
Stopped: 0
Images: 19
Server Version: 18.06.1-ce
Storage Driver: aufs
Root Dir: /var/snap/docker/common/var-lib-docker/aufs
Backing Filesystem: extfs
Dirs: 129
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: N/A (expected: 69663f0bd4b60df09991c08812a60108003fa340)
init version: 949e6fa (expected: fec3683)
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.0.0-20-generic
Operating System: Ubuntu Core 16
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 62.84GiB
Name: codor
ID: RR2C:GHT4:VNPO:ZXHC:RWW4:YYDG:OPA3:Y53B:WTBZ:23C3:AHLB:UWLN
Docker Root Dir: /var/snap/docker/common/var-lib-docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 53
Goroutines: 62
System Time: 2019-07-19T12:47:49.623878746-05:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
hub.home.local
ubuntu:32000
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
/etc/os-release:
NAME="Ubuntu"
VERSION="19.04 (Disco Dingo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 19.04"
VERSION_ID="19.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=disco
UBUNTU_CODENAME=disco
2 is invalid in kubeadm IIRC?
Ah, I hadn't seen that in the documentation yet... this was my first attempt to spin up an HA cluster. I'll try it again with 3 control planes like the example and make sure that works.
Assuming that's my problem, it would be nice to validate the configuration and throw a friendly error when people try to do this (assuming I'm not the only one to ever do it). I'm not really a go guy but I may see if I can figure something out and submit a pull request as penance for my not reading the documentation.
Thanks!
Yes, sorry about that. I'm digging around trying to find something authoritative on the kubeadm side of things; I can't actually find anything now, but I could have sworn that it only supports 3 control planes specifically.
If/when we get that confirmed we should document it in the kind docs and validate. If it's not true, then we have some bug here to fix :sweat_smile:
cc @neolit123 @fabriziopandini
So we should validate that this is odd, but we probably also need to look at getting some docs to clarify this upstream.
The cert copy issue may be unrelated to the number, however; I'll circle back to this one as I'm triaging a few other issues.
I am pretty sure a 2-control-plane, 3-worker kind cluster worked for me recently. It should create fine and work, but a 2-member etcd can't tolerate losing a member; you need 3 control planes for that.
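(For context, not part of the original comment: that is the standard raft quorum arithmetic.)
quorum(n) = floor(n/2) + 1
quorum(2) = 2  -> losing either of the 2 members loses quorum, so no fault tolerance
quorum(3) = 2  -> one member can fail and etcd keeps serving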
I just tried a config with 3 control-planes and got the same error.
It is worth noting I might be trying to do too much on my dev machine. I'm using microk8s on this box for my development; I was really just using kind to spin up a quick larger cluster for testing some scaling behavior.
But I'll try to dig into this a bit.
You could be hitting resource limits on the machine. For instance, on a 16 GB VM I cannot run more than 3 control planes. Yet I do not see copy errors; instead the 4th apiserver starts crashlooping.
I tracked it down, but I will need some guidance on how to fix it correctly if you'd like a pull request. As mentioned, my Go skills are near non-existent, so if it's easier for one of you to make this change I won't be offended; hopefully the below helps...
TL;DR: I'm running docker from a snap, so docker doesn't have access to the host's /tmp directory that kind uses to copy around certs, and the docker cp <container>:/... /tmp/... fails. It looks like kind needs to detect whether docker is installed as a snap and, if so, use a different temp directory.
I found some helpful info about snaps and directories here, including this command: snap run --shell docker.docker to get a shell where you can get the SNAP variables with env | grep SNAP.
I ~~hacked~~ skillfully changed the temp directory as follows and was able to spin up a cluster with 2 or 3 control planes (though I'm guessing the 2 control planes aren't really HA since etcd uses raft):
diff --git a/pkg/cluster/internal/create/actions/kubeadmjoin/join.go b/pkg/cluster/internal/create/actions/kubeadmjoin/join.go
index 5283d98..b1b26e2 100644
--- a/pkg/cluster/internal/create/actions/kubeadmjoin/join.go
+++ b/pkg/cluster/internal/create/actions/kubeadmjoin/join.go
@@ -31,7 +31,6 @@ import (
"sigs.k8s.io/kind/pkg/cluster/nodes"
"sigs.k8s.io/kind/pkg/concurrent"
"sigs.k8s.io/kind/pkg/exec"
- "sigs.k8s.io/kind/pkg/fs"
)
// Action implements action for creating the kubeadm join
@@ -145,13 +144,9 @@ func runKubeadmJoinControlPlane(
// creates a temporary folder on the host that should acts as a transit area
// for moving necessary cluster certificates
- tmpDir, err := fs.TempDir("", "")
- if err != nil {
- return err
- }
- defer os.RemoveAll(tmpDir)
+ var tmpDir = "/home/jon/snap/docker/current/tmp"
- err = os.MkdirAll(filepath.Join(tmpDir, "/etcd"), os.ModePerm)
+ var err = os.MkdirAll(filepath.Join(tmpDir, "/etcd"), os.ModePerm)
if err != nil {
return err
}
@@ -170,11 +165,11 @@ func runKubeadmJoinControlPlane(
tmpPath := filepath.Join(tmpDir, fileName)
// copies from bootstrap control plane node to tmp area
if err := controlPlaneHandle.CopyFrom(containerPath, tmpPath); err != nil {
- return errors.Wrapf(err, "failed to copy certificate %s", fileName)
+ return errors.Wrapf(err, "failed to copy certificate %s:%s from %s to %s", controlPlaneHandle, fileName, containerPath, tmpPath)
}
// copies from tmp area to joining node
if err := node.CopyTo(tmpPath, containerPath); err != nil {
- return errors.Wrapf(err, "failed to copy certificate %s", fileName)
+ return errors.Wrapf(err, "failed to copy certificate %s:%s from %s to %s", node, fileName, tmpPath, containerPath)
}
}
so ... the problem is that the folder created by tmpDir, err := fs.TempDir is not accessible within snap?
@BenTheElder do you think this is something to be fixed in https://golang.org/src/os/file.go?s=10008:10029#L319 ?
so ... the problem is that the folder created by tmpDir, err := fs.TempDir is not accessible within snap?
Yeah, snap's model is to sandbox each installed application unless the snap component is distributed with classic confinement. But docker uses the strict confinement policy.
@BenTheElder do you think this is something to be fixed in https://golang.org/src/os/file.go?s=10008:10029#L319 ?
It feels like /tmp is the correct thing for Go's os API to return, and certainly in this situation where kind is asking for a temp directory... I can't imagine a way of changing TempDir() to account for docker (not kind) running in a snap.
One easy check to figure this out: which docker returns /snap/bin/docker for me. I'm not 100% sure whether snap on different platforms like Fedora uses the same directory, but I would assume so. So the easy fix options seem to be:
1) Detect if docker is running from /snap/bin/docker and change the temp path behavior in that case
2) Add a command-line boolean option docker-snap and add a note to documentation/readme
3) Add a command-line option temp-path and document this snap case where it would be useful (this might be useful elsewhere, but probably not imho)
"1" seems like the better option to me.
snap detection can happen in:
https://github.com/kubernetes-sigs/kind/blob/master/pkg/fs/fs.go#L31
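A minimal sketch of what that detection could look like (hypothetical code, not kind's actual implementation; it assumes the snap-packaged docker always resolves under /snap/ and that $HOME is readable by a strictly confined snap):

package fs

import (
	"io/ioutil"
	"os"
	"os/exec"
	"strings"
)

// TempDir is a hypothetical snap-aware variant of the existing helper: if
// docker resolves to the snap-packaged binary, stage files under $HOME
// (which strict confinement exposes) instead of the host's /tmp (which it
// does not).
func TempDir(dir, prefix string) (string, error) {
	dockerPath, err := exec.LookPath("docker")
	if err == nil && dir == "" && strings.HasPrefix(dockerPath, "/snap/") {
		return ioutil.TempDir(os.Getenv("HOME"), prefix)
	}
	return ioutil.TempDir(dir, prefix)
}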
Ah, so Docker installed via snap is listed under known issues, and the entry mentions TMPDIR=$HOME as an option that requires no modifications. https://kind.sigs.k8s.io/docs/user/known-issues/#docker-installed-with-snap
We don't need to use a temp directory to copy these in the first place and we should consider refactoring that, but in the meantime:
TMPDIR works, and we have documented this (perhaps we need to make it more visible though?). It is also standard on UNIX / POSIX and many tools respect it. The functionality comes from the Go standard library; we have our own method that wraps it to internally deal with a minor oddity of the macOS platform...
Note that we _do_ need a tempdir for building node images, so if you do that you will find the same issue there. Otherwise we've avoided this.
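For reference, this is roughly why exporting TMPDIR works: on Unix, Go's os.TempDir returns $TMPDIR when it is set and only falls back to /tmp otherwise, so the staging directory can be moved somewhere the snap-confined docker can read without any code change. A tiny illustration:

package main

import (
	"fmt"
	"os"
)

func main() {
	// With TMPDIR unset this prints /tmp on Linux; with TMPDIR=$HOME it
	// prints the home directory instead.
	fmt.Println(os.TempDir())
}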
I'd close this ticket, as the issue is documented in https://github.com/kubernetes-sigs/kind/blob/master/site/content/docs/user/known-issues.md#docker-installed-with-snap,
and extend the issue template to ask users to have a look at the list of known issues before reporting.
We should still prevent and/or detect this on failure; we could avoid the staging tempdir for this usage at least, but for the node build we can't.
We could start on https://github.com/kubernetes-sigs/kind/issues/39 and detect this in the failure diagnostics and point to the docs automatically... (so it stays out of the "happy path"). I'm still discussing designs for that though...
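As an illustration of what skipping the host-side staging directory could look like (just a sketch of the idea, not the actual kind change; it assumes the node image has cp and that the docker CLI is on PATH), the file can be streamed from one node container to the other:

package main

import (
	"bytes"
	"log"
	"os/exec"
)

// copyBetweenNodes copies a file from one kind node container to another
// without touching the host filesystem, so no host temp directory is needed.
func copyBetweenNodes(srcNode, dstNode, path string) error {
	// Read the file out of the source node into memory (the certs are tiny).
	data, err := exec.Command("docker", "exec", srcNode, "cat", path).Output()
	if err != nil {
		return err
	}
	// Write it into the destination node via stdin.
	write := exec.Command("docker", "exec", "-i", dstNode, "cp", "/dev/stdin", path)
	write.Stdin = bytes.NewReader(data)
	return write.Run()
}

func main() {
	// Node names are only for illustration.
	if err := copyBetweenNodes("kind-control-plane", "kind-control-plane2",
		"/etc/kubernetes/pki/ca.crt"); err != nil {
		log.Fatal(err)
	}
}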
Ah, yep, setting TMPDIR is super simple so I'll just set that up in my scripts... thanks.
And yeah, I had looked at the known issues, but I was fixating on the error message, and the way docker-installed-with-snap is written up on the Known Issues page didn't jump out at me. Maybe it would be helpful to add some detail to the error message in this case, like below? In its current form the error doesn't mention the temp path (though the debug output does).
Thanks for the quick help on this and feel free to close this issue at your convenience.
tmpPath := filepath.Join(tmpDir, fileName)
// copies from bootstrap control plane node to tmp area
if err := controlPlaneHandle.CopyFrom(containerPath, tmpPath); err != nil {
- return errors.Wrapf(err, "failed to copy certificate %s", fileName)
+ return errors.Wrapf(err, "failed to copy certificate %s:%s from %s to %s", controlPlaneHandle, fileName, containerPath, tmpPath)
}
// copies from tmp area to joining node
if err := node.CopyTo(tmpPath, containerPath); err != nil {
- return errors.Wrapf(err, "failed to copy certificate %s", fileName)
+ return errors.Wrapf(err, "failed to copy certificate %s:%s from %s to %s", node, fileName, tmpPath, containerPath)
}
}
/close
I think that logging has improved a lot in newer versions thanks to Ben
@aojea: Closing this issue.
the tempdir will be gone in the next release https://github.com/kubernetes-sigs/kind/pull/1023