Kind: create HA cluster is flaky

Created on 4 Jun 2019  ·  62 Comments  ·  Source: kubernetes-sigs/kind

What happened:
i started seeing odd failures in the kind-master and -1.14 kubeadm jobs:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-1.14

after switching to this HA config:

# a cluster with 3 control-planes and 3 workers
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
cluster creation then fails during the join phase with output like:

I0604 19:15:09.075770     760 join.go:480] [preflight] Retrieving KubeConfig objects
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
I0604 19:15:10.310249     760 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 1233 milliseconds
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized 
 ✗ Joining more control-plane nodes 🎮
DEBU[22:15:10] Running: /usr/bin/docker [docker ps -q -a --no-trunc --filter label=io.k8s.sigs.kind.cluster --format {{.Names}}\t{{.Label "io.k8s.sigs.kind.cluster"}} --filter label=io.k8s.sigs.kind.cluster=kind] 
$KUBECONFIG is still set to use /home/lubo-it/.kube/kind-config-kind even though that file has been deleted, remember to unset it
DEBU[22:15:10] Running: /usr/bin/docker [docker rm -f -v kind-control-plane2 kind-control-plane kind-control-plane3 kind-worker kind-worker3 kind-worker2 kind-external-load-balancer] 
⠈⠁ Joining more control-plane nodes 🎮 Error: failed to create cluster: failed to join a control plane node with kubeadm: exit status 1

What you expected to happen:
no errors.

How to reproduce it (as minimally and precisely as possible):

cd kind-src-path
GO111MODULE=on go build
# install the kind binary to PATH
cd kubernetes-src-path
kind build node-image --kube-root=$(pwd)
kind create cluster --config=<path-to-above-ha-config> --image kindest/node:latest
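
between attempts it helps to clean up the failed cluster and the stale KUBECONFIG that the error output warns about; a rough sketch, assuming the default cluster name "kind":

# remove whatever node containers were left behind by the failed create
kind delete cluster --name kind
# the shell may still point at the now-deleted kubeconfig file
unset KUBECONFIG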

Anything else we need to know?:

  • i cannot reproduce the bug without --loglevel=debug.
  • sometimes it fails while joining the extra CP nodes, sometimes while joining the workers.

Environment:

  • kind version: (use kind version): master at 43bf0e2594db
  • Kubernetes version: master at 1409ff38e5828f55
  • Docker version: (use docker info):
Containers: 10
 Running: 7
 Paused: 0
 Stopped: 3
Images: 128
Server Version: 18.06.3-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: a592beb5bc4c4092b1b1bac971afed27687340c5
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-41-generic
Operating System: Ubuntu 17.10
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.66GiB
Name: luboitvbox
ID: K2H6:2I6N:FSBZ:S77V:R5CQ:X22B:VYTF:WZ4R:UIKC:HGOT:UCHD:GCR2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="17.10 (Artful Aardvark)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 17.10"
VERSION_ID="17.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=artful
UBUNTU_CODENAME=artful

/kind bug
/priority important-soon

Labels: kind/bug, lifecycle/active, priority/important-soon

All 62 comments

seems to work with 3e5f29995ec76bb
possible regression in https://github.com/kubernetes-sigs/kind/pull/585

EDIT: never mind, was able to reproduce it with v0.3.0 (flakes?)
EDIT2: v0.2.0 seems to work consistently.

/assign @BenTheElder

@BenTheElder i'm going to switch the kubeadm jobs to use a stable kind version (kubetest/kind needs update to pin a newer stable than 0.1.0):
https://github.com/kubernetes/test-infra/pull/12883

possibly the sig-testing kind jobs can keep using kind from master.

I can't reproduce it nor find a similar error in the jobs, can you give me a link to some logs? I'm not familiar with testgrid and prow 😅

I0604 19:15:10.310249     760 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 1233 milliseconds
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized 

The error is 401 Unauthorized 🤔

this:

error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized

happened here:

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1135916134365663234/build-log.txt

it's flaky and i'm able to reproduce it locally with --loglevel=debug. happens sometimes.
make sure that you use the HA config too.
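
for reference, the loop i'm using to trigger it locally is roughly this (the config path is an example):

# re-create the HA cluster until a join flakes
for i in $(seq 1 10); do
  kind delete cluster || true
  kind create cluster --config ha.yaml --image kindest/node:latest --loglevel=debug || break
done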

some of the other failures here, however are different:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master

e.g.
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136007475389009922/build-log.txt

the e2e suite passes:

SUCCESS! -- 182 Passed | 0 Failed | 0 Pending | 4230 Skipped

but then kind export logs times out:

time="20:51:10" level=debug msg="Running: /usr/bin/docker [docker exec --privileged kind-kubetest-external-load-balancer tar --hard-dereference -C /var/log -chf - .]"
2019/06/04 21:02:51 process.go:199: Abort after 30m0s timeout during kind export logs /logs/artifacts --loglevel=debug --name=kind-kubetest. Will terminate in another 15m
SIGABRT: abort

@aojea
i think the kind export logs timeout is reproducible consistently with the HA config.
it just stays at:

DEBU[01:17:24] Running: /usr/bin/docker [docker exec --privileged kind-external-load-balancer tar --hard-dereference -C /var/log -chf - .]

also:
kind build node-image --kube-root=$(pwd)
for k/k commit 1409ff38e5828f55

gives me some CRI errors:

INFO[2019-06-04T22:36:46.255776190Z] Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs" 
ERRO[2019-06-04T22:36:46.259945113Z] Failed to load cni during init, please check CRI plugin status before setting up network for pods  error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
INFO[2019-06-04T22:36:46.263918747Z] loading plugin "io.containerd.grpc.v1.introspection"...  type=io.containerd.grpc.v1
INFO[2019-06-04T22:36:46.264185013Z] Start subscribing containerd event           
INFO[2019-06-04T22:36:46.265648130Z] Start recovering state                       
INFO[2019-06-04T22:36:46.266610875Z] Start event monitor                          
INFO[2019-06-04T22:36:46.266651618Z] Start snapshots syncer                       
INFO[2019-06-04T22:36:46.266663684Z] Start streaming server                       
INFO[2019-06-04T22:36:46.266727343Z] serving...                                    address="/run/containerd/containerd.sock"
INFO[2019-06-04T22:36:46.266748482Z] containerd successfully booted in 0.099826s  
ERRO[2019-06-04T22:36:46.885194357Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:47.112182694Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:47.666346302Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:48.154099576Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:49.001964021Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:49.503738885Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:50.248597676Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:50.630062574Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/pause:3.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:50.656049117Z] ImageCreate event &ImageCreate{Name:sha256:da86e6ba6ca197bf6bc5e9d900febd906b133eaa4750e6bed647b0fbe50ed43e,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:50.657063170Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/pause:3.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:51.166245090Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:51.618692113Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:52.051331802Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-proxy:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:52.100652939Z] ImageCreate event &ImageCreate{Name:sha256:ea22abc123a1c4162fdd6e37a6594b242d5b83284faf8b5dff1fcaaf6c11ae48,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:52.101579489Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-proxy:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:52.512814469Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:52.979902246Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:53.243875342Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/ip-masq-agent:v2.4.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:53.267360225Z] ImageCreate event &ImageCreate{Name:sha256:19bb968f77bba3a5b5f56b5c033d71f699c22bdc8bbe9412f0bfaf7f674a64cc,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:53.268120372Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/ip-masq-agent:v2.4.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.324315128Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-apiserver:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:55.328863044Z] ImageCreate event &ImageCreate{Name:sha256:25876c9e6eb0c50f882e6e586f1a41f98d19f44fb94677877d8c096c5b2249c2,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.329891024Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-apiserver:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.412397894Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/coredns:1.3.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:55.421736948Z] ImageCreate event &ImageCreate{Name:sha256:eb516548c180f8a6e0235034ccee2428027896af16a509786da13022fe95fe8c,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.422692034Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/coredns:1.3.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:57.098307600Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-repositories locked: unavailable" ref=tar-repositories total=142
INFO[2019-06-04T22:36:57.229823963Z] ImageCreate event &ImageCreate{Name:docker.io/kindest/kindnetd:0.1.0,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.280581565Z] ImageCreate event &ImageCreate{Name:sha256:f227066bdc5f9aa2f8a9bb54854e5b7a23c6db8fce0f927e5c4feef8a9e74d46,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.281617808Z] ImageUpdate event &ImageUpdate{Name:docker.io/kindest/kindnetd:0.1.0,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.301074980Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-controller-manager:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.318367087Z] ImageCreate event &ImageCreate{Name:sha256:c155832fa95abf9b0c33d3f9c39ad43bf55002b853c44394a6dbb89dcc68f55a,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.319170150Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-controller-manager:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.320687117Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.326232183Z] ImageCreate event &ImageCreate{Name:sha256:7dd88ed8b36300ced4153dcc2d0e4995378da059be497d31271771b19976fedc,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.327088405Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:37:00.575908437Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/etcd:3.3.10,Labels:map[string]string{},} 
INFO[2019-06-04T22:37:00.580934619Z] ImageCreate event &ImageCreate{Name:sha256:2c4adeb21b4ff8ed3309d0e42b6b4ae39872399f7b37e0856e673b13c4aba13d,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:37:00.581375948Z] Stop CRI service                           

EDIT: i think the cause for:

error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized

is a bad node-image.

"CRI errors" during kind build node-image are normal.

kind export logs and kind load ... appear to be broken on HA because they're trying to run against the nginx load-balancer node. it has nothing to do with CRI etc.

a couple of problems:

looks like --loglevel=debug is not a factor after all.

lubo-it@luboitvbox:~/go/src/k8s.io/kubernetes$ kind create cluster --image kindest/node:latest --config ./../../k8s.io/kubeadm/tests/e2e/kind/ha-cp/ha-cp.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:latest) 🖼 
 ✓ Preparing nodes 📦📦📦📦📦📦📦 
 ✓ Configuring the external load balancer ⚖️ 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✗ Joining more control-plane nodes 🎮 
$KUBECONFIG is still set to use /home/lubo-it/.kube/kind-config-kind even though that file has been deleted, remember to unset it
Error: failed to create cluster: failed to join a control plane node with kubeadm: exit status 1
lubo-it@luboitvbox:~/go/src/k8s.io/kubernetes$ kind create cluster --image kindest/node:latest --config ./../../k8s.io/kubeadm/tests/e2e/kind/ha-cp/ha-cp.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:latest) 🖼 
 ✓ Preparing nodes 📦📦📦📦📦📦📦 
 ✓ Configuring the external load balancer ⚖️ 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Joining more control-plane nodes 🎮 
 ✓ Joining worker nodes 🚜 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info

#593 fixed the log export and image load issues, not sure about whatever the flakiness is about: https://github.com/kubernetes-sigs/kind/pull/593

https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master

the latest green run passes with kind from master.
but the previous run has the Unauthorized in 3567 milliseconds\nerror execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized"

the one before that had the timeout problem that https://github.com/kubernetes-sigs/kind/pull/593 fixed.

a run that has https://github.com/kubernetes/test-infra/pull/12883 should be in 1 hour.

I narrowed it down to this snippet, what could the problem be? what does "invalid bearer token" mean?

Starting autoregister controller
2019-06-05T18:52:11.653957701Z stderr F I0605 18:52:11.653932       1 cache.go:32] Waiting for caches to sync for autoregister controller
2019-06-05T18:52:11.657095624Z stderr F E0605 18:52:11.656920       1 controller.go:148] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/172.17.0.7, ResourceVersion: 0, AdditionalErrorMsg: 
2019-06-05T18:52:11.682115562Z stderr F E0605 18:52:11.671678       1 authentication.go:65] Unable to authenticate the request due to an error: invalid bearer token
2019-06-05T18:52:11.682181186Z stderr F E0605 18:52:11.673921       1 authentication.go:65] Unable to authenticate the request due to an error: invalid bearer token
2019-06-05T18:52:11.754070454Z stderr F I0605 18:52:11.753862       1 cache.go:39] Caches are synced for AvailableConditionController controller

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136342315124658176/artifacts/kind-kubetest-control-plane3/pods/kube-system_kube-apiserver-kind-kubetest-control-plane3_58215256c50a36017e211d17dfbfd67c/kube-apiserver/0.log

bootstrap tokens are technically bearer tokens:
https://kubernetes.io/docs/reference/access-authn-authz/bootstrap-tokens/
AFAIK, this error can happen if kubeadm join passed a bad JWT (the --token part of join).

not sure how a token can flake though...
were you able to reproduce it locally?
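
if it helps, a rough way to sanity-check the bootstrap token from the first control-plane node (the container name and paths assume the default kind naming / kubeadm layout):

# list the bootstrap tokens the cluster knows about and check the EXPIRES column
docker exec kind-control-plane kubeadm token list
# the matching secrets live in kube-system as bootstrap-token-<token-id>
docker exec kind-control-plane kubectl --kubeconfig /etc/kubernetes/admin.conf \
  -n kube-system get secrets | grep bootstrap-token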

on a side note, kubeadm does have concurrent CP join in k/k master now:
xref https://github.com/kubernetes-sigs/kind/blob/master/pkg/cluster/internal/create/actions/kubeadmjoin/join.go#L90-L95
(we can probably try it after this ticket is closed)

@neolit123 the job seems stable now https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master , doesn't it?

yes, but we switched it to kind v0.2.0, because it's in the release-informing dashboard and the release team wants it green.
https://github.com/kubernetes/test-infra/pull/12883

Seems that the etcd cluster wasn't ready; in this case, kind-control-plane3 is the one that wasn't able to join the cluster at the moment of the authentication.
A hypothesis is that the LB forwarded the worker3 join connection to it and that's why it failed, since the etcd cluster wasn't assembled yet. Should kubeadm handle this scenario?

I think that kind is too fast at this moment :)

2019-06-05T18:52:01.871076762Z stderr F 2019-06-05 18:52:01.870776 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2019-06-05T18:52:01.871232212Z stderr F 2019-06-05 18:52:01.871073 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-06-05T18:52:01.877611555Z stderr F 2019-06-05 18:52:01.877310 I | embed: rejected connection from "172.17.0.7:50372" (error "remote error: tls: bad certificate", ServerName "")
2019-06-05T18:52:06.918508278Z stderr F 2

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136342315124658176/artifacts/kind-kubetest-control-plane3/pods/kube-system_etcd-kind-kubetest-control-plane3_5c9595671afd08252eb2e03d05489ee5/etcd/0.log

this is a possibility yes. but the kinder HA jobs are consistently green in this regard.
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-master-on-stable
i wonder what ends up being different.

i think we should bisect kind v0.2.0 -> latest and find what happened.

That's a fantastic idea, I think that I know what's happening, let me do some tests

/assign @aojea

I am experiencing this flakiness as well. Let me know if I can help test any fixes!

@aojea @neolit123

One additional question:

is it possible that the health checks on the load balancer are not working properly (the load balancer lets traffic pass before the API server is fully operational - with etcd)?

And some answers (hopefully helpful)

kinder HA jobs are consistently green in this regard.

Kinder has wait loops after each action and it is less aggressive than kind in terms of parallelism; e.g.
https://github.com/kubernetes/kubeadm/blob/30be64eaa19c663dd1f8e8d9826608e52d06607c/kinder/pkg/actions/util.go#L85

It is possible to pass --wait=0 to get a behaviour more similar to what is implemented in kind, but parallelism for join is not there at all.

Should kubeadm handle this scenario?

Afaik kubeadm join implements a wait loop, but as soon as one API server answers it proceeds with joining.
I will try to investigate if there is some regression there, even if it seems that the problem is not on the waiting side but on the answering side (the kube-apiserver answers before everything, etcd/RBAC, is properly set up)

@fabriziopandini sorry for not updating the ticket, you are absolutely right, the problem is that nginx is only doing health checks at the TCP level, whereas haproxy was doing health checks at the TLS level with the ssl-hello-chk option.
The problem is that as soon as the TCP port is open nginx forwards traffic to it, but it can happen that the kube-apiserver is not working yet, thus the join fails.
I was doing research yesterday on load balancers; basically, we need SSL passthrough (L4 load balancing) with SSL health checks:

  • haproxy is the best solution because it has the ssl-hello-chk option; however, the docker container doesn't have a default configuration and it dies unless we bind mount the config file. we could change the entrypoint to a sleep forever, for example, is this an option? (see the config sketch below)
  • nginx, active health checks seem to be only available on the commercial version or with custom modules
  • traefik, TCP services are only available in alpha in the latest version 2.0.0, and the docs don't mention health checks. It's very powerful and flexible and could be used to solve the port forwarding too, something to check in the future.
  • envoy, seems like a good alternative, I want to test it today

Platforms/Architectures images: haproxy and nginx seem to have docker images for almost all platforms, but traefik and envoy are just starting to support more platforms
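
To make the haproxy option above concrete, this is roughly the config I have in mind: TCP (SSL passthrough) load balancing with ssl-hello-chk health checks. The image tag, container name and backend IPs below are just examples, not what kind generates:

# write a minimal haproxy.cfg and run it with a bind mount
cat <<'EOF' > haproxy.cfg
defaults
    mode tcp
    timeout connect 5s
    timeout client  50s
    timeout server  50s

frontend control-plane
    bind *:6443
    default_backend kube-apiservers

backend kube-apiservers
    # only consider an apiserver healthy once it completes a TLS hello,
    # instead of as soon as the TCP port is open
    option ssl-hello-chk
    server kind-control-plane  172.17.0.3:6443 check
    server kind-control-plane2 172.17.0.4:6443 check
    server kind-control-plane3 172.17.0.5:6443 check
EOF
docker run -d --name test-haproxy-lb -p 6443:6443 \
  -v "$(pwd)/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro" haproxy:1.9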

Alternatively we could simply not LB to nodes that have not joined yet.

Since we're fully aware of which nodes are ready to serve...

hmm, I was investigating a bit more and I'm no longer sure that the problem is the LB ... 🤔 maybe the problem is in the apiserver as @fabriziopandini is suggesting, or in kubeadm join --control-plane

I was checking the code and the workers don't join until the control-plane nodes are ready; that means that our LB should work, no need to add a health check.

Checking one of the failed jobs, the last control-plane node is marked as ready at 18:52:07

18:52:07.530904     814 round_trippers.go:438] PATCH https://172.17.0.2:6443/api/v1/nodes/kind-kubetest-control-plane3 200 OK in 10 milliseconds\n\nThis node has joined the cluster and a new control plane instance was created

and the worker seems to connect correctly to the kube-apiserver, no TLS handshake errors

18:52:08.069008    1149 token.go:205] [discovery] Successfully established connection with API Server \"172.17.0.2:6443\"\nI0605 18:52:08.069085    1149 join.go:447] [preflight] Fetching init configuration\nI0605 18:52:08.069140    1149 join.go:480] [preflight] Retrieving KubeConfig objects\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'\nI0605 18:52:11.672313    1149 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 3602 milliseconds\nerror execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized"

It seems that the etcd pod on the control-plane3 node is not ready at that time; it has the following log at the time of the failed join

2019-06-05T18:52:01.612898076Z stderr F 2019-06-05 18:52:01.612807 I | etcdserver: published {Name:kind-kubetest-control-plane3 ClientURLs:[https://172.17.0.7:2379]} to cluster 5af0857ece1ce0e5
2019-06-05T18:52:01.614620969Z stderr F 2019-06-05 18:52:01.614445 I | embed: serving client requests on 127.0.0.1:2379
2019-06-05T18:52:01.614772504Z stderr F 2019-06-05 18:52:01.614652 I | embed: serving client requests on 172.17.0.7:2379
2019-06-05T18:52:01.621501875Z stderr F 2019-06-05 18:52:01.621348 I | etcdserver: b0588fe16c7809d9 initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
2019-06-05T18:52:01.878295546Z stderr F 2019-06-05 18:52:01.878043 I | embed: rejected connection from "172.17.0.7:46984" (error "remote error: tls: bad certificate", ServerName "")
2019-06-05T18:52:06.916352393Z stderr F 2019-06-05 18:52:06.916069 I | embed: rejected connection from "172.17.0.7:47050" (error "remote error: tls: bad certificate", ServerName "")

The etcd pod in control-plane2

(prober "ROUND_TRIPPER_RAFT_MESSAGE")
2019-06-05T18:52:01.871232212Z stderr F 2019-06-05 18:52:01.871073 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-06-05T18:52:01.877611555Z stderr F 2019-06-05 18:52:01.877310 I | embed: rejected connection from "172.17.0.7:50372" (error "remote error: tls: bad certificate", ServerName "")
2019-06-05T18:52:06.918508278Z stderr F 2019-06-05 18:52:06.918316 I | embed: rejected connection from "172.17.0.7:50440" (error "remote error: tls: bad certificate", ServerName "")

and control-plane1

2019-06-05T18:52:01.58090938Z stderr F 2019-06-05 18:52:01.580730 I | rafthttp: established a TCP streaming connection with peer b0588fe16c7809d9 (stream MsgApp v2 reader)
2019-06-05T18:52:01.582170035Z stderr F 2019-06-05 18:52:01.582071 I | rafthttp: established a TCP streaming connection with peer b0588fe16c7809d9 (stream Message reader)
2019-06-05T18:52:01.591084563Z stderr F 2019-06-05 18:52:01.590896 I | rafthttp: established a TCP streaming connection with peer b0588fe16c7809d9 (stream MsgApp v2 writer)
2019-06-05T18:52:01.869409605Z stderr F 2019-06-05 18:52:01.869238 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-06-05T18:52:01.86945345Z stderr F 2019-06-05 18:52:01.869299 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")

I think that the kubeadm join on kind-control-plane3 shouldn't finish and mark the node as ready since it seems it is not; the logs show it is not ready until 18:52:13, however the worker tries to join at 18:52:08 and that node was marked ready at 18:52:07

2019-06-05T18:52:12.663212488Z stderr F I0605 18:52:12.662891       1 storage_scheduling.go:128] all system priority classes are created successfully or already exist.
2019-06-05T18:52:13.134248912Z stderr F W0605 18:52:13.133980       1 lease.go:223] Resetting endpoints for master service "kubernetes" to [172.17.0.4 172.17.0.5 172.17.0.7]
2019-06-05T18:52:13.136213587Z stderr F I0605 18:52:13.136050       1 controller.go:606] quota admission added evaluator for: endpoints

I am experiencing this flakiness as well. Let me know if I can help test any fixes!

@thomastaylor312 I can't reproduce it locally, can you use --retain so the nodes are not deleted after the failure and check your etcd pod logs to see if you find this error?

embed: rejected connection from "172.17.0.7:46984" (error "remote error: tls: bad certificate", ServerName "")

I'm curious why the ServerName is empty
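
in case it helps to reproduce and collect data, the flow i have in mind is roughly this (the config path is an example):

# keep the node containers around after a failed create so their logs survive
kind create cluster --retain --config ha.yaml --image kindest/node:latest --loglevel=debug
# dump kubelet/containerd/pod logs from all nodes, then grep the etcd logs
kind export logs ./failed-cluster-logs
grep -r "rejected connection" ./failed-cluster-logs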

on a similar note, after pinning the kubeadm-kind-* jobs to 0.2.0 for HA, the -1.14 job started consistently failing on dumping cluster logs: https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-1.14

time="04:18:15" level=debug msg="Running: /usr/bin/docker [docker exec --privileged -t kind-kubetest-control-plane cat /var/log/pods/kube-system_kube-controller-manager-kind-kubetest-control-plane_e413321d294f763688239fa96919bf8d/kube-controller-manager/1.log]"
time="04:18:15" level=debug msg="Running: /usr/bin/docker [docker exec --privileged -t kind-kubetest-control-plane cat /var/log/pods/kube-system_kube-controller-manager-kind-kubetest-control-plane_e413321d294f763688239fa96919bf8d/kube-controller-manager/0.log]"
Error: exit status 1
exit status 1

trying to figure out locally why it fails on the release-1.14 branch only.

@aojea

I think that the kubeadm join on kind-control-plane3 shouldn't finish and mark the node as ready since it seems it is not; the logs show it is not ready until 18:52:13, however the worker tries to join at 18:52:08 and that node was marked ready at 18:52:07

i can get it to fail more consistently in the "joining more CP nodes" step.

a rough deflake is:

diff --git a/pkg/cluster/internal/create/actions/kubeadmjoin/join.go b/pkg/cluster/internal/create/actions/kubeadmjoin/join.go
index e54ddd1..f673521 100644
--- a/pkg/cluster/internal/create/actions/kubeadmjoin/join.go
+++ b/pkg/cluster/internal/create/actions/kubeadmjoin/join.go
@@ -23,6 +23,7 @@ import (
        "path"
        "path/filepath"
        "strings"
+       "time"

        "github.com/pkg/errors"
        log "github.com/sirupsen/logrus"
@@ -91,10 +92,13 @@ func joinSecondaryControlPlanes(
        // (this is not safe currently)
        for _, node := range secondaryControlPlanes {
                if err := runKubeadmJoinControlPlane(ctx, allNodes, &node); err != nil {
+                       time.Sleep(10 * time.Second)
                        return err
                }
        }

+       time.Sleep(10 * time.Second)

obviously not a good solution, but given that serial join without waiting does not work, i don't see how parallel will work either. staring at etcd logs now.

@aojea I'll give it a whirl today and see if I can get a failure

embed: rejected connection from "172.17.0.7:46984" (error "remote error: tls: bad certificate", ServerName "")

searching on this topic, it seems like a confusing etcd message that everyone is getting for a slightly different reason. we have it in the green runs too.

can you give this PR a shot https://github.com/kubernetes-sigs/kind/pull/598?
I can't repro locally, however, I observed that some nodes take close to 10 seconds to be ready after joining the cluster

@aojea That is the behavior I have seen. I wrote a brief loop in my code to wait for all the nodes to hit ready

can you give this PR a shot #598?

it's pretty much the same as the diff i posted here https://github.com/kubernetes-sigs/kind/issues/588#issuecomment-499912149

so i'm going to assume it's going to work.
the problem is that instead of adding this fix we should try to make the CP nodes join concurrently.

currently i'm bisecting around 0.3.0, because this works fine with 0.2.0.

I can't repro locally

i wonder what else could be at play here.
did you try multiple times?

@aojea

I think that the kubeadm join on kind-control-plane3 shouldn't finish and mark the node as ready

Technically speaking kubeadm doesn't mark the node ready; it just lays out some static pod manifests and then it is the kubelet that starts the pods and reports pod state/node state.

However, what I'm going to investigate is the part of the code in kubeadm that adds a new etcd member:
if it is possible to detect somehow that etcd is technically up, but not fully synchronized yet/not ready to respond to API server queries, we can make kubeadm wait for etcd before starting the new instance of the API server.

TBH I was expecting the API server to already do this, but looking at your excellent log analysis this is not the case
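
For the etcd side, a rough way to poke at member/endpoint state from the first control plane; the pod name and cert paths are assumptions based on the default kind/kubeadm layout, and this relies on etcdctl being shipped in the etcd image:

# query etcd through the static pod on the first control-plane node
for cmd in "member list" "endpoint health"; do
  kubectl -n kube-system exec etcd-kind-control-plane -- sh -c \
    "ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
       --cacert=/etc/kubernetes/pki/etcd/ca.crt \
       --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
       --key=/etc/kubernetes/pki/etcd/healthcheck-client.key $cmd"
done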

2019-06-07T14:40:48.579276475Z stdout F hostIP = 172.17.0.4
2019-06-07T14:40:48.579306826Z stdout F podIP = 172.17.0.4
2019-06-07T14:40:48.585756206Z stderr F panic: Get https://10.96.0.1:443/api/v1/nodes: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
2019-06-07T14:40:48.585774503Z stderr F 
2019-06-07T14:40:48.585802435Z stderr F goroutine 1 [running]:
2019-06-07T14:40:48.585806811Z stderr F main.main()
2019-06-07T14:40:48.585809924Z stderr F     /src/main.go:84 +0x423

btw, probably unrelated but one of the kindnet containers is always panicking in both green and red runs.

@aojea I was able to duplicate the issue and my etcd pods are showing that "bad certificate" error with an empty ServerName. Is there something else you'd like me to collect/debug?

@neolit123 kindnet panics for almost everything so it can restart in a clean state, but the error seems related to this ... and is another certificate issue.
I think that we are getting closer to the problem, it seems that something is happening with one certificate, doesn't it?
@thomastaylor312 if you can run kind export logs and attach the output here that would be great; it would also be nice to check the differences between the certificates with openssl s_client -connect loadbalancerip:6443 (you have to run the command as many times as there are control-plane nodes)
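
Something like this is the certificate comparison I mean; 172.17.0.2 is just an example (the LB address from the logs above), and the loop count matches the number of control-plane nodes:

# hit the LB repeatedly; round-robin should surface each apiserver's serving cert in turn
for i in 1 2 3; do
  openssl s_client -connect 172.17.0.2:6443 </dev/null 2>/dev/null |
    openssl x509 -noout -subject -issuer -dates
done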

Ok, debug logs are attached @aojea.
failed-cluster-logs.tar.gz

As for the certs, each cert is different (which I _think_ is correct). I've attached the output of each one as well:
cluster-certs.tar.gz

It turns out that an etcd member that has joined a cluster but is not yet aligned with the rest of the nodes is called a "learner", and there is ongoing work to get this properly handled...
https://github.com/etcd-io/etcd/issues/10537

the problem is that nginx is only doing health checks at the TCP level, whereas haproxy was doing health checks at the TLS level with the ssl-hello-chk option.
The problem is that as soon as the TCP port is open nginx forwards traffic to it, but it can happen that the kube-apiserver is not working yet, thus the join fails.
I was doing research yesterday on load balancers; basically, we need SSL passthrough (L4 load balancing) with SSL health checks:

after a bisecting session i was able to confirm that nginx is indeed the problem.
after the switch to containerd the LB became broken and this PR both fixed the failing LB and changed the LB from haproxy to nginx:
https://github.com/kubernetes-sigs/kind/pull/470

however after those commits i can get an HA cluster to flake roughly 50% of the time.

haproxy is the best solution because it has the ssl-hello-chk option; however, the docker container doesn't have a default configuration and it dies unless we bind mount the config file. we could change the entrypoint to a sleep forever, for example, is this an option?

this is technically the recommended way to configure haproxy.
but we are going to have to run it "dind" unless we write the haproxy config on the host, correct?

envoy, seems a good alternative, want to test it today

i haven't tried envoy, but possibly worth a try as long as it's an actual improvement over haproxy.

nginx, active health checks seem to be only available on the commercial version or with custom modules

traefik, TCP services is only available in alpha in the last version 2.0.0, the docs doesn't mention health

traefik and nginx don't seem like good options.

It turns out that an etcd member after joining a cluster but not yet aligned with the rest of nodes is called "learner" and there is work ongoing to get this properly handled...
etcd-io/etcd#10537

FYI, learner / non-voting member is not released yet. Also, joining a cluster as a learner requires an additional flag to be enabled. Existing etcd membership reconfiguration API calls will not lead to a new member being added to the cluster as a learner.

@neolit123 I think that the LB only alleviates the problem, the same way introducing delays alleviates it ... I think that the root cause is that the API server can't serve the kubeadm-config ConfigMap; as you can see the TLS handshake happens without problems, thus the LB will not solve it ... I'm more inclined to think that the reason is in the etcd cluster 🤷‍♂

18:52:11.672313    1149 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 3602 milliseconds\nerror execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized"

@thomastaylor312 in your attached logs I don't see the 401 Unauthorized error 🤔

right, note that the haproxy change came with making things quite a bit faster, and we've only gotten faster since :^)

finding racy problems with bringup is a good thing from kind's POV, we just need to root-cause and fix them, ideally that doesn't involve working around them by adding busy waiting in kind...

anything like that we need to do, all other consumers of kubeadm potentially need to do too.

I did a simple test: in one terminal create a kind cluster with 3 control planes.
Quickly, in another terminal and using the LB IP address, run while true ; do curl -k https://<LB IP ADDRESS>:6443/healthz; sleep 1; done while the cluster is booting

You can see that the api servers are receiving requests despite not being healthy

[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed
[+]ping ok
[+]log ok
[-]etcd failed: reason withheld
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/ca-registration ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed
(snipped)
+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
healthz check failed
okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokok{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/healthz\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}okokokokokokokokokokokokokokokokokokokokokokokokokokokokok[+]ping ok
[+]log ok
[-]etcd failed: reason withheld

To follow best practices the LB should honor the health check and not forward traffic to those API servers while they are not healthy (what happens if, because of some etcd issue, all of them are unavailable?)

Another option I was trying is using the liveness and readiness probes in the kube-apiserver pod: if we are able to stop the pod from receiving traffic on the API server's exposed port, nginx will detect it and not forward traffic. But it seems that the current timers are not aggressive enough

```
livenessProbe:
  failureThreshold: 8
  httpGet:
    host: 172.17.0.4
    path: /healthz
    port: 6443
    scheme: HTTPS
  initialDelaySeconds: 15
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 15
```

My understanding of readiness is that it only works for Services, should it work here?
Can we use more aggressive timers for the liveness probe?

@aojea thanks for your great work trying to address this issue.

Before answering your question, let me recap the options that in my opinion should be considered for solving this issue:

  1. act on the API server, implementing a change for not answering while not healthy
  2. act on kubeadm, implementing a wait cycle that detects when then new etcd member is fully syncronized, and then create the new API server only after this condition is satisfied
  3. act on kubeadm, tuning probes
  4. act on the kind load balancer

IMO 1 is the best option, but it will take some time

I'm in contact with the etcd maintainers about implementing 2 in a smart way (I don't like the idea of adding arbitrary sleeps). ATM there is a promising feature still only on master; let's see if there is also something immediately actionable.

You are looking at 2/3

With regards to timers, they are the result of a painful trade-off between different types of kubeadm users. We should be really careful about touching those values unless we find something badly wrong or we can assume that the change doesn't hurt users with slower devices (Raspberry Pi is the corner case in this area).

AFAIK, the liveness probe isn't the best candidate for this use case (detecting that things have not started yet); I'm also trying to understand if readiness/liveness can actually block traffic for a static pod using host networking...

My understanding of Readiness is that only works for Services, should it work here?
Can we use more agressive timers for the liveness probe?

immediate answer - no, we cannot as these are well balanced. we have found cases where the values do not match a use case, but normally there is another issue at play there.

I've looked at envoy and I'm afraid it cannot solve the problem; it seems it doesn't have (at least I couldn't find it) the option to do HTTPS health checks against the API /healthz endpoint (it does HTTP only). If anybody wants to give it a shot, this is the configuration I've tried

envoy config

docker run -d envoyproxy/envoy-alpine:v1.10.0
docker run -it ENVOYCONTAINER sh
cat <<EOF > /etc/envoy/envoy.yaml
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 9901
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          stat_prefix: kind_tcp
          cluster: kind_cluster
  clusters:
  - name: kind_cluster
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: kind_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 172.16.0.4
                port_value: 6443
    health_checks:
      - timeout: 1s
        interval: 2s
        interval_jitter: 1s
        unhealthy_threshold: 1
        healthy_threshold: 3
        http_health_check:
          path: "/healthz"

EOF

exit

docker restart ENVOYCONTAINER 

I think we should go with options 1-3 mentioned by fabrizio and, if we want to work around the problem until we find the definitive solution, we can go for the active loop https://github.com/kubernetes-sigs/kind/pull/598.
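
for reference, the effect of that active loop is roughly the same as waiting client-side for every node to report Ready after the joins, e.g.:

# wait up to 5 minutes for all nodes to become Ready before declaring the cluster created
export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl wait --for=condition=Ready node --all --timeout=300s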

/unassign @aojea

can't seem to reproduce the problem after the workaround in https://github.com/kubernetes/kubernetes/pull/78915
cc @ereslibre

the issue of CM retries is now tracked in https://github.com/kubernetes/kubeadm/issues/1606

Thanks @neolit123 and @ereslibre , that's awesome
Should we close if the jobs are stable for some days?

Should we close if the jobs are stable for some days?

I'll defer this decision to others. However, in my opinion, the workaround included in kubeadm was a temporary measure to address the flakiness, and is meant to be reverted as far as we have discussed.

Ideally, during the 1.16 cycle we should be able to understand the source(s) of the problems with concurrent joins, and as a result of those investigations some of them could circle back to kind (e.g. the LB would need to perform health checks, so we only route traffic to initialized apiservers).

NOTE: kubespray also has a mode with an internal nginx LB...
https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md#kube-apiserver

/assign @aojea

I was talking about this today with @ereslibre; despite all the issues, one fact is that the API server is advertising the correct status on the healthz endpoint, and that's info that the LB should honor.

After trying different LBs, it seems that the only open source LBs that allow doing health checks on an HTTPS endpoint together with HTTPS passthrough load balancing are:

  • haproxy, we have to overcome the limitation that the default image comes without a config and dies ... maybe we can boot it with a sleep command ??
  • nginx, only the paid version or installing modules allows doing active health checks

haproxy, we have to overcome the limitation that the default image comes without a config and dies ... maybe we can boot it with a sleep command ??

The standard docs show adding your config to the image. You can put a dummy config. If we boot a sleep then restart doesn't work properly.

  • add a dummy config to our custom image kindest/haproxy
  • rewrite the config at runtime by writing it into the container
  • use SIGHUP to reload the config (see the sketch after this list)

we actually had this working at one point already :-)
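
a sketch of the rewrite-at-runtime + SIGHUP idea against the running LB container; the container name matches what kind creates today and the config path is the official haproxy image default, so treat both as assumptions:

# copy a regenerated config into the running load balancer ...
docker cp haproxy.cfg kind-external-load-balancer:/usr/local/etc/haproxy/haproxy.cfg
# ... and ask haproxy to reload it without recreating the container
docker kill --signal=HUP kind-external-load-balancer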

so far I haven't gotten it to flake, still "soak testing" this locally :upside_down_face:

https://github.com/kubernetes-sigs/kind/issues/588#issuecomment-501922740

if using k/k from master our workaround in https://github.com/kubernetes/kubernetes/pull/78915 might be kicking in.

given https://github.com/kubernetes-sigs/kind/pull/645 merged, we can locally revert https://github.com/kubernetes/kubernetes/pull/78915 and test again for flakes.

but this issue can also be closed as kind is now doing what is needed (e.g. using haproxy)?

@neolit123 absolutely, I will start investigating on my spare time. Will ping you back with my findings.

if using k/k from master our workaround in kubernetes/kubernetes#78915 might be kicking in.

I'm testing with v1.14.2 currently. (the default image until we bump it... about to do that)

$ docker logs kind-external-load-balancer 
[WARNING] 174/214223 (1) : config : missing timeouts for frontend 'controlPlane'.
   | While not properly invalid, you will certainly encounter various problems
   | with such a configuration. To fix this, please ensure that all following
   | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
[NOTICE] 174/214223 (1) : New worker #1 (6) forked
[WARNING] 174/214225 (1) : Reexecuting Master process
[ALERT] 174/214225 (1) : sendmsg()/writev() failed in logger #1: No such file or directory (errno=2)
[WARNING] 174/214225 (6) : Stopping frontend controlPlane in 0 ms.
[WARNING] 174/214225 (6) : Stopping backend kube-apiservers in 0 ms.
[WARNING] 174/214225 (6) : Stopping frontend GLOBAL in 0 ms.
[WARNING] 174/214225 (6) : Proxy controlPlane stopped (FE: 0 conns, BE: 0 conns).
[NOTICE] 174/214225 (1) : New worker #1 (22) forked
[WARNING] 174/214225 (6) : Proxy kube-apiservers stopped (FE: 0 conns, BE: 0 conns).
[WARNING] 174/214225 (6) : Proxy GLOBAL stopped (FE: 0 conns, BE: 0 conns).
[WARNING] 174/214225 (22) : Server kube-apiservers/kind-control-plane is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 174/214225 (1) : Former worker #1 (6) exited with code 0 (Exit)
[WARNING] 174/214226 (22) : Server kube-apiservers/kind-control-plane2 is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 174/214226 (22) : Server kube-apiservers/kind-control-plane3 is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 174/214226 (22) : backend 'kube-apiservers' has no server available!
[WARNING] 174/214250 (22) : Server kube-apiservers/kind-control-plane is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214336 (22) : Server kube-apiservers/kind-control-plane is DOWN, reason: Layer7 timeout, check duration: 2001ms. 0 active and 0 backup servers left. 19 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 174/214336 (22) : backend 'kube-apiservers' has no server available!
[WARNING] 174/214341 (22) : Server kube-apiservers/kind-control-plane is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 3ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214344 (22) : Server kube-apiservers/kind-control-plane2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214413 (22) : Server kube-apiservers/kind-control-plane3 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 4ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

I think we can close now, please re-open or file a new issue if you see further problems!
Thanks everyone!
