What steps did you take and what happened:
In CAPV, I created a new cluster and then deleted it. With the default of 1 worker, the KCP controller seemingly hangs after the machines are deleted, eventually printing the following (I added some extra log lines in my branch to trace what's going on):
[manager] I0310 08:30:06.521810 8 controller.go:171] controller-runtime/controller "msg"="Starting Controller" "controller"="kubeadmcontrolplane"
[manager] I0310 08:30:06.521864 8 controller.go:190] controller-runtime/controller "msg"="Starting workers" "controller"="kubeadmcontrolplane" "worker count"=1
[manager] I0310 08:31:05.532426 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:05.532479 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:05.532492 8 kubeadm_control_plane_controller.go:751] controllers/KubeadmControlPlane "msg"="Non control plane machines exist and must be removed before control plane machines are removed" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:35.685832 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:35.685914 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:31:48.434291 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:31:48.434861 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out" "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:31:49.435353 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:49.435468 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:32:02.432420 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:32:02.440652 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out" "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:32:03.441244 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:32:03.441342 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:32:16.436936 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:32:16.437515 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out" "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:32:17.437983 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:32:17.438080 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:19.961771 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="Get \"https://192.168.192.187:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\": read tcp 10.244.0.31:34044-\u003e192.168.192.187:6443: read: connection timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:19.962149 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="Get \"https://192.168.192.187:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\": read tcp 10.244.0.31:34044-\u003e192.168.192.187:6443: read: connection timed out" "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:37:20.962520 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:37:20.962614 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:37:20.962639 8 kubeadm_control_plane_controller.go:744] controllers/KubeadmControlPlane "msg"="no machines, ok to delete" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:22.953251 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: failed to create client for Cluster default/test: Get \"https://192.168.192.187:6443/api?timeout=32s\": dial tcp 192.168.192.187:6443: connect: no route to host" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:22.986688 8 kubeadm_control_plane_controller.go:179] controllers/KubeadmControlPlane "msg"="Failed to patch KubeadmControlPlane" "error"="kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"test\" not found" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:22.986769 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="[failed to create remote cluster client: failed to create client for Cluster default/test: Get \"https://192.168.192.187:6443/api?timeout=32s\": dial tcp 192.168.192.187:6443: connect: no route to host, kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"test\" not found]" "controller"="kubeadmcontrolplane" "request"=
What did you expect to happen:
Resources to be deleted within 60s.
Anything else you would like to add:
Increasing the worker count would be a suitable workaround for now. One potential optimisation would be to synchronise on a channel and cancel outstanding reconciliations on delete.
Environment:
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):

/kind bug
I wonder if it would make sense to move the deferred call to r.updateStatus after the call to r.reconcileDelete in Reconcile.
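To illustrate the idea, here is a minimal, self-contained sketch of the proposed ordering (the names reconcile/reconcileDelete/updateStatus mirror the discussion, not the real KCP controller code): if the delete path returns before the updateStatus defer is ever registered, a deleting control plane never attempts a status call against an unreachable workload cluster.

```go
package main

import "fmt"

// reconcile models the proposed control flow: handle deletion first, and
// only register the deferred status update on the normal path. The trace
// return value records which steps ran, in order.
func reconcile(deleting bool) (trace []string) {
	if deleting {
		// Delete path returns before the defer below is registered,
		// so no status update against the dying cluster is attempted.
		trace = append(trace, "reconcileDelete")
		return trace
	}
	defer func() { trace = append(trace, "updateStatus") }()
	trace = append(trace, "reconcileNormal")
	return trace
}

func main() {
	fmt.Println(reconcile(true))  // [reconcileDelete]
	fmt.Println(reconcile(false)) // [reconcileNormal updateStatus]
}
```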
We should probably also increase the default concurrency on KubeadmControlPlane reconciliation :joy:
I don't think either of the fixes above is necessarily a release blocker for v0.3.0, but both are definitely candidates for a fast-follow patch release.
/milestone v0.3.x
Increasing default concurrency is a different problem and won't help in this case, but +1 on increasing it to 10 regardless.
The reason it won't help is that increasing a controller's concurrency still reconciles each object serially; the same object is never reconciled by two workers at once. Since each cluster has only one KCP resource, the hang remains. It will, however, improve the situation when this controller manages multiple control planes.
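A simplified, stdlib-only model of the workqueue semantics behind this (illustration only; the real implementation is k8s.io/client-go/util/workqueue): duplicate Adds for a key coalesce into one work item, and a key that is in flight is never handed to a second worker, only re-queued after Done.

```go
package main

import "fmt"

// queue models the dedup-ing workqueue: dirty tracks keys waiting to be
// processed, processing tracks keys currently held by a worker.
type queue struct {
	order      []string
	dirty      map[string]bool
	processing map[string]bool
}

func newQueue() *queue {
	return &queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add enqueues key unless it is already queued; a key that is in flight is
// only marked dirty and re-queued by Done, never run twice concurrently.
func (q *queue) Add(key string) {
	if q.dirty[key] {
		return // already queued: coalesce
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // in flight: Done will re-queue it
	}
	q.order = append(q.order, key)
}

// Get hands the next key to a worker and marks it in flight.
func (q *queue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done releases a key; if it changed while in flight, it runs again.
func (q *queue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

func main() {
	q := newQueue()
	q.Add("default/test")
	q.Add("default/test") // coalesced
	key, _ := q.Get()
	if _, ok := q.Get(); !ok {
		fmt.Println("second worker idles: only one item for", key)
	}
	q.Done(key)
}
```

With a single KCP object per cluster there is only ever one key, so extra workers sit idle; they only pay off across many distinct objects.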
I'm +1 on adding some concurrency within the control loop. One goroutine per delete (== one goroutine per node) sounds pretty reasonable. This would be similar to what we had originally (before health checks).
The concurrency comment was more of an aside. I do think we need to consider moving the call to updateStatus to a defer that is configured after reconcileDelete has been processed. If we still need a status update during reconcileDelete, we can probably add an additional defer for it to reconcileDelete where we can add more safety around failures communicating with the workload cluster.
I'm thinking of investigating this further. Has anyone seen this behavior with CAPA?
Asking because I do not have access to any vSphere infrastructure and was going to see if I can reproduce this with CAPA.
@wfernandes you should be able to replicate by attempting to delete prior to the first control plane Machine becoming ready.
/assign
/lifecycle active
It seems that this issue was fixed as part of this PR: https://github.com/kubernetes-sigs/cluster-api/pull/2708.
/close
@wfernandes: Closing this issue.
In response to this:
It seems that this issue was fixed as part of this PR: https://github.com/kubernetes-sigs/cluster-api/pull/2708.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.