What steps did you take and what happened:
In CAPV, I created a new cluster and then deleted it. With the default of 1 worker, the KCP controller seemingly hangs after the machines are deleted, eventually printing the following (I added some extra log lines in my branch to trace what's going on):
[manager] I0310 08:30:06.521810 8 controller.go:171] controller-runtime/controller "msg"="Starting Controller" "controller"="kubeadmcontrolplane"
[manager] I0310 08:30:06.521864 8 controller.go:190] controller-runtime/controller "msg"="Starting workers" "controller"="kubeadmcontrolplane" "worker count"=1
[manager] I0310 08:31:05.532426 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:05.532479 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:05.532492 8 kubeadm_control_plane_controller.go:751] controllers/KubeadmControlPlane "msg"="Non control plane machines exist and must be removed before control plane machines are removed" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:35.685832 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:35.685914 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:31:48.434291 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:31:48.434861 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out" "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:31:49.435353 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:31:49.435468 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:32:02.432420 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:32:02.440652 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out" "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:32:03.441244 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:32:03.441342 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:32:16.436936 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:32:16.437515 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out" "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:32:17.437983 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:32:17.438080 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:19.961771 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="Get \"https://192.168.192.187:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\": read tcp 10.244.0.31:34044-\u003e192.168.192.187:6443: read: connection timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:19.962149 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="Get \"https://192.168.192.187:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\": read tcp 10.244.0.31:34044-\u003e192.168.192.187:6443: read: connection timed out" "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:37:20.962520 8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:37:20.962614 8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] I0310 08:37:20.962639 8 kubeadm_control_plane_controller.go:744] controllers/KubeadmControlPlane "msg"="no machines, ok to delete" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:22.953251 8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: failed to create client for Cluster default/test: Get \"https://192.168.192.187:6443/api?timeout=32s\": dial tcp 192.168.192.187:6443: connect: no route to host" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:22.986688 8 kubeadm_control_plane_controller.go:179] controllers/KubeadmControlPlane "msg"="Failed to patch KubeadmControlPlane" "error"="kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"test\" not found" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default"
[manager] E0310 08:37:22.986769 8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="[failed to create remote cluster client: failed to create client for Cluster default/test: Get \"https://192.168.192.187:6443/api?timeout=32s\": dial tcp 192.168.192.187:6443: connect: no route to host, kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"test\" not found]" "controller"="kubeadmcontrolplane" "request"=
What did you expect to happen:
Resources to be deleted within 60s.
Anything else you would like to add:
Increasing the worker count would be a suitable workaround for now. One potential optimisation would be to synchronise on a channel and cancel outstanding reconciliations on delete.
Environment:
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):

/kind bug
I wonder if it would make sense to move the deferred call to r.updateStatus after the call to r.reconcileDelete in Reconcile.
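To illustrate the idea, here is a minimal, self-contained sketch of the proposed ordering (the names reconcile/reconcileDelete/updateStatus mirror the discussion, not the real KCP controller code): if the delete path returns before the updateStatus defer is ever registered, a deleting control plane never attempts a status call against an unreachable workload cluster.

```go
package main

import "fmt"

// reconcile models the proposed control flow: handle deletion first, and
// only register the deferred status update on the normal path. The trace
// return value records which steps ran, in order.
func reconcile(deleting bool) (trace []string) {
	if deleting {
		// Delete path returns before the defer below is registered,
		// so no status update against the dying cluster is attempted.
		trace = append(trace, "reconcileDelete")
		return trace
	}
	defer func() { trace = append(trace, "updateStatus") }()
	trace = append(trace, "reconcileNormal")
	return trace
}

func main() {
	fmt.Println(reconcile(true))  // [reconcileDelete]
	fmt.Println(reconcile(false)) // [reconcileNormal updateStatus]
}
```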
We should probably also increase the default concurrency on KubeadmControlPlane reconciliation :joy:
I don't think either of the fixes above is necessarily a release blocker for v0.3.0, but both are definitely candidates for a fast-follow patch release.
/milestone v0.3.x
Increasing default concurrency is a different problem and won't help in this case, but +1 on increasing it to 10 regardless.
The reason it won't help is that increasing a controller's concurrency still reconciles each object serially; the same object is never reconciled by two workers at once. Since each cluster has only one KCP resource, the hang remains. It will, however, improve the situation when this controller manages multiple control planes.
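A simplified, stdlib-only model of the workqueue semantics behind this (illustration only; the real implementation is k8s.io/client-go/util/workqueue): duplicate Adds for a key coalesce into one work item, and a key that is in flight is never handed to a second worker, only re-queued after Done.

```go
package main

import "fmt"

// queue models the dedup-ing workqueue: dirty tracks keys waiting to be
// processed, processing tracks keys currently held by a worker.
type queue struct {
	order      []string
	dirty      map[string]bool
	processing map[string]bool
}

func newQueue() *queue {
	return &queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add enqueues key unless it is already queued; a key that is in flight is
// only marked dirty and re-queued by Done, never run twice concurrently.
func (q *queue) Add(key string) {
	if q.dirty[key] {
		return // already queued: coalesce
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // in flight: Done will re-queue it
	}
	q.order = append(q.order, key)
}

// Get hands the next key to a worker and marks it in flight.
func (q *queue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done releases a key; if it changed while in flight, it runs again.
func (q *queue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

func main() {
	q := newQueue()
	q.Add("default/test")
	q.Add("default/test") // coalesced
	key, _ := q.Get()
	if _, ok := q.Get(); !ok {
		fmt.Println("second worker idles: only one item for", key)
	}
	q.Done(key)
}
```

With a single KCP object per cluster there is only ever one key, so extra workers sit idle; they only pay off across many distinct objects.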
I'm +1 on adding some concurrency within the control loop. One goroutine per delete (== one goroutine per node) sounds pretty reasonable. This would be similar to what we had originally (before health checks).
The concurrency comment was more of an aside. I do think we need to consider moving the call to updateStatus to a defer that is configured after reconcileDelete has been processed. If we still need a status update during reconcileDelete, we can probably add an additional defer for it to reconcileDelete where we can add more safety around failures communicating with the workload cluster.
I'm thinking of investigating this further. Has anyone seen this behavior with CAPA?
Asking because I do not have access to any vSphere infrastructure and was going to see if I can reproduce this with CAPA.
@wfernandes you should be able to replicate by attempting to delete prior to the first control plane Machine becoming ready.
/assign
/lifecycle active
It seems that this issue was fixed as part of this PR: https://github.com/kubernetes-sigs/cluster-api/pull/2708.
/close
@wfernandes: Closing this issue.
In response to this:
It seems that this issue was fixed as part of this PR: https://github.com/kubernetes-sigs/cluster-api/pull/2708.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.