Cluster-api: KCP: upgrade gets stuck after removing etcd member

Created on 11 Mar 2020 · 20 comments · Source: kubernetes-sigs/cluster-api

What happened

When we initiated a KCP upgrade on a self-hosted ("pivoted") cluster, we observed the following sequence of events:

The KCP controller created a new machine, and it successfully joined the cluster (once we got rid of the smart quotes in the kubeadm config, but that's a different story).

At this point, the KCP logs look pretty standard:

E0311 21:59:52.529099       1 kubeadm_control_plane_controller.go:548] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass etcd health check before removing a control plane machine" "error"="there are 3 control plane nodes, but 4 etcd members" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
I0311 22:00:11.441365       1 kubeadm_control_plane_controller.go:269] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"

But then something surprising happened (immediately following the above statement):

I0311 22:00:24.199072       1 leaderelection.go:288] failed to renew lease capi-kubeadm-control-plane-system/kubeadm-control-plane-manager-leader-election-capi: failed to tryAcquireOrRenew context deadline exceeded
E0311 22:00:24.199149       1 main.go:121] setup "msg"="problem running manager" "error"="leader election lost"

And when the pod restarted and came back up, the logs resumed with:

I0311 22:00:43.533246       1 kubeadm_control_plane_controller.go:269] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
E0311 22:00:49.132408       1 kubeadm_control_plane_controller.go:548] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass etcd health check before removing a control plane machine" "error"="[there are 4 control plane nodes, but 3 etcd members, node \"ip-10-0-0-1.ec2.internal\": failed to create etcd client: unable to create etcd client: context deadline exceeded, node \"ip-10-0-0-2.ec2.internal\": failed to create etcd client: unable to create etcd client: context deadline exceeded]" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"

This repeats a few times for about a minute, complaining about connection issues to varying etcd members (but always including ip-10-0-0-1.ec2.internal, the machine marked for removal). Then there's a spate of this error:

W0311 22:01:55.293805       1 http.go:392] Error reading backend response: unexpected EOF
W0311 22:01:55.293822       1 http.go:392] Error reading backend response: unexpected EOF
W0311 22:01:55.294615       1 http.go:392] Error reading backend response: unexpected EOF
W0311 22:01:55.294743       1 http.go:392] Error reading backend response: unexpected EOF

and then things settle down to consistently stuck on:

I0311 22:02:02.347903       1 kubeadm_control_plane_controller.go:269] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"
E0311 22:02:05.741965       1 kubeadm_control_plane_controller.go:548] controllers/KubeadmControlPlane "msg"="waiting for control plane to pass etcd health check before removing a control plane machine" "error"="[there are 4 control plane nodes, but 3 etcd members, node \"ip-10-0-0-1.ec2.internal\": failed to create etcd client: unable to create etcd client: context deadline exceeded]" "cluster"="test" "kubeadmControlPlane"="test-controlplane" "namespace"="test"

It looks like things got stuck between removing the etcd member and marking the member as removed via the annotation, because once we manually added the kubeadm.controlplane.cluster.x-k8s.io/scale-down-etcd-member-removed annotation, the upgrade was able to proceed again. It did get stuck in the same way on the next node, but the one after that was smooth.
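For anyone hitting the same hang: the manual workaround amounts to annotating the stuck Machine by hand. A sketch (the machine name here is a placeholder for whichever Machine is being scaled down, and the empty annotation value is an assumption; KCP only checks for the annotation's presence):

```shell
# Hypothetical example: mark the Machine so KCP treats its etcd member as
# already removed. Replace the machine name with the one being scaled down.
kubectl annotate machine test-controlplane-abcde \
  kubeadm.controlplane.cluster.x-k8s.io/scale-down-etcd-member-removed= \
  --namespace test
```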

Summary

We noticed two things from this sequence of events:

  1. The KCP controller cannot recover effectively from a failure in etcd membership management (the self-hosted setup pretty reliably injects a fault into this process, but self-hosting is not essential to the issue).
  2. The upgrade process itself is fairly disruptive to API clients. We (the controller plus the humans poking at things) observed a fair number of hung and refused connections during the upgrade itself.

Solutioneering

Ideas for building some more resilience into the etcd membership process include:

  • Teaching the healthcheck to ignore this condition (4 nodes, 3 members, the member that's been removed can't be communicated with) even if the machine hasn't been annotated
  • Pre-marking the machine with an "about to remove" annotation before we start the etcd member removal, as a declaration of intent
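The first idea above can be sketched as a small predicate (hypothetical names and signature, not the actual KCP health check code): during a scale-down, tolerate exactly one more control plane node than etcd members instead of failing outright.

```go
package main

import "fmt"

// etcdCountHealthy is a hypothetical sketch of the relaxed check: during a
// scale-down, one extra control plane node (the one whose etcd member was
// just removed) is an expected transient state, not a failure.
func etcdCountHealthy(nodes, members int, scaleDownInProgress bool) error {
	if nodes == members {
		return nil
	}
	if scaleDownInProgress && nodes == members+1 {
		// Expected transient state: the etcd member is gone, but the
		// Machine and Node have not been deleted yet.
		return nil
	}
	return fmt.Errorf("there are %d control plane nodes, but %d etcd members", nodes, members)
}

func main() {
	fmt.Println(etcdCountHealthy(4, 3, true))  // tolerated during scale-down
	fmt.Println(etcdCountHealthy(4, 3, false)) // still an error otherwise
}
```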

I'll open a separate issue around the upgrade disruptiveness for discussion on that. In this case a less disruptive upgrade process would help in that it would make failures during the removal less likely.

Labels: area/control-plane · lifecycle/active · priority/critical-urgent · priority/important-soon

Most helpful comment

I'd be glad to help out with testing – this is my top goal today & tomorrow.

All 20 comments

cc @rudoi @mytunguyen

It did get stuck in the same way on the next node, but the one after that was smooth.

Correction, we were 3/3. The controller consistently lost leader election, once per control plane machine.

Had a quick chat with Seth, Mytu, and @vincepri. We think something that could help here would be to abandon the kubeadm.controlplane.cluster.x-k8s.io/scale-down-etcd-member-removed annotation and instead always check whether the member for the selected node exists, then remove the member and delete the Machine if it does exist.

@detiber @dlipovetsky thoughts on this? I know it was added for reentrancy concerns, but the proposed replacement is idempotent and allows us to not need to explicitly skip this action.

instead always check if the member for the selected node exists, then remove member and delete Machine if it does exist.

That is idempotent, and so KCP should remain idempotent. +1 from me.
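For reference, the check-then-remove flow being proposed could look roughly like this (the interface and names are hypothetical stand-ins, not the actual KCP etcd client):

```go
package main

import "fmt"

// etcdMembers is a hypothetical stand-in for the controller's etcd client;
// only membership operations matter for this sketch.
type etcdMembers interface {
	HasMember(name string) (bool, error)
	RemoveMember(name string) error
}

// removeMemberIdempotent sketches the proposed replacement for the
// scale-down-etcd-member-removed annotation: check whether the member for
// the selected node still exists, and only remove it if it does. After a
// controller restart, re-running this is a no-op once the member is gone.
func removeMemberIdempotent(c etcdMembers, nodeName string) error {
	exists, err := c.HasMember(nodeName)
	if err != nil {
		return fmt.Errorf("checking etcd membership for %q: %w", nodeName, err)
	}
	if !exists {
		return nil // already removed on a previous attempt
	}
	return c.RemoveMember(nodeName)
}

// fakeMembers is an in-memory implementation used to exercise the sketch.
type fakeMembers struct{ members map[string]bool }

func (f *fakeMembers) HasMember(name string) (bool, error) { return f.members[name], nil }
func (f *fakeMembers) RemoveMember(name string) error      { delete(f.members, name); return nil }

func main() {
	c := &fakeMembers{members: map[string]bool{"ip-10-0-0-1.ec2.internal": true}}
	fmt.Println(removeMemberIdempotent(c, "ip-10-0-0-1.ec2.internal")) // removes the member
	fmt.Println(removeMemberIdempotent(c, "ip-10-0-0-1.ec2.internal")) // safe to repeat
}
```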

@chuckha unfortunately after re-reading the code i don't think my proposal is sufficient on its own. if the pod fails and needs to restart, it's actually the health check that will block it, not the etcd member removal bit.

I think we somehow need to capture in the health check (that occurs during etcd member removal) the fact that there is a point in time where we _expect_ to have one more Machine than we do etcd members. :\

We chatted offline and this is a new state that the code simply has to keep track of. I'm working on a fix that mimics the known workaround strategy: Apply annotation and have the health check know that it doesn't need exactly 1:1

/milestone v0.3.2
/assign
/priority important-soon
/lifecycle active

/priority critical-urgent

@chuckha what is the fix you're referring to?

@vincepri The one that @rudoi uses where they basically apply an annotation to put a machine in a certain state to bypass the part where it gets stuck. I'm not doing that exactly but instead going to capture this state as a valid state and improve the healthcheck.

Can we work out a solution that doesn't require annotations on objects? I'd rather have ways to understand the status by querying the source

I think improving the healthcheck would remove the need for that annotation, +1 here.

sure -- can you open an issue that describes the desire to remove the existing annotation usage? it's already capturing a fair bit of state in annotations

The idea I was toying with around the healthcheck was that it should take the desired operation as well as the current state as a parameter – it seems to me the question we want to ask is not "is etcd currently healthy" but "would we expect etcd to remain healthy after we make this change?"

If we did that and surfaced the etcd member list in the kcp's status, I think we'd be closer to the ideal "look at KCP and know what the single next action to take should be" state.

yeah, the health check is really inflexible right now. it will definitely need more information like "what assertions should be made" vs "this hard-coded assertion must hold true"

Yeah. Something else to be aware of in this space is tying the assertions back to the desired state of the KCP too – e.g. #2430 happens because we look at Machines, annotations, and Nodes but not the replica count on the KCP before we decide to delete a machine.
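An operation-aware signature along those lines might look like this sketch (all names hypothetical; the point is that the intended change and the KCP's desired replica count become inputs to the check rather than a hard-coded 1:1 assertion):

```go
package main

import "fmt"

type intent int

const (
	intentNone      intent = iota
	intentScaleDown        // an etcd member removal is planned or in flight
)

// wouldRemainHealthy asks "would we expect etcd to remain healthy after
// this change?" rather than "is etcd healthy right now": the intended
// operation and the KCP's desired replica count are parameters, so the
// transient 4-nodes/3-members state is legal only during a scale-down.
func wouldRemainHealthy(desiredReplicas, nodes, members int, op intent) error {
	if op == intentScaleDown && nodes == members+1 {
		return nil // expected while the removed member's Machine is deleted
	}
	if nodes != members {
		return fmt.Errorf("there are %d control plane nodes, but %d etcd members", nodes, members)
	}
	if nodes < desiredReplicas {
		return fmt.Errorf("have %d control plane nodes, want %d", nodes, desiredReplicas)
	}
	return nil
}

func main() {
	fmt.Println(wouldRemainHealthy(3, 4, 3, intentScaleDown)) // tolerated
	fmt.Println(wouldRemainHealthy(3, 4, 3, intentNone))      // error
}
```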

https://github.com/kubernetes-sigs/cluster-api/blob/d06be1c58073953bc21e3d91762bd98dd28290c9/controlplane/kubeadm/controllers/kubeadm_control_plane_controller.go#L534-L548

This annotation can probably be removed today if we query etcd and check if the current node is in the list or not. wdyt?

After chatting a little more with @chuckha seems we have not 1, 2, but multiple issues in here! The one I posted above is the next step, after we fix the healthcheck / make it more flexible.

Let's take it one bit at a time and make sure to test it after each iteration (because more issues might pop up).

I am anti-testing. Please ship the code immediately.

-waits for this comment to show up in a performance review-

I'd be glad to help out with testing – this is my top goal today & tomorrow.

