Cluster-api: KCP should continuously reconcile endpoints in kubeadm ConfigMap Status

Created on 3 Apr 2020 · 20 comments · Source: kubernetes-sigs/cluster-api

What did you expect to happen:
As of today, KCP updates the endpoints list in the KubeadmConfig.Status only when scaling down control plane nodes.

https://github.com/kubernetes-sigs/cluster-api/blob/3950a9c5614ebb0e08e35ec6e1100ac3c40e89a3/controlplane/kubeadm/controllers/scale.go#L146

However, there are cases where the list of endpoints might change outside of the scale-down sequence, e.g. when a machine dies or gets deleted.

Therefore, on top of what already exists, we should implement an additional reconciliation function, run during ReconcileNormal, that ensures the list of endpoints in KubeadmConfig.Status is in sync with the actual list of CP machines.
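
A minimal sketch of what such a step could look like; the Machine/WorkloadCluster stand-ins and the APIEndpointNodeNames/RemoveNodeFromKubeadmConfigMap methods below are illustrative simplifications, not the actual cluster-api API:

```go
package sketch

import "context"

// Simplified stand-ins for the real cluster-api types: the actual controller
// works with the internal WorkloadCluster interface and an owned Machine
// collection, and RemoveMachineFromKubeadmConfigMap takes a *clusterv1.Machine.
type Machine struct {
	NodeName string // from Machine.Status.NodeRef; empty if the node is not registered yet
}

type WorkloadCluster interface {
	// APIEndpointNodeNames returns the node names currently listed under
	// apiEndpoints in the kubeadm-config ConfigMap ClusterStatus.
	APIEndpointNodeNames(ctx context.Context) ([]string, error)
	// RemoveNodeFromKubeadmConfigMap drops a single node's endpoint entry.
	RemoveNodeFromKubeadmConfigMap(ctx context.Context, nodeName string) error
}

// reconcileKubeadmConfigEndpoints sketches the extra step proposed for
// ReconcileNormal: remove every apiEndpoints entry whose node no longer
// maps to an existing control plane Machine.
func reconcileKubeadmConfigEndpoints(ctx context.Context, wc WorkloadCluster, ownedMachines []Machine) error {
	// Build the set of node names backed by an existing CP machine.
	valid := map[string]bool{}
	for _, m := range ownedMachines {
		if m.NodeName != "" {
			valid[m.NodeName] = true
		}
	}

	// Drop every recorded endpoint whose node no longer has a machine.
	nodeNames, err := wc.APIEndpointNodeNames(ctx)
	if err != nil {
		return err
	}
	for _, name := range nodeNames {
		if !valid[name] {
			if err := wc.RemoveNodeFromKubeadmConfigMap(ctx, name); err != nil {
				return err
			}
		}
	}
	return nil
}
```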

Anything else you would like to add:
As of today, RemoveMachineFromKubeadmConfigMap updates the KubeadmConfig.Status regardless of whether the list has changed. This should be improved so that KubeadmConfig.Status is updated/patched only when the list actually changes.
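
A sketch of the suggested guard, assuming the apiEndpoints map has already been decoded from the ConfigMap; APIEndpoint and removeEndpointIfPresent are illustrative names, not actual cluster-api code:

```go
package sketch

// APIEndpoint mirrors the shape of an entry in ClusterStatus.apiEndpoints
// as stored in the kubeadm-config ConfigMap (illustrative, not imported).
type APIEndpoint struct {
	AdvertiseAddress string
	BindPort         int32
}

// removeEndpointIfPresent deletes nodeName from the endpoints map and reports
// whether anything changed, so the caller can skip the update when nothing did.
func removeEndpointIfPresent(endpoints map[string]APIEndpoint, nodeName string) bool {
	if _, ok := endpoints[nodeName]; !ok {
		return false // endpoint already absent: skip the update/patch entirely
	}
	delete(endpoints, nodeName)
	return true
}
```

The caller would then marshal and patch the ConfigMap only when this returns true.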

Environment:

  • Cluster-api version: v0.3.3

/kind bug
@vincepri @benmoss

area/control-plane help wanted kind/bug priority/important-soon

All 20 comments

/retitle KCP should continuously reconcile endpoints in kubeadm ConfigMap Status

/milestone v0.3.x

@fabriziopandini This is partially covered here: https://github.com/kubernetes-sigs/cluster-api/pull/2841
When we detect a machine deletion (the machine does not exist but ETCD still has it as a member), it is removed from the kubeadm ConfigMap (and from the ETCD member list).

I guess separating that logic by doing the same check starting from the kubeadm ConfigMap makes sense. This way, the ETCD check only cleans up ETCD members instead of also changing the kubeadm ConfigMap.
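
Roughly, the separated ETCD-side check would look something like the sketch below; EtcdClient and EtcdMember are hypothetical stand-ins for KCP's real etcd management code:

```go
package sketch

import "context"

// Hypothetical stand-ins for the etcd management client used by KCP.
type EtcdMember struct {
	ID   uint64
	Name string // member name, matching the node name
}

type EtcdClient interface {
	Members(ctx context.Context) ([]EtcdMember, error)
	RemoveMember(ctx context.Context, id uint64) error
}

// reconcileEtcdMembersSketch compares etcd members against the node names
// backed by existing Machines and removes only orphaned etcd members,
// leaving the kubeadm ConfigMap to its own reconcile loop.
func reconcileEtcdMembersSketch(ctx context.Context, etcd EtcdClient, machineNodeNames map[string]bool) error {
	members, err := etcd.Members(ctx)
	if err != nil {
		return err
	}
	for _, m := range members {
		if !machineNodeNames[m.Name] {
			// Machine is gone but etcd still lists the member: clean it up.
			if err := etcd.RemoveMember(ctx, m.ID); err != nil {
				return err
			}
		}
	}
	return nil
}
```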

I can combine this into that PR. WDYT?

Yes @sedefsavas, having a dedicated reconcile loop for the kubeadm ConfigMap is, IMO, more robust than managing changes to the ConfigMap as a derivative of other workflows (scale, delete).

Hopefully, this can lead to the simplification of other workflows as well, but this is not the primary goal of this issue and we will get there incrementally.

That said, considering #2841, I'll defer to @vincepri on whether to continue with this or bump the priority/punt.

Let's hold off on this until we have annotations removed from KCP; @benmoss has been working on it.

We can revisit the milestone once that's merged in. One reason I would like to have one refactor at a time is that we have few folks looking at/reviewing the codebase, and we should take our time to make sure the changes are correct and have been validated.

/priority important-soon

/milestone v0.3.9
/help

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/milestone v0.3.9
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/milestone v0.3.10

Is anyone interested in doing this? We're cutting it a little close at this point; moving the milestone.

/milestone v0.4.0

/area control-plane

@fabriziopandini I had a conversation with @sedefsavas about this, and she was of the view that I should wait for the KCP changes to be forward-ported to the main branch before introducing any other changes to the KCP controller.

cc: @vincepri

@srm09 I agree with @sedefsavas; I'm trying to unblock the latest forward-porting PRs.

@fabriziopandini Have the PRs been forward ported into the main branch?

@srm09 just to avoid misunderstanding, which PR are you talking about, #3977?

Going back to this issue: now that we have reconcileEtcdMembers being called outside of scale up/scale down, I think this issue can be closed.
@srm09 @vincepri opinions?

I guess reconcileEtcdMembers does make sure that the list of machines is in sync with the actual CP nodes, so for the purposes of this issue, the main concern seems to be taken care of.

/close

@fabriziopandini: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
