⚠️ Cluster API maintainers can ask to turn an issue-proposal into a CAEP when necessary. This is to be expected for large changes that impact multiple components, breaking changes, or large new features.
Goals
Non-Goals/Future Work
User Story
As an operator, I want kubeadm to have better support for Cluster API's use cases to reduce the number of failed machines in my infrastructure.
Detailed Description
In a number of environments, machines can intermittently fail to bootstrap. The most common failures occur during control plane joins, which cause temporary changes in etcd and API server availability; how long these last is mediated by the speed of the underlying infrastructure and the particulars of infrastructure load balancers.
Some ugly hacks have been introduced, notably #2763, to retry kubeadm operations. As a long-term solution, Cluster API should be a good kubeadm citizen and make changes in kubeadm itself so that it performs the appropriate retries to cover the variety of infrastructure providers supported by Cluster API. In addition, the KCP controller re-implements some of the kubeadm logic.
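To illustrate the kind of wrapper Cluster API is forced to maintain today, here is a minimal retry sketch around invoking kubeadm; the subcommand, config path, attempt count, and backoff are illustrative and not taken from #2763:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runWithRetry retries a kubeadm invocation a few times with a fixed backoff.
// A plain non-zero exit code gives no machine-readable hint about whether the
// failure is transient (e.g. etcd membership churn) or fatal, so all we can do
// is try again.
func runWithRetry(attempts int, delay time.Duration, args ...string) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		out, err := exec.Command("kubeadm", args...).CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("attempt %d/%d: %v: %s", i+1, attempts, err, out)
		time.Sleep(delay)
	}
	return lastErr
}

func main() {
	// The config path here is illustrative only.
	err := runWithRetry(5, 30*time.Second, "join", "--config", "/run/kubeadm/join-config.yaml")
	if err != nil {
		fmt.Println("kubeadm join failed after retries:", err)
	}
}
```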
Contract changes [optional]
Data model changes [optional]
/kind proposal
cc @neolit123 @fabriziopandini
I would like to add:
Final thought.
Despite all the improvements we can add to kubeadm, a CLI cannot provide the same guarantees a reconciliation loop does. So it is also necessary for Cluster API to implement/improve the capability to detect failures in the CLI invocations and replace failed nodes.
agreed with all of @fabriziopandini's points.
kubeadm follows the philosophy of a CLI tool (like ssh, ftp, etc.) and it cannot anticipate all infrastructure-related failures. but having a sane / best-effort amount of retries in the CLI tool makes sense.
Support the effort to move kubeadm out-of-tree
hopefully scheduled for 1.19. depends a lot on sig-release and partly on sig-arch!
Make kubeadm retry operations based on data gathered from Cluster API users
this can be useful, no doubt. like i've mentioned today, interestingly we have not seen major complaints about the failures CAPI is seeing. users are applying custom timeouts around their cluster creation on custom infrastructure (e.g. "i know what my GCE running cluster needs").
Consider implementing machine-readable output for kubeadm to support #2554
@randomvariable can you expand on this point?
we have a tracking issue to support machine-readable output. but not sure how this relates to the failures. to my understanding one of the major issues we have in CAPI is that we cannot get a signal if kubeadm join returned > 0.
Re-factor relevant parts of kubeadm into a library consumable by the bootstrap and kubeadm control plane controllers
there is a tracking issue for that as well. it will be a long process and the timeline is unclear.
after the move, we can start working on that but for a period of time the exposed library will be unstable.
For #2254 we will likely have some component on the machine call back to an infrastructure API notification service (or back to the management cluster) to provide information about the failure. Providing users with access to the log is one case, but machine-readable output, which may actually amount to expanding the range of error codes, could update a specific condition on the Machine describing the exact kubeadm failure. A controller could then take appropriate remediative action. I agree this is long-term, however.
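As a rough sketch of the reporting side, assuming a hypothetical node-side agent and a hypothetical set of expanded kubeadm exit codes, the agent could translate the exit code into a condition to report back for the Machine; the exit-code values, condition type, and reason strings below are made up for illustration and are not an existing kubeadm contract:

```go
package bootstrapreport

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// conditionForExitCode maps a (hypothetical) expanded kubeadm exit code to a
// condition that a node-side agent could report back to the management
// cluster, so a controller can decide whether to remediate the Machine.
func conditionForExitCode(code int) metav1.Condition {
	c := metav1.Condition{
		Type:               "BootstrapSucceeded",
		LastTransitionTime: metav1.Now(),
	}
	switch code {
	case 0:
		c.Status = metav1.ConditionTrue
		c.Reason = "KubeadmJoinSucceeded"
	case 2: // hypothetical: preflight failure, likely fatal without changes
		c.Status = metav1.ConditionFalse
		c.Reason = "PreflightFailed"
	case 3: // hypothetical: etcd/API server temporarily unavailable, likely transient
		c.Status = metav1.ConditionFalse
		c.Reason = "ControlPlaneUnavailable"
	default:
		c.Status = metav1.ConditionFalse
		c.Reason = "KubeadmJoinFailed"
	}
	return c
}
```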
/area dependency
Removing this as a proposal; it seems more like a future cleanup.
/kind cleanup
WRT:
Some ugly hacks have been introduced, notably #2763 to retry kubeadm operations.
in 1.19 kubeadm merged a number of fixes and backported them to 1.17, 1.18:
https://github.com/kubernetes/kubeadm/issues/2091
https://github.com/kubernetes/kubeadm/issues/2092
https://github.com/kubernetes/kubeadm/issues/2093
https://github.com/kubernetes/kubeadm/issues/2094
/assign @fabriziopandini
for evaluation of this part.
adding up-to-date comments to the rest of the tasks:
Support the effort to move kubeadm out-of-tree
[1] timeline is unclear, we are blocked on the lack of policy for component extractions out of k/k.
we have stakeholders such as sig-arch and sig-release who see this as low-prio.
Make kubeadm retry operations based on data gathered from Cluster API users
the fixes above should address this task.
Consider implementing machine-readable output for kubeadm to support #2554
we did not merge any PRs in 1.19 for MRO as the contributor was busy with other tasks, but the boilerplate is in place.
Re-factor relevant parts of kubeadm into a library consumable by the bootstrap and kubeadm control plane controllers
this is very long term, potentially after [1]
Migrate to kubeadm v1beta2
v1beta1 is scheduled for removal in kubeadm 1.20 and my proposal would be to keep us on track for this effort.
/milestone v0.4.0
@neolit123 thanks for the update!
xref my comment from https://github.com/kubernetes-sigs/cluster-api/issues/3323#issuecomment-656771293:
We should stop exposing the kubeadm v1betax types in our KubeadmConfig/KubeadmControlPlane specs, and instead use our own types. This would allow us to separate what users fill in from which kubeadm API version we end up using in our bootstrap data. As @detiber pointed out, we still have to know which version of the kubeadm types to use when generating our kubeadm yaml file and when interacting with the kubeadm-config ConfigMap.
Some questions: If we go this route, it sounds like we would be hand picking what gets exposed in the capi equivalent of the kubeadm types, right? Is the idea to provide a better capi abstraction to the user? Would the mapping be along the lines of capi types <--> kubeadm types (v1betax) <--> kubeadm configmap? Given the large (and potentially increasing) number of fields that kubeadm exposes, wouldn't this approach lead to the issue of keeping capi types in sync with kubeadm types?
Also, it would be great to keep this issue in mind for any redesigns: https://github.com/kubernetes-sigs/cluster-api/issues/1584
If we go this route, it sounds like we would be hand picking what gets exposed in the capi equivalent of the kubeadm types, right? Is the idea to provide a better capi abstraction to the user?
Yes, but I think we could probably expose the majority of them in a way that makes more sense for our users. For example, with KubeadmControlPlane, we expose the full kubeadm ClusterConfiguration, which includes a field for KubernetesVersion... but we control the control plane version in KubeadmControlPlane.Spec.Version. It makes more sense to me not to expose the full ClusterConfiguration because there are fields we control elsewhere.
Given the large (and potentially increasing) number of fields that kubeadm exposes, wouldn't this approach lead to the issue of keeping capi types in sync with kubeadm types?
I do recognize this adds another layer and will likely duplicate a lot of fields between CAPI and kubeadm. However, we are currently locked to the kubeadm v1beta1 API version, and that version supports a fixed range of Kubernetes versions. v1beta1 eventually won't support newer Kubernetes versions. We know we'll eventually have to move to kubeadm v1beta2, and we'll want to move to newer API versions whenever they're available as well. I think it makes more sense for CAPI to insulate the user from kubeadm API versions: as a user, I don't want to think "My target version is Kubernetes v1.21.x - which kubeadm API version do I need?"
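A minimal sketch of the separation being discussed, assuming hypothetical CAPI-owned types; the field set, package name, and version-selection logic are illustrative, not the actual KubeadmConfig/KubeadmControlPlane API:

```go
package v1alpha4

// ClusterConfiguration is a hypothetical CAPI-owned mirror of the kubeadm
// ClusterConfiguration. It carries only what users legitimately need to set;
// KubernetesVersion is intentionally absent because
// KubeadmControlPlane.Spec.Version owns it, and no kubeadm API version is
// baked into the type.
type ClusterConfiguration struct {
	ControlPlaneEndpoint string          `json:"controlPlaneEndpoint,omitempty"`
	ClusterName          string          `json:"clusterName,omitempty"`
	FeatureGates         map[string]bool `json:"featureGates,omitempty"`
	// ... further fields as needed.
}

// kubeadmAPIVersionFor picks the kubeadm config API version to render for a
// given Kubernetes minor version, so users never have to think about it.
// The version boundary here is illustrative.
func kubeadmAPIVersionFor(kubernetesMinor int) string {
	if kubernetesMinor >= 15 {
		return "kubeadm.k8s.io/v1beta2"
	}
	return "kubeadm.k8s.io/v1beta1"
}
```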
One use case we would want to support:
Discussing with @andrewsykim, there are scenarios with quite a few of the CNIs where you don't want kube-proxy deployed. kubeadm supports this only via the "--skip-phases" CLI flag, so if we are going to provide our own types, we should see which of these CLI flags we may want to expose as API types.
i thought we had a ticket about the support to skip phases via the kubeadm configuration, but apparently we don't.
i can see this being a string slice under JoinConfiguration or InitConfiguration.
💯 yes please, ^^^^^^
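A minimal sketch of what that field could look like in hypothetical CAPI-owned types; the field name and placement are assumptions, and CABPK would have to translate the slice into the corresponding kubeadm CLI flag today:

```go
package v1alpha4

// InitConfiguration sketch: SkipPhases would be rendered by CABPK into
// `kubeadm init --skip-phases=...`.
type InitConfiguration struct {
	// SkipPhases lists kubeadm init phases to skip, e.g. "addon/kube-proxy"
	// for CNIs that ship their own kube-proxy replacement.
	SkipPhases []string `json:"skipPhases,omitempty"`
	// ... remaining init options.
}

// JoinConfiguration would carry the join-time equivalent, assuming kubeadm
// offers (or grows) the same knob for join.
type JoinConfiguration struct {
	SkipPhases []string `json:"skipPhases,omitempty"`
	// ... remaining join options.
}
```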
For reference, another use case for kubeadm as a library would be to simplify CABPK retry logic from a Windows perspective: https://github.com/kubernetes-sigs/cluster-api/pull/3616#discussion_r494571110. Currently the proposed solution is for the InfraMachine Spec to have an OsType that can be looked up by the CABPK controller so it can provide the correct retry script. This additional logic and script in CABPK would not be needed if kubeadm could be called as a library.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
/assign @randomvariable @yastij
We'll pick this up as a requirement for the node agent proposal in v1alpha4
/lifecycle frozen
I know we don't have a label for it, but just for tracking
/area node-agent
@randomvariable: The label(s) area/node-agent cannot be applied, because the repository doesn't have them
@randomvariable We can add one under https://github.com/kubernetes/test-infra/blob/57ffba1efeed46ad1eb03a4f7ea58c2bc530966b/label_sync/labels.yaml#L1469