Kubespray: Kubespray 3.0 discussion

Created on 14 Jul 2020 · 17 Comments · Source: kubernetes-sigs/kubespray

What would you like to be added:

Kubeadm control plane mode

kubeadm join is the recommended way to add non-first control plane nodes and worker nodes. We should set kubeadm_control_plane to true by default. I'm not sure it makes sense to keep the legacy 'kubeadm init everywhere' behavior around. Are there any edge cases with the control plane mode?
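For context, a sketch of what the control plane mode looks like at the kubeadm level (the endpoint, token, hash, and key are placeholders, not values kubespray itself uses):

```shell
# First control plane node: initialize the cluster and upload the certs
# so other control plane nodes can fetch them (--upload-certs).
kubeadm init --control-plane-endpoint "LOAD_BALANCER:6443" --upload-certs

# Each additional control plane node joins instead of re-running init:
kubeadm join LOAD_BALANCER:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>

# Worker nodes join without the control plane flags:
kubeadm join LOAD_BALANCER:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>
```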

Use etcdadm to manage external etcd cluster

There is a valid use case for an "external" etcd cluster not managed by kubeadm, especially when etcd is not deployed on the control plane nodes. Currently, etcd setup is fairly manual, fragile (e.g. during upgrades), and hard to debug. https://github.com/kubernetes-sigs/etcdadm is supposed to make etcd management easier, and in the long run kubeadm will eventually use etcdadm under the hood. It would be a good idea to implement it for the "external" etcd use case as well. Moreover, adding support for a BYO etcd cluster (#6398) should be fairly easy if we go down that path.
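For reference, the etcdadm workflow mirrors kubeadm's init/join model; a minimal sketch (the member endpoint is a placeholder):

```shell
# On the first etcd node: bootstrap a new cluster
etcdadm init

# On each additional etcd node: join via an existing member's client URL
etcdadm join https://10.0.0.10:2379
```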

Review CI matrix

  • Use a DAG to simplify the matrix, as GitLab supports it. Having a CI matrix per platform (Ubuntu, RH/CentOS, Debian, ...) would make it clearer to end-users what's officially supported.
  • Improve molecule test coverage across supported OSes. Most likely, some role rewrites are required to make roles independent and easier to test in isolation.
  • Use rules to run only relevant CI jobs (just markdown checks for docs, just specific tests for a network plugin change, ...) in order to speed up the feedback loop.
  • Separate the provisioning/setup from the validation. That would allow us to add the Node conformance test as part of the CI pipeline.
  • Ensure all playbooks are tested properly (cluster.yml, recover-control-plane.yml, remove-node.yml, reset.yml, scale.yml, upgrade-cluster.yml)
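As an illustration of the rules: idea, a hedged .gitlab-ci.yml fragment (job names and paths are made up for the example):

```yaml
# Only run the markdown checks when docs change,
# and a network-plugin job only when that plugin changes.
markdown-lint:
  script: markdownlint docs/
  rules:
    - changes:
        - "**/*.md"

calico-e2e:
  script: ./tests/run.sh calico
  rules:
    - changes:
        - "roles/network_plugin/calico/**/*"
```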

Switch cgroup driver default to systemd

Kubespray officially supports only systemd-based Linux distros, so we should not run two cgroup managers (see https://github.com/kubernetes/kubeadm/issues/1394#issuecomment-462878219 for technical details).
This is a backward-incompatible change, so maybe default to systemd for new installs but keep the current setting for upgrades?
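For context, switching amounts to aligning the kubelet and the runtime on the same driver; with kubeadm this is a single KubeletConfiguration field (illustrative fragment):

```yaml
# KubeletConfiguration passed to kubeadm
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
```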

Remove docker requirements

There are still some hardcoded docker commands in the code (network plugins, etcd, node role, ...). One of kubespray's goals is to "Deploy a Production Ready Kubernetes Cluster", so for security reasons it should NOT ship a container engine capable of building new container images by default. Containerd would be a more secure default. To make that transition, we need to use crictl wherever docker is used today.
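Most of the hardcoded invocations have direct crictl equivalents; a sketch (crictl needs a configured CRI endpoint, e.g. containerd's socket):

```shell
# docker ps             -> crictl ps            (list running containers)
# docker images         -> crictl images        (list images)
# docker pull IMAGE     -> crictl pull IMAGE    (pull an image)
# docker logs ID        -> crictl logs ID       (container logs)
# docker exec -it ID sh -> crictl exec -it ID sh
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
```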

Why is this needed:
We need to address technical debt. The code base is broad, and some areas are old and unmaintained. I'd like to use the next major release as an opportunity to trim the code base as much as possible and make the CI more agile so we get quicker feedback.

/cc @floryut, @Miouge1, @mattymo, @LuckySB

kind/feature


All 17 comments

Remove docker requirements

There are still some hardcoded docker commands in the code (network plugins, etcd, node role, ...). One of kubespray's goals is to "Deploy a Production Ready Kubernetes Cluster", so for security reasons it should NOT ship a container engine capable of building new container images by default. Containerd would be a more secure default. To make that transition, we need to use crictl wherever docker is used today.

I'm all in for that. A PR was raised a long time ago to set containerd as the default runtime (but was dropped as too much work and too breaking a change); it would allow us to get rid of a lot of default docker commands and at the same time move toward something more CRI-oriented.

RELEASE.md says:

Kubespray doesn't follow semver. [...] Breaking changes, if any introduced by changed defaults or non-contrib ansible roles' playbooks, shall be described in the release notes.

AFAIK we already made non-backwards-compatible changes in the v2.x of Kubespray (when moving to kubeadm, for instance). The "production ready" part is a lot about providing a path for people to move from v2.X to v2.(X+1).

What I'm saying is that we can do breaking changes (like changing default container engine) as long as they are accepted by the community and well documented.

@EppO I thought non-kubeadm was removed in #3811; are there other things that need clean-up? kubeadm has been the only supported deployment method since v2.9.

For the GitLab CI rules: and only:changes:, last I checked GitLab CI (via Failfast) is unaware of the target branch and therefore doesn't know what to compare against; the fallback mechanism explained here is problematic for PRs with multiple commits.
Another area to consider is that Prow has support for such features (see run_if_changed in https://github.com/kubernetes/test-infra/blob/master/prow/jobs.md)

For conformance tests, there is sonobuoy_enabled: true available and I think it's enabled on 2 CI jobs currently: config and output

@MarkusTeufelberger has some very valuable input on role design and molecule, raised a couple of issues around it. Examples: #4622 #3961

RELEASE.md says:

Kubespray doesn't follow semver. [...] Breaking changes, if any introduced by changed defaults or non-contrib ansible roles' playbooks, shall be described in the release notes.

AFAIK we already made non-backwards-compatible changes in the v2.x of Kubespray (when moving to kubeadm, for instance). The "production ready" part is a lot about providing a path for people to move from v2.X to v2.(X+1).

Good to know. I was more worried about end-users who may not know this and end up breaking production clusters while trying to upgrade, hence a 3.0 proposal that is more explicit about that kind of breaking change.

@EppO I thought non-kubeadm was removed in #3811; are there other things that need clean-up? kubeadm has been the only supported deployment method since v2.9.

I missed it because I hadn't changed my inventory in a while, and some deprecated options were still there. I think it would also be beneficial for end-users to list the deprecated inventory options for each release. I guess I'm not the only one with some old settings :)

For the GitLab CI rules: and only:changes:, last I checked GitLab CI (via Failfast) is unaware of the target branch and therefore doesn't know what to compare against; the fallback mechanism explained here is problematic for PRs with multiple commits.
Another area to consider is that Prow has support for such features (see run_if_changed in https://github.com/kubernetes/test-infra/blob/master/prow/jobs.md)

I hear you. We can't use pipelines for merge requests because we don't create the merge request in GitLab, so that's a dead end. But I'm convinced we should architect the CI around better change detection to get a quicker feedback loop; if Prow is an option, we should look at it.

For conformance tests, there is sonobuoy_enabled: true available and I think it's enabled on 2 CI jobs currently: config and output

I guess we have some work to do in that area then :)

The maximum supported Kubernetes version is 1.16.99, but the server version is v1.18.5. Sonobuoy will continue but unexpected results may occur.

Ideally we should run conformance tests regularly to test various setup combinations and not wait until release time to run the full conformance suite. That's why I was suggesting separating them from the install/upgrade use cases.

etcd_kubeadm_enabled: false

What about etcd? Should we change that default to true? It makes etcd upgrades impossible outside of Kubernetes upgrades; kubeadm still doesn't support upgrading etcd without the Kubernetes components, AFAIK.
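For reference, the flip being discussed is this inventory default (fragment and file path are illustrative):

```yaml
# group_vars fragment
# current default: etcd is provisioned by kubespray's own etcd role
etcd_kubeadm_enabled: false
# proposed default: let kubeadm run etcd as static pods on control plane nodes
# etcd_kubeadm_enabled: true
```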

  • Add CI job to test scale playbook

I also thought about that; scale and remove need some love from CI.

Flip default of var kubeadm_control_plane to true and remove "experimental" from code?

etcd_kubeadm_enabled: true
makes all etcdctl-related use cases stop working.
There is also no backup procedure for kubeadm-managed etcd.
I started looking at it.
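A backup for kubeadm-managed etcd would presumably boil down to an etcdctl snapshot along these lines (sketch; the cert paths are kubeadm's defaults, the destination path is made up):

```shell
# Take a snapshot of a kubeadm-managed etcd member (run on a control plane node)
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save /var/backups/etcd-snapshot.db
```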

Flip default of var kubeadm_control_plane to true and remove "experimental" from code?

That's actually what I was referring to with "Drop non-kubeadm deployment", but I mixed up two different use cases: since 2.9, kubespray _always_ uses kubeadm to provision the cluster, but it doesn't use kubeadm join on the non-first control plane nodes by default (just another run of kubeadm init).
I think the join model is the right way forward.

Personally I'd like to drop a few features that are relatively exotic or easy to work around/implement yourself such as downloading binaries and rsync'ing them around instead of just fetching them on each node. This could really simplify the download role.

Another bigger architectural change could be to change kubespray into a collection (maybe even adding some roles to https://github.com/ansible-collections/community.kubernetes eventually and/or using them here?) and in general switching to Ansible 2.10.

Personally I'd like to drop a few features that are relatively exotic or easy to work around/implement yourself such as downloading binaries and rsync'ing them around instead of just fetching them on each node. This could really simplify the download role.

I'd prefer to rely on the distro package manager when applicable instead of downloading all the stuff but if you have a better design for the download role, feel free to submit a PR.

Another bigger architectural change could be to change kubespray into a collection (maybe even adding some roles to https://github.com/ansible-collections/community.kubernetes eventually and/or using them here?) and in general switching to Ansible 2.10.

Ansible 2.10 is not released yet, and we need to be careful about which Ansible version is available on each supported distro.
Regarding the usage of kubespray, I know @Miouge1 wanted to promote the container image use case, where you build your own custom image with your inventory and custom playbooks. That definitely makes sense in a CI pipeline.
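The container image use case looks roughly like this (sketch; the tag and mount paths are illustrative, assuming the quay.io/kubespray/kubespray image):

```shell
# Run kubespray from a released image, mounting your inventory and SSH key
docker run --rm -it \
  -v "$(pwd)/inventory:/kubespray/inventory" \
  -v "$HOME/.ssh/id_rsa:/root/.ssh/id_rsa" \
  quay.io/kubespray/kubespray:v2.14.0 \
  ansible-playbook -i inventory/mycluster/hosts.yaml cluster.yml
```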

Reducing scope and configurability of Kubespray would be nice.
List of features that could be removed:

  • canal
  • ???

The more I think about it, the more I'm convinced kubespray should only provision kubernetes clusters on top of kubeadm, so we should only support the following 2 use cases on the etcd front:

  • BYO etcd (either by using etcdadm or other means, out of scope of kubespray)
  • etcd managed by kubeadm

That means removing the etcd_deployment_type mode kubespray supports today. We would still test the BYO etcd use case in the CI though.

The more I think about it, the more I'm convinced kubespray should only provision kubernetes clusters on top of kubeadm, so we should only support the following 2 use cases on the etcd front:

  • BYO etcd (either by using etcdadm or other means, out of scope of kubespray)
  • etcd managed by kubeadm

That means removing the etcd_deployment_type mode kubespray supports today. We would still test the BYO etcd use case in the CI though.

We could formulate the same in some kind of design statement on how Kubespray embraces, uses, and extends kubeadm rather than working around it.

We need to address technical debt. The code base is broad, and some areas are old and unmaintained. I'd like to use the next major release as an opportunity to trim the code base as much as possible and make the CI more agile so we get quicker feedback.

Helm 3.x has been released since Kubespray 2.x; it no longer requires a tiller pod and is integrated with Kubernetes RBAC. I think it would be better for Kubespray to refocus on its core competency: deploying production Kubernetes. The most widely used plugins (CNI/CSI) can be included in this, but apps that have a decent helm chart should now be deployed using that. Helm vs Ansible for deploying apps to Kubernetes is a no-brainer: thanks to its state, Helm is truly declarative and Ansible is not. For example, uninstall a helm release and your app is removed from k8s; undefine an addon in Kubespray (e.g. cert_manager_enabled=false) and it remains. Most helm charts are also better maintained than the addons in this project. I get the desire for Kubespray to be a one-stop shop, so we could either replace the addons with simple readme guidance explaining how to install the former addons using helm, or, if workable, install the helm client and version-pinned helm charts using Kubespray.

Would significantly simplify this project and the maintenance burden.
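As an example of the Helm-first approach for a former addon (sketch; the chart version is illustrative, cert-manager being one of the addons mentioned above):

```shell
# Install cert-manager from its upstream chart, version-pinned
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --version v0.16.0

# And unlike flipping cert_manager_enabled=false, this actually removes it:
helm uninstall cert-manager --namespace cert-manager
```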

I think we are very close to be able to use kubeadm managed etcd as the default.
What do you think about that?

Maybe we could deal with Helm apps in a separate GitHub project?

This project would only focus on:

  • pinning app versions to specific Kubernetes versions if needed
  • structuring some playbooks and roles to deploy easily with a kubespray-structured inventory (i.e. run Helm on masters)
  • maybe deploying Helm?

The attached CI would not require a kubespray deployment: an inventory plus any Kubernetes cluster should be enough. This would save people from rewriting their own Helm addon playbooks and roles.

EDIT: I first mentioned the dashboard as a helm chart; bad example, it's plain yaml, so I removed it. Btw we may think about moving the dashboard out of kubespray's scope in favor of Helm :)
EDIT 2: after searching a bit, it seems there is no helm chart for the dashboard.
