Detailed Description
I have been working on an infrastructure provider implementation using Kubernetes Pods as Nodes and was wondering if it would be useful to add to the list of out-of-the-box providers: https://github.com/dippynark/cluster-api-provider-kubernetes
The provider would be useful for testing and experimentation rather than production so potentially it isn't the right fit.
/kind feature
As far as I understand, this provider is based on kind, so it falls into the same category as CAPD: developer tools.
Currently, CAPD is not included in the list of out-of-the-box providers, but there is documentation about how to use it in the Cluster API books under developer tools. What about using the same approach for this provider as well?
Also, should we rename this cluster-api-provider-pods? (Kubernetes on AWS -> CAPAws, Kubernetes on pods -> CAPPod)
sounds good, will write something up
Each Node is its own Pod so a CAPKubernetes cluster is formed of multiple Pods - I'd say CAPod would correspond with something like CAPEc2 rather than CAPAws, so I feel saying Kubernetes is the infrastructure provider makes more sense.
cc @elmiko
this sounds like a cool project. i am curious though, is the intent to create a more robust development/debug provider or is there an intention to have this be production grade at some point?
@elmiko it started as just a good way to learn about cluster api, but I was hoping it'd be useful for something like testing controllers or distributed applications by allowing pipelines to spin up temporary clusters on existing clusters - there was no intention of making it production grade though.
One potential future use case would be to spin up just the control plane with this provider but connect real cloud provider nodes for testing, but I don't think this type of thing is supported by cluster api.
It could be, if there is a control plane provider that's pod-based folks can use it as object reference in Cluster.ControlPlaneRef, and then infrastructure references are actual infrastructure providers
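For example, a minimal sketch of such a Cluster could look like the following - the pod-based control plane group/kind here is purely hypothetical, and AWSCluster just stands in for any 'real' infrastructure provider:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Cluster
metadata:
  name: mixed-cluster
spec:
  # hypothetical pod-based control plane provider resource
  controlPlaneRef:
    apiVersion: controlplane.example.io/v1alpha1
    kind: PodControlPlane
    name: mixed-cluster-control-plane
  # actual infrastructure provider for the rest of the cluster, e.g. CAPA
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSCluster
    name: mixed-cluster
```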
Potentially that's already possible then - I've only tried with Pod-based worker machines.
I can't really find a good place to put this provider in the developer docs to fit in with what is there for docker, unless a separate page would be good? Either way the docker provider would probably always be more appropriate for local development anyway.
I think the list of infrastructure providers would be a good place to put a link, not sure if that list is reserved for 'real' infrastructure providers though.
I can't really find a good place to put this provider in the developer docs to fit in with what is there for docker, unless a separate page would be good?
i am trying to sync up with folks about re-doing the docker provider page and perhaps it would fit best as a parallel document?
i am still working on the pull request, but i am proposing removing docker from the quickstart guide and creating a single document instruction in the developer section for docker. it might be nice to create a section that would fit this provider as well. i would like to make it clear to our users where docker fits in the ecosystem, and by extension kubernetes sounds like it would fit this as well.
my one hesitation is that we currently use the docker provider for testing and @chuckha has created an issue to focus on improving the docker provider (#2738). i wouldn't want to create a mixed message for users, but it would be nice to have a section for these more experimental type providers, if only as examples for others.
one final point, i see some tests in the kubernetes provider repo, we would need to make sure that these continue to get run as part of our automation to make it easier for future maintenance.
edit:
there is also this proposal to improve the doc structure in general, #2121
Makes sense, especially around the potential for mixed messages for users, as I don't think the Kubernetes provider would be great for testing compared to the docker one.
I can hold off on any changes related to this issue then until the ones you linked are merged - if that new docker provider section gets created I can potentially copy the format to create a parallel one for the Kubernetes provider.
The e2e tests are using a recent version of the testing framework and all run locally on kind, so I don't think there'll be any issues running them as part of automation (although the tests are fairly simple atm).
I really like the simplicity of the service-based load balancer approach taken with your provider. I'm wondering if it would make sense to merge the different approaches taken between CAPD and this provider.
As an aside, I would suggest not using any references to UID, since it would not persist across a clusterctl move operation.
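As I understand it, that approach boils down to a plain Service fronting the control plane Pods, along the lines of this minimal sketch (the selector labels and names are assumptions, not copied from the provider):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-cluster-lb
spec:
  # assumed labels identifying the control plane Node Pods
  selector:
    cluster.example.io/cluster-name: example-cluster
    cluster.example.io/control-plane: "true"
  ports:
  - name: kube-apiserver
    port: 6443
    targetPort: 6443
```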
I'm wondering if it would make sense to merge the different approaches taken between CAPD and this provider.
that's an interesting thought. just so i'm following, this would mean using kind to create a kubernetes cluster then using the kubernetes-provider to spawn clusters within that kind cluster? (not very different from what we do now)
Exactly. At least on the cluster infrastructure side the kubernetes-provider offers a great deal of simplicity compared to what we are doing in CAPD today. I'd have to do a more in-depth comparison between the two on the machine infrastructure side, but I would guess they aren't too different outside of the libraries and scaffolding they are using to do the instance creation and bootstrapping.
makes sense, thanks for the explanation!
@detiber thanks for taking a look, yeah a lot of the behaviour was inspired (or copied) from the docker provider. I think the main difference around machine provisioning is that the kubernetes provider creates a systemd unit on each Node which runs the cloudinit script and uses a single exec just to start the unit, whilst the docker provider runs each cloudinit command as a separate exec.
As an aside, I would suggest not using any references to UID, since it would not persist across a clusterctl move operation.
I think that was intentional, IIRC, because workload clusters currently have to run on the management cluster, so if someone does a move the Nodes need to be recreated in the new management cluster anyway. In general I went with the idea that the Nodes (Pods) should never restart, as that would reset the COW layer, which wouldn't happen on a VM; using the UID as the provider ID and restartPolicy: Never should ensure that once a Node joins the cluster its lifecycle is tied to the exact Pod instance it corresponds to.
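As a rough illustration (the names, node image and provider ID scheme below are assumptions rather than the exact implementation):

```yaml
# Sketch of a Node Pod - names and the node image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-cluster-worker-0
spec:
  restartPolicy: Never   # a restart would reset the COW layer, so the Node Pod is never restarted
  containers:
  - name: node
    image: kindest/node:v1.17.0   # placeholder node image
# Once the Node joins, the Machine's provider ID is derived from this Pod's UID
# (the exact scheme is an assumption here), e.g.:
#   providerID: kubernetes://<pod-uid>
```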
Potentially there could be some changes there to allow the kubernetes provider to manage workload clusters on remote clusters and having it so that a move keeps the existing workload cluster on the old management cluster.
One interesting point about this provider is the possibility to span a test cluster across many nodes.
I second @detiber's comments that it would be interesting to explore a possible CAPD/Kubernetes provider convergence, but supporting move is a hard requirement IMO.
@fabriziopandini I guess I'm just not too sure what a move implementation would look like since currently the controller assumes we're deploying to the local cluster.
The only thing I can think of would be for the KubernetesCluster resource to trigger the creation of a ServiceAccount with enough permissions to manage the local cluster. A corresponding token, CA cert and external apiserver IP (maybe guessed from kubectl get endpoints kubernetes -o yaml or operator specified) would give a kubeconfig which can be moved with clusterctl move so that the controller can continue managing the cluster.
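A minimal sketch of what the resulting kubeconfig could look like (every value below is a placeholder):

```yaml
apiVersion: v1
kind: Config
clusters:
- name: management
  cluster:
    server: https://<external-apiserver-ip>:6443      # guessed or operator specified
    certificate-authority-data: <base64-encoded-ca-cert>
users:
- name: capk-manager                                  # placeholder ServiceAccount identity
  user:
    token: <serviceaccount-token>
contexts:
- name: management
  context:
    cluster: management
    user: capk-manager
current-context: management
```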
I can't see many use cases where this would be valuable though, and it would add quite a lot of complexity and in some cases would not work (e.g. if the source and target clusters can't reach each other), although I guess it's more a case of making this provider fully compatible with cluster-api.
I can't see many use cases where this would be valuable
@dippynark supporting move is part of a common workflow (from bootstrap to a self-hosted cluster) and we should ensure this is tested by our E2E. So this is a requirement if we want to merge CAPD and the Kubernetes provider
the controller assumes we're deploying to the local cluster.
I think that, as other providers do, this provider should accept some parameters specifying where the infrastructure lives. In this case, it should be a config map with the kubeconfig for the hosting cluster (possibly resolving to the local cluster if this config map does not exist).
it would add quite a lot of complexity
I hope the change can be scoped to the connections step...
in some cases would not work (e.g. if source and target cluster can't reach each other)
I guess this is fine; it applies to all the other providers as well.
@fabriziopandini a user-provided kubeconfig sounds a lot nicer - would it be a problem if the default behaviour remained the current behaviour, where a move results in cluster recreation since all the local Nodes/Pods have disappeared?
And then if a secret ref to a kubeconfig is provided, it is used instead of the in-cluster config, and it would succeed if the source apiserver is still reachable?
I guess I can find some inspiration in the cluster-api core code for watching with multiple clients.
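Something like this minimal sketch is what I have in mind - the kubeconfigSecretRef field and the API group below are hypothetical and don't exist in the provider today:

```yaml
apiVersion: infrastructure.example.io/v1alpha1   # placeholder group, not the provider's real one
kind: KubernetesCluster
metadata:
  name: example
spec:
  # Hypothetical field: when set, the controller would load this kubeconfig
  # instead of the in-cluster config and manage the workload cluster's Pods
  # on the referenced (possibly remote) hosting cluster.
  kubeconfigSecretRef:
    name: example-hosting-cluster-kubeconfig
    key: kubeconfig
```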
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.