Cluster-api: Better handle the error case when kubeconfig does not exist

Created on 27 Sep 2019 · 15 Comments · Source: kubernetes-sigs/cluster-api

What steps did you take and what happened:
If I create only a Cluster object, cluster-api will error-loop forever even though it's not really an error case. The error happens during reconcileKubeconfig: my infrastructure is all ready to go and we have an API endpoint (the load balancer), and at this point cluster-api expects that if the infrastructure is ready, the k8s cluster must also be ready. However, I've only created a Cluster object.

What did you expect to happen:
I do not expect an infinite error reconcile pattern during a valid scenario.

I'm assuming creating just a Cluster is a valid scenario.

Anything else you would like to add:
If we cannot find the cluster CA as a secret we should not return an error but perhaps issue a warning and maybe a re-reconcile?

E0927 15:40:50.580349       1 controller.go:218] controller-runtime/controller "msg"="Reconciler error" "error"="Secret \"my-cluster-ca\" not found"  "controller"="cluster" "request"={"Namespace":"default","Name":"my-cluster"}

/kind bug
/priority longterm-important

kind/bug priority/important-longterm

chuckha

All 15 comments

@chuckha: The label(s) priority/longterm-important cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

k8s-ci-robot on 27 Sep 2019

/priority important-longterm

detiber on 27 Sep 2019

/assign

tahsinrahman on 27 Sep 2019

Maybe we should call reconcileKubeconfig only if controlplane machines are present in the cluster?
The cluster controller watches for controlplane machines, so the cluster will be re-queued after the controlplane machine is deployed, and eventually the kubeconfig secrets will be created.
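The gating idea in this comment might look something like the sketch below. The `Machine` type, its `ControlPlane` field, and both helper functions are hypothetical simplifications for illustration; they are not the real cluster-api types.

```go
package main

import "fmt"

// Machine is a hypothetical, simplified stand-in for a Machine object.
type Machine struct {
	Name         string
	ControlPlane bool
}

// hasControlPlane reports whether any machine in the list is a
// control plane machine.
func hasControlPlane(machines []Machine) bool {
	for _, m := range machines {
		if m.ControlPlane {
			return true
		}
	}
	return false
}

// shouldReconcileKubeconfig gates kubeconfig reconciliation on both the
// infrastructure being ready and a control plane machine existing.
func shouldReconcileKubeconfig(infrastructureReady bool, machines []Machine) bool {
	return infrastructureReady && hasControlPlane(machines)
}

func main() {
	// A Cluster with ready infrastructure but no Machines yet.
	fmt.Println(shouldReconcileKubeconfig(true, nil))

	// The same Cluster once a control plane machine has been created.
	machines := []Machine{{Name: "cp-0", ControlPlane: true}}
	fmt.Println(shouldReconcileKubeconfig(true, machines))
}
```

Because the controller is re-triggered when a control plane machine appears, skipping the kubeconfig step here would not need an explicit requeue under this assumption.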

tahsinrahman on 27 Sep 2019

Is having a Cluster without any Machines a valid case though? It'll never become a Kubernetes cluster or be usable.

vincepri on 30 Sep 2019

@vincepri If we consider externally managed controlplanes or managed "Node Pools" backed by scale groups, then it might potentially be a valid use case, since the resources wouldn't be backed by individual Machines.

detiber on 30 Sep 2019

The kubeconfig will be created by other means in that scenario though, which wouldn't cause the behavior reported above.

vincepri on 30 Sep 2019

@chuckha @tahsinrahman Given that a Cluster without any other resource can't become a Kubernetes cluster, I'd like to close this issue and leave things as they are.

vincepri on 30 Sep 2019

The use case is that I want to provision the cluster infrastructure but I'm not ready to get my machines up and running; I want to see what cluster-infrastructure has done. The expectation is that at some point I will be creating machines that are attached to this cluster.

There really shouldn't be core Cluster objects sitting around that aren't going to turn into k8s clusters at some point.

chuckha on 30 Sep 2019

That should be valid, apart from the fact that the reconciler will keep retrying?

vincepri on 30 Sep 2019

right, the issue is about the fact that the reconciler should not be returning an error because it's not an error state.

chuckha on 30 Sep 2019

The question I have is whether this is an actual issue that requires a fix. We definitely want to requeue, because we have no way to tell when the certificates are going to show up, and the use case you provided seems related to testing, which doesn't fall in the 80% use case.

vincepri on 30 Sep 2019

yeah requeuing is definitely fine! but the fact it returns an error makes it seem like something unexpected is happening. Which, maybe it is depending on your point of view... 🤔

chuckha on 30 Sep 2019

I usually consider errors to be the ones that go into the exponential backoff. The requeue-after isn't really an error; in fact, I think it doesn't get returned as such (in the main reconciler function).

vincepri on 30 Sep 2019

We should audit all our code paths that return errors and decide if each one is actually worth returning as an error. It's important to remember that any error we do return is largely invisible to the end user/consumer. The error is logged in the pod's logs, but it isn't surfaced to the user unless we record events or update a status field on the resource in question.
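One way to make a waiting state visible to the user, rather than leaving it only in the pod logs, is sketched below. All names here (`Cluster`, `StatusMessage`, `Event`, `markWaiting`, the `WaitingForClusterCA` reason) are hypothetical stand-ins; a real controller would presumably go through something like client-go's `record.EventRecorder` and the resource's actual status fields.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a Kubernetes event.
type Event struct {
	Type    string
	Reason  string
	Message string
}

// Cluster is a hypothetical, simplified resource carrying a
// user-visible status message and a list of recorded events.
type Cluster struct {
	Name          string
	StatusMessage string
	Events        []Event
}

// markWaiting surfaces a non-error waiting state in two places:
// the status field on the resource and a recorded event.
func markWaiting(c *Cluster, reason, message string) {
	c.StatusMessage = message
	c.Events = append(c.Events, Event{Type: "Normal", Reason: reason, Message: message})
}

func main() {
	c := &Cluster{Name: "my-cluster"}
	markWaiting(c, "WaitingForClusterCA", `Secret "my-cluster-ca" not found yet`)

	fmt.Println(c.StatusMessage)
	for _, e := range c.Events {
		fmt.Printf("%s %s: %s\n", e.Type, e.Reason, e.Message)
	}
}
```

The point of the sketch is only that the same information ends up somewhere a user inspecting the resource would see it.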