What steps did you take and what happened:
If I create only a Cluster object, cluster-api error-loops forever, even though this is not really an error case. The error occurs in reconcileKubeconfig: my infrastructure is ready and we have an API endpoint (the load balancer), and at that point cluster-api assumes that if the infrastructure is ready, the Kubernetes cluster must also be ready. However, I've only created a Cluster object.
What did you expect to happen:
I do not expect an infinite error reconcile pattern during a valid scenario.
I'm assuming creating just a Cluster is a valid scenario.
Anything else you would like to add:
If we cannot find the cluster CA secret, we should not return an error. Perhaps we should log a warning and requeue instead?
E0927 15:40:50.580349 1 controller.go:218] controller-runtime/controller "msg"="Reconciler error" "error"="Secret \"my-cluster-ca\" not found" "controller"="cluster" "request"={"Namespace":"default","Name":"my-cluster"}
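To make the suggestion concrete, here is a minimal sketch of treating the missing CA secret as "not ready yet" rather than as an error. It uses only the standard library; `Result`, `SecretGetter`, `ErrSecretNotFound`, and the 30-second delay are all hypothetical stand-ins, not the actual cluster-api code (a real controller would likely check the error with `apierrors.IsNotFound` and return a controller-runtime result).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrSecretNotFound is a hypothetical stand-in for a Kubernetes
// "not found" API error.
var ErrSecretNotFound = errors.New("secret not found")

// Result is a hypothetical type mirroring the shape of a reconcile result.
type Result struct {
	RequeueAfter time.Duration
}

// SecretGetter is a hypothetical lookup for a secret's contents.
type SecretGetter func(namespace, name string) ([]byte, error)

// reconcileKubeconfig requeues quietly while the cluster CA secret is
// missing and returns an error only for unexpected failures.
func reconcileKubeconfig(get SecretGetter, namespace, cluster string) (Result, error) {
	name := cluster + "-ca"
	_, err := get(namespace, name)
	switch {
	case errors.Is(err, ErrSecretNotFound):
		// Expected while no control plane exists yet: warn and retry later.
		fmt.Printf("warning: secret %q not found yet, requeuing\n", name)
		return Result{RequeueAfter: 30 * time.Second}, nil
	case err != nil:
		// Anything else is a genuine error and should go through backoff.
		return Result{}, fmt.Errorf("getting secret %q: %w", name, err)
	}
	// The CA exists; kubeconfig generation would happen here.
	return Result{}, nil
}

func main() {
	missing := func(_, _ string) ([]byte, error) { return nil, ErrSecretNotFound }
	res, err := reconcileKubeconfig(missing, "default", "my-cluster")
	fmt.Println(res.RequeueAfter, err)
}
```

With this shape, the log would show a warning instead of a "Reconciler error" line, and the Cluster would still be retried.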
/kind bug
/priority longterm-important
@chuckha: The label(s) priority/longterm-important cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other
/priority important-longterm
/assign
Maybe we should call reconcileKubeconfig only if control plane machines are present in the cluster?
The cluster controller watches for control plane machines, so the cluster will be re-queued after the control plane machine is deployed, and eventually the kubeconfig secrets will be created.
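A rough sketch of that gating idea, assuming a simplified `Machine` type with a `ControlPlane` flag (both hypothetical; the real Machine resource identifies control plane members differently) and a `reconcileKubeconfig` callback passed in for illustration:

```go
package main

import "fmt"

// Machine is a pared-down, hypothetical stand-in for the Machine resource.
type Machine struct {
	Name         string
	ControlPlane bool
}

// hasControlPlane reports whether any machine is a control plane member.
func hasControlPlane(machines []Machine) bool {
	for _, m := range machines {
		if m.ControlPlane {
			return true
		}
	}
	return false
}

// reconcile calls reconcileKubeconfig only when a control plane machine
// exists. It returns whether the kubeconfig step ran, plus any error from it.
func reconcile(machines []Machine, reconcileKubeconfig func() error) (bool, error) {
	if !hasControlPlane(machines) {
		// Nothing to do yet; the machine watch will requeue the cluster
		// once a control plane machine shows up.
		return false, nil
	}
	return true, reconcileKubeconfig()
}

func main() {
	noop := func() error { return nil }

	ran, err := reconcile(nil, noop)
	fmt.Println(ran, err)

	ran, err = reconcile([]Machine{{Name: "cp-0", ControlPlane: true}}, noop)
	fmt.Println(ran, err)
}
```

The trade-off is that a Cluster with no Machines produces no log output at all for this step, so the "waiting" state would need to be visible some other way.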
Is a Cluster without any Machines a valid case, though? It'll never become a Kubernetes cluster or be usable.
@vincepri If we consider externally managed control planes or managed "Node Pools" backed by scale groups, then it might be a valid use case, since the resources wouldn't be backed by individual Machines.
The kubeconfig will be created by other means in that scenario though, which wouldn't cause the behavior reported above.
@chuckha @tahsinrahman Given that a Cluster without any other resource can't become a Kubernetes cluster, I'd like to close this issue and leave things as they are.
The use case is that I want to provision the cluster infrastructure but I'm not ready to get my machines up and running; I want to see what cluster-infrastructure has done. The expectation is that at some point I will be creating machines that are attached to this cluster.
There really shouldn't be core Cluster objects sitting around that aren't going to turn into k8s clusters at some point.
That should be valid, apart from the fact that the reconciler will keep retrying?
Right, the issue is that the reconciler should not be returning an error, because it's not an error state.
The question I have is whether this is an actual issue that requires a fix. We definitely want to requeue, because we have no way to tell when the certificates are going to show up, and the use case you provided seems related to testing, which doesn't fall in the 80% use case.
Yeah, requeuing is definitely fine! But the fact that it returns an error makes it seem like something unexpected is happening. Which maybe it is, depending on your point of view...
I usually consider errors to be the ones that go into exponential backoff. A requeue-after isn't really an error; in fact, I don't think it gets returned as one (in the main reconciler function).
We should audit all our code paths that return errors and decide if each one is actually worth returning as an error. It's important to remember that any error we do return is largely invisible to the end user/consumer. The error is logged in the pod's logs, but it isn't surfaced to the user unless we record events or update a status field on the resource in question.
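As one possible shape for that audit, the sketch below turns an expected "waiting" condition into a status update plus an event instead of a returned error. Everything here is hypothetical and standard-library only: `ClusterStatus`, `EventRecorder`, `memoryRecorder`, `errWaitingForCA`, and the `WaitingForClusterCA` reason are illustrative names, not existing cluster-api types or fields.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterStatus is a hypothetical status block a user could inspect.
type ClusterStatus struct {
	Reason  string
	Message string
}

// EventRecorder is a hypothetical, minimal event sink.
type EventRecorder interface {
	Event(reason, message string)
}

// memoryRecorder keeps events in memory so the sketch runs on its own.
type memoryRecorder struct {
	events []string
}

func (r *memoryRecorder) Event(reason, message string) {
	r.events = append(r.events, reason+": "+message)
}

// errWaitingForCA is a hypothetical sentinel for the expected waiting state.
var errWaitingForCA = errors.New("cluster CA secret not found")

// surface makes expected conditions visible to the user through status and
// an event, and swallows them; unexpected errors are returned unchanged.
func surface(status *ClusterStatus, rec EventRecorder, err error) error {
	if errors.Is(err, errWaitingForCA) {
		status.Reason = "WaitingForClusterCA"
		status.Message = err.Error()
		rec.Event(status.Reason, status.Message)
		return nil
	}
	return err
}

func main() {
	rec := &memoryRecorder{}
	status := &ClusterStatus{}
	err := surface(status, rec, errWaitingForCA)
	fmt.Println(err, status.Reason, rec.events)
}
```

The point of the split is that the user sees why the Cluster is not progressing by looking at the resource itself, without reading the controller pod's logs.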