What steps did you take and what happened:
After following the instructions in the quick start guide and using the Machine object definition from the vSphere usage section, the VM is created successfully, but the capi-controller-manager logs clearly show that it is unable to retrieve the cluster's kubeconfig secret:
root@cli-vm:~# k logs capi-controller-manager-6c64c695bb-gwkj5 -n capi-system
E0220 15:36:11.664898 1 controller.go:218] controller-runtime/controller "msg"="Reconciler error"
"error"="failed to retrieve kubeconfig secret for Cluster \"capi-quickstart\" in namespace \"default\": Secret
\"capi-quickstart-kubeconfig\" not found" "controller"="machine" "request"=
{"Namespace":"default","Name":"capi-quickstart-controlplane-0"}
From the capv-controller-manager logs, the error message says KubeadmConfig.bootstrap.cluster.x-k8s.io "capi-quickstart-controlplane-0" not found, which is very strange as the object is created successfully.
k logs -n capv-system capv-controller-manager-88f646758-pr6wj
E0220 15:27:23.411515 1 controller.go:218] controller-runtime/controller "msg"="Reconciler error"
"error"="failed to reconcile API endpoints for VSphereCluster default/capi-quickstart: failed to get
KubeadmConfig capi-quickstart-controlplane-0/ for Machine default//capi-quickstart-controlplane-0:
KubeadmConfig.bootstrap.cluster.x-k8s.io \"capi-quickstart-controlplane-0\" not found"
"controller"="vspherecluster" "request"={"Namespace":"default","Name":"capi-quickstart"}
What did you expect to happen:
Once the cluster and machine objects are created, the <cluster-name>-kubeconfig secret should also be created successfully.
Anything else you would like to add:
After looking at the error message, I could see that the KubeadmConfig reference in the quick start guide's Machine object is missing the namespace property, which the controller's kubeadmConfigKey lookup expects to find in machine.Spec.Bootstrap.ConfigRef.Namespace.
Adding the namespace resolved the issue. I'll submit a PR to fix the docs to include the namespace, but since the namespace is not a required field, I expected the kubeconfig secret to be generated successfully even without it.
Environment:
Kubernetes version (kubectl version): 1.16.3
OS (/etc/os-release): Ubuntu
/kind bug
/kind documentation
Are you testing with v1alpha2 or v1alpha3 (master)?
@ncdc this is v1alpha2 as per the quick start guide.
Ok, the reason I was asking is because your PR is fixing master, which is for v1alpha3.
Things should be working without this change. Maybe you could post logs from all your controllers?
Ah, I missed that v1alpha3 points to master. I can share the logs, as this issue is easily reproducible in the environment I'm working on. Adding the namespace fixes it.
Log files + cluster and machine yaml capi-issue-2403.zip
Log files
root@cli-vm:~# k logs capv-controller-manager-88f646758-pr6wj -n capv-system > capv-controller-manager.log
root@cli-vm:~# k logs -n capi-system capi-controller-manager-6c64c695bb-jh4xj > capi-controller-manager.log
root@cli-vm:~# k logs -n cabpk-system cabpk-controller-manager-c58d8596f-bpfqd -c manager > cabpk-controller-manager.log
root@cli-vm:~# k logs -n cabpk-system cabpk-controller-manager-c58d8596f-bpfqd -c kube-rbac-proxy > cabpk-controller-rbac-proxy.log
Machine provisioned successfully
root@cli-vm:~# k get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 vsphere://4225f8f6-95a0-5f66-82fe-7912a24232ad provisioned
kubeconfig secret not generated
root@cli-vm:~# k get secrets
NAME TYPE DATA AGE
capi-quickstart-ca Opaque 2 3m27s
capi-quickstart-etcd Opaque 2 3m27s
capi-quickstart-proxy Opaque 2 3m27s
capi-quickstart-sa Opaque 2 3m27s
default-token-wnfhn kubernetes.io/service-account-token 3 3d9h
KubeadmConfig object is present
root@cli-vm:~# k get kubeadmconfig
NAME AGE
capi-quickstart-controlplane-0 3m36s
OK, here is what I believe is happening:
From the capv-controller-manager logs, the error message says KubeadmConfig.bootstrap.cluster.x-k8s.io "capi-quickstart-controlplane-0" not found, which is very strange as the object is created successfully.
This is a temporary "error" that is logged because the KubeadmConfig is not in the controller's cache for a period of time. It eventually goes away. Safe to ignore.
controller-runtime/controller "msg"="Reconciler error" "error"="failed to retrieve kubeconfig secret for Cluster \"capi-quickstart\" in namespace \"default\": Secret \"capi-quickstart-kubeconfig\" not found" "controller"="machine" "request"= {"Namespace":"default","Name":"capi-quickstart-controlplane-0"}
Note the case of "kubeconfig" in "failed to retrieve kubeconfig secret" - it's all lowercase. This comes from remote.NewClusterClient(). This function is used in CAPI in 4 places:
Of these 4, I'm reasonably certain we're dealing with the reconcileNodeRef case. We can't reconcile the node ref until we have a kubeconfig secret for the workload cluster. And we can't create that secret until cluster.status.apiEndpoints is set. For CAPV, the API endpoints are set either from an annotation on the VSphereCluster, or after the first control plane machine is running. You may be running into a situation where you need to wait a bit longer for the kubeconfig secret to be created (once the API endpoints are set).
Thanks for the detailed explanation. IIRC, I waited for more than an hour when this happened the last time, but I have kicked off another test just to make sure I waited long enough. Also, as I mentioned earlier, adding a namespace to the YAML (as per the PR) immediately creates the secret, so I'm wondering what could be different in that case.
Most, if not all, of our controllers resync every 10 minutes, so if it takes more than ~15 minutes and you aren't seeing action, you can assume something's broken (so you don't have to wait an hour).
When you have the namespace in there, you're doing the same workflow, right? You're not creating everything, then adding the namespace after & doing another kubectl apply?
FYI here's the code that does nothing until cluster.status.apiEndpoints is filled in: https://github.com/kubernetes-sigs/cluster-api/blob/fd95764184347e36c296a43df0b3cd740cb50947/controllers/cluster_controller_phases.go#L179-L199
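Paraphrased, that gate works roughly as in the sketch below (hypothetical stand-in types, not the actual CAPI source): the kubeconfig reconciler returns early while cluster.status.apiEndpoints is empty, so the secret can only appear after CAPV reports an endpoint, either from the annotation or from the first running control plane machine.
package main

import "fmt"

// Hypothetical, stripped-down stand-ins for the v1alpha2 shapes involved;
// this paraphrases the linked gate, it is not the actual CAPI source.
type APIEndpoint struct {
	Host string
	Port int
}

type ClusterStatus struct {
	APIEndpoints []APIEndpoint
}

// shouldCreateKubeconfigSecret captures the idea behind the linked code: the
// <cluster-name>-kubeconfig secret is only generated once the infrastructure
// provider (CAPV here) has reported at least one API endpoint.
func shouldCreateKubeconfigSecret(status ClusterStatus) bool {
	return len(status.APIEndpoints) > 0
}

func main() {
	// Before CAPV sets the endpoints: nothing to do, the controller just requeues.
	fmt.Println(shouldCreateKubeconfigSecret(ClusterStatus{})) // false
	// After the first control plane machine is running (host/port values are made up):
	fmt.Println(shouldCreateKubeconfigSecret(ClusterStatus{
		APIEndpoints: []APIEndpoint{{Host: "10.0.0.10", Port: 6443}},
	})) // true
}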
If you want to collect two sets of logs - 1 without namespace, 1 with - maybe something will stand out.
Yes, it's the same workflow. I have two sets of machine.yaml, one with the namespace and one without. So the flow looks like:
kubectl apply -f cluster.yaml
kubectl apply -f machine.yaml
Also, the other test I kicked off has now been in the same state for >15 minutes
root@cli-vm:~# k get secrets
NAME TYPE DATA AGE
capi-quickstart-ca Opaque 2 29m
capi-quickstart-etcd Opaque 2 29m
capi-quickstart-proxy Opaque 2 29m
capi-quickstart-sa Opaque 2 29m
default-token-wnfhn kubernetes.io/service-account-token 3 3d11h
I will attach the logs where the secret generation is successful with the namespace.
Ok for the one that's hanging, can you kubectl describe AND kubectl get ... -o json the cluster, machine, kubeadmconfig, vspherecluster, and vspheremachine?
Here are all the artifacts for both with and without the namespace
without_namespace.zip
with_namespace.zip
Here are the steps I follow and the diff between the two YAMLs. This shows that the capi-quickstart-kubeconfig secret is created in ~2-3 minutes when the YAML has the namespace.
root@cli-vm:~# k apply -f cluster.yaml
cluster.cluster.x-k8s.io/capi-quickstart created
vspherecluster.infrastructure.cluster.x-k8s.io/capi-quickstart created
root@cli-vm:~# k get clusters
NAME PHASE
capi-quickstart provisioned
root@cli-vm:~# k apply -f machine.yaml
machine.cluster.x-k8s.io/capi-quickstart-controlplane-0 created
vspheremachine.infrastructure.cluster.x-k8s.io/capi-quickstart-controlplane-0 created
kubeadmconfig.bootstrap.cluster.x-k8s.io/capi-quickstart-controlplane-0 created
root@cli-vm:~# date
Mon Feb 24 18:52:37 UTC 2020
root@cli-vm:~# k get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 provisioning
root@cli-vm:~# k get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 provisioning
root@cli-vm:~# k get secrets
NAME TYPE DATA AGE
capi-quickstart-ca Opaque 2 64s
capi-quickstart-etcd Opaque 2 64s
capi-quickstart-proxy Opaque 2 64s
capi-quickstart-sa Opaque 2 64s
default-token-wnfhn kubernetes.io/service-account-token 3 3d12h
root@cli-vm:~# date
Mon Feb 24 18:53:45 UTC 2020
root@cli-vm:~# k get secrets
NAME TYPE DATA AGE
capi-quickstart-ca Opaque 2 2m3s
capi-quickstart-etcd Opaque 2 2m3s
capi-quickstart-kubeconfig Opaque 1 18s
capi-quickstart-proxy Opaque 2 2m3s
capi-quickstart-sa Opaque 2 2m3s
default-token-wnfhn kubernetes.io/service-account-token 3 3d12h
root@cli-vm:~# k get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 vsphere://4225813c-abfc-edbd-7238-4a64f022b1ac provisioned
root@cli-vm:~# diff -u machine.yaml machine_no_ns.yaml
--- machine.yaml 2020-02-21 23:03:15.705103428 +0000
+++ machine_no_ns.yaml 2020-02-21 23:27:32.473861681 +0000
@@ -13,7 +13,6 @@
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha2
kind: KubeadmConfig
name: capi-quickstart-controlplane-0
- namespace: default
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
Do you have the cluster json & describe?
Ok, this is a code issue in CAPV:
That really should be machine.Namespace. That is why adding namespace to machine.spec.bootstrap.configRef fixes the problem. @detiber @yastij @randomvariable do you want to fix this in CAPV?
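As a minimal, self-contained sketch of the difference (hypothetical mirror types, not the real CAPV source): when the manifest omits the namespace on bootstrap.configRef, the lookup key ends up with an empty namespace, which is why the KubeadmConfig appears to be missing even though it exists; using machine.Namespace instead makes the lookup land in the right place.
package main

import "fmt"

// Hypothetical mirrors of the fields involved; not the real CAPV types.
type ObjectReference struct {
	Name      string
	Namespace string // optional in the manifest, so often empty
}

type Machine struct {
	Namespace string
	ConfigRef ObjectReference // stands in for machine.Spec.Bootstrap.ConfigRef
}

// buggyKey trusts the optional namespace on the reference. With the quick
// start YAML that namespace is empty, so the KubeadmConfig is looked up in
// the "" namespace and reported as not found.
func buggyKey(m Machine) (namespace, name string) {
	return m.ConfigRef.Namespace, m.ConfigRef.Name
}

// fixedKey uses the Machine's own namespace, since the KubeadmConfig lives
// alongside the Machine - this is what "really should be machine.Namespace" means.
func fixedKey(m Machine) (namespace, name string) {
	return m.Namespace, m.ConfigRef.Name
}

func main() {
	m := Machine{
		Namespace: "default",
		ConfigRef: ObjectReference{Name: "capi-quickstart-controlplane-0"}, // namespace omitted, as in the quick start
	}

	ns, name := buggyKey(m)
	fmt.Printf("buggy lookup: %q/%q\n", ns, name) // ""/"capi-quickstart-controlplane-0" -> not found

	ns, name = fixedKey(m)
	fmt.Printf("fixed lookup: %q/%q\n", ns, name) // "default"/"capi-quickstart-controlplane-0"
}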
Good to know. I can take a stab at fixing this in CAPV if that is alright. I can create an issue there and continue this discussion, as I would like to understand the difference as well. In the meanwhile, does temporarily fixing the docs with PR #2407 sound good?
I'd rather not, as that makes it so the QuickStart only works in the default namespace. Let's fix CAPV to resolve this issue.
Noted. https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/769 has been added to our milestone for the next RC release on Wed 26, and will be backported.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Closing given that the CAPV issue has been closed, feel free to reopen if necessary.
/close
@vincepri: Closing this issue.