What steps did you take and what happened:
After following the instructions in the quick start guide and using the Machine object definition from the vSphere usage section, the VM is created successfully, but the capi-controller-manager logs clearly show that it is unable to retrieve the cluster's kubeconfig secret:
root@cli-vm:~# k logs capi-controller-manager-6c64c695bb-gwkj5 -n capi-system
E0220 15:36:11.664898 1 controller.go:218] controller-runtime/controller "msg"="Reconciler error"
"error"="failed to retrieve kubeconfig secret for Cluster \"capi-quickstart\" in namespace \"default\": Secret
\"capi-quickstart-kubeconfig\" not found" "controller"="machine" "request"=
{"Namespace":"default","Name":"capi-quickstart-controlplane-0"}
From the capv-controller-manager logs, the error message says KubeadmConfig.bootstrap.cluster.x-k8s.io "capi-quickstart-controlplane-0" not found, which is very strange as the object is created successfully.
k logs -n capv-system capv-controller-manager-88f646758-pr6wj
E0220 15:27:23.411515 1 controller.go:218] controller-runtime/controller "msg"="Reconciler error"
"error"="failed to reconcile API endpoints for VSphereCluster default/capi-quickstart: failed to get
KubeadmConfig capi-quickstart-controlplane-0/ for Machine default//capi-quickstart-controlplane-0:
KubeadmConfig.bootstrap.cluster.x-k8s.io \"capi-quickstart-controlplane-0\" not found"
"controller"="vspherecluster" "request"={"Namespace":"default","Name":"capi-quickstart"}
What did you expect to happen:
Once the cluster and machine objects are created, the <cluster-name>-kubeconfig secret should also be created successfully.
Anything else you would like to add:
After looking at the error message, I could see that the KubeadmConfig reference in the quick start guide's Machine object is missing the namespace property, which the controller's kubeadmConfigKey lookup expects to find in machine.Spec.Bootstrap.ConfigRef.Namespace.
Adding the namespace resolved the issue. I'll submit a PR to fix the docs to include the namespace, but since the namespace is not a required field, I expected the kubeconfig secret to be generated successfully even without it.
Environment:
Kubernetes version (kubectl version): 1.16.3
OS (/etc/os-release): Ubuntu
/kind bug
/kind documentation
Are you testing with v1alpha2 or v1alpha3 (master)?
@ncdc this is v1alpha2 as per the quick start guide.
Ok, the reason I was asking is because your PR is fixing master, which is for v1alpha3.
Things should be working without this change. Maybe you could post logs from all your controllers?
Ah, I missed that v1alpha3 points to master. I can share the logs, as this issue is easily reproducible in the environment I'm working on. Adding the namespace fixes it.
Log files + cluster and machine yaml capi-issue-2403.zip
Log files
root@cli-vm:~# k logs capv-controller-manager-88f646758-pr6wj -n capv-system > capv-controller-manager.log
root@cli-vm:~# k logs -n capi-system capi-controller-manager-6c64c695bb-jh4xj > capi-controller-manager.log
root@cli-vm:~# k logs -n cabpk-system cabpk-controller-manager-c58d8596f-bpfqd -c manager > cabpk-controller-manager.log
root@cli-vm:~# k logs -n cabpk-system cabpk-controller-manager-c58d8596f-bpfqd -c kube-rbac-proxy > cabpk-controller-rbac-proxy.log
Machine provisioned successfully
root@cli-vm:~# k get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 vsphere://4225f8f6-95a0-5f66-82fe-7912a24232ad provisioned
kubeconfig secret not generated
root@cli-vm:~# k get secrets
NAME TYPE DATA AGE
capi-quickstart-ca Opaque 2 3m27s
capi-quickstart-etcd Opaque 2 3m27s
capi-quickstart-proxy Opaque 2 3m27s
capi-quickstart-sa Opaque 2 3m27s
default-token-wnfhn kubernetes.io/service-account-token 3 3d9h
KubeadmConfig object is present
root@cli-vm:~# k get kubeadmconfig
NAME AGE
capi-quickstart-controlplane-0 3m36s
OK, here is what I believe is happening:
From the capv-controller-manager logs, the error message says KubeadmConfig.bootstrap.cluster.x-k8s.io "capi-quickstart-controlplane-0" not found, which is very strange as the object is created successfully.
This is a temporary "error" that is logged because the KubeadmConfig is not in the controller's cache for a period of time. It eventually goes away. Safe to ignore.
controller-runtime/controller "msg"="Reconciler error" "error"="failed to retrieve kubeconfig secret for Cluster \"capi-quickstart\" in namespace \"default\": Secret \"capi-quickstart-kubeconfig\" not found" "controller"="machine" "request"= {"Namespace":"default","Name":"capi-quickstart-controlplane-0"}
Note the case of "kubeconfig" in "failed to retrieve kubeconfig secret" - it's all lowercase. This comes from remote.NewClusterClient(). This function is used in CAPI in 4 places:
Of these 4, I'm reasonably certain we're dealing with the reconcileNodeRef case. We can't reconcile the node ref until we have a kubeconfig secret for the workload cluster. And we can't create that secret until cluster.status.apiEndpoints is set. For CAPV, the API endpoints are set either from an annotation on the VSphereCluster, or after the first control plane machine is running. You may be running into a situation where you need to wait a bit longer for the kubeconfig secret to be created (once the API endpoints are set).
Thanks for the detailed explanation. IIRC, I waited for more than an hour when this happened the last time, but I have kicked off another test just to make sure I waited long enough. Also, as I mentioned earlier, adding a namespace to the YAML (as per the PR) immediately creates the secret, so I'm wondering what could be different in that case.
Most, if not all, of our controllers resync every 10 minutes, so if it takes more than ~15 minutes and you aren't seeing action, you can assume something's broken (so you don't have to wait an hour).
When you have the namespace in there, you're doing the same workflow, right? You're not creating everything, then adding the namespace after & doing another kubectl apply?
FYI here's the code that does nothing until cluster.status.apiEndpoints is filled in: https://github.com/kubernetes-sigs/cluster-api/blob/fd95764184347e36c296a43df0b3cd740cb50947/controllers/cluster_controller_phases.go#L179-L199
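Paraphrased, that gate works roughly as in the sketch below (hypothetical stand-in types, not the actual CAPI source): the kubeconfig reconciler returns early while cluster.status.apiEndpoints is empty, so the secret can only appear after CAPV reports an endpoint, either from the annotation or from the first running control plane machine.
package main

import "fmt"

// Hypothetical, stripped-down stand-ins for the v1alpha2 shapes involved;
// this paraphrases the linked gate, it is not the actual CAPI source.
type APIEndpoint struct {
	Host string
	Port int
}

type ClusterStatus struct {
	APIEndpoints []APIEndpoint
}

// shouldCreateKubeconfigSecret captures the idea behind the linked code: the
// <cluster-name>-kubeconfig secret is only generated once the infrastructure
// provider (CAPV here) has reported at least one API endpoint.
func shouldCreateKubeconfigSecret(status ClusterStatus) bool {
	return len(status.APIEndpoints) > 0
}

func main() {
	// Before CAPV sets the endpoints: nothing to do, the controller just requeues.
	fmt.Println(shouldCreateKubeconfigSecret(ClusterStatus{})) // false
	// After the first control plane machine is running (host/port values are made up):
	fmt.Println(shouldCreateKubeconfigSecret(ClusterStatus{
		APIEndpoints: []APIEndpoint{{Host: "10.0.0.10", Port: 6443}},
	})) // true
}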
If you want to collect two sets of logs - 1 without namespace, 1 with - maybe something will stand out.
Yes, it's the same workflow. I have two sets of machine.yaml, one with the namespace and one without. So the flow looks like:
kubectl apply -f cluster.yaml
kubectl apply -f machine.yaml
Also, the other test I kicked off has now been in the same state for >15 minutes
root@cli-vm:~# k get secrets
NAME TYPE DATA AGE
capi-quickstart-ca Opaque 2 29m
capi-quickstart-etcd Opaque 2 29m
capi-quickstart-proxy Opaque 2 29m
capi-quickstart-sa Opaque 2 29m
default-token-wnfhn kubernetes.io/service-account-token 3 3d11h
I will attach the logs where the secret generation is successful with the namespace.
Ok for the one that's hanging, can you kubectl describe AND kubectl get ... -o json the cluster, machine, kubeadmconfig, vspherecluster, and vspheremachine?
Here are all the artifacts for both with and without the namespace
without_namespace.zip
with_namespace.zip
Here are the steps I follow and the diff between the two YAMLs. This shows that the capi-quickstart-kubeconfig secret is created in ~2-3 minutes when the YAML has the namespace.
root@cli-vm:~# k apply -f cluster.yaml
cluster.cluster.x-k8s.io/capi-quickstart created
vspherecluster.infrastructure.cluster.x-k8s.io/capi-quickstart created
root@cli-vm:~# k get clusters
NAME PHASE
capi-quickstart provisioned
root@cli-vm:~# k apply -f machine.yaml
machine.cluster.x-k8s.io/capi-quickstart-controlplane-0 created
vspheremachine.infrastructure.cluster.x-k8s.io/capi-quickstart-controlplane-0 created
kubeadmconfig.bootstrap.cluster.x-k8s.io/capi-quickstart-controlplane-0 created
root@cli-vm:~# date
Mon Feb 24 18:52:37 UTC 2020
root@cli-vm:~# k get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 provisioning
root@cli-vm:~# k get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 provisioning
root@cli-vm:~# k get secrets
NAME TYPE DATA AGE
capi-quickstart-ca Opaque 2 64s
capi-quickstart-etcd Opaque 2 64s
capi-quickstart-proxy Opaque 2 64s
capi-quickstart-sa Opaque 2 64s
default-token-wnfhn kubernetes.io/service-account-token 3 3d12h
root@cli-vm:~# date
Mon Feb 24 18:53:45 UTC 2020
root@cli-vm:~# k get secrets
NAME TYPE DATA AGE
capi-quickstart-ca Opaque 2 2m3s
capi-quickstart-etcd Opaque 2 2m3s
capi-quickstart-kubeconfig Opaque 1 18s
capi-quickstart-proxy Opaque 2 2m3s
capi-quickstart-sa Opaque 2 2m3s
default-token-wnfhn kubernetes.io/service-account-token 3 3d12h
root@cli-vm:~# k get machine
NAME PROVIDERID PHASE
capi-quickstart-controlplane-0 vsphere://4225813c-abfc-edbd-7238-4a64f022b1ac provisioned
root@cli-vm:~# diff -u machine.yaml machine_no_ns.yaml
--- machine.yaml 2020-02-21 23:03:15.705103428 +0000
+++ machine_no_ns.yaml 2020-02-21 23:27:32.473861681 +0000
@@ -13,7 +13,6 @@
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha2
kind: KubeadmConfig
name: capi-quickstart-controlplane-0
- namespace: default
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
Do you have the cluster json & describe?
Ok, this is a code issue in CAPV:
That really should be machine.Namespace. That is why adding namespace to machine.spec.bootstrap.configRef fixes the problem. @detiber @yastij @randomvariable do you want to fix this in CAPV?
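As a minimal, self-contained sketch of the difference (hypothetical mirror types, not the real CAPV source): when the manifest omits the namespace on bootstrap.configRef, the lookup key ends up with an empty namespace, which is why the KubeadmConfig appears to be missing even though it exists; using machine.Namespace instead makes the lookup land in the right place.
package main

import "fmt"

// Hypothetical mirrors of the fields involved; not the real CAPV types.
type ObjectReference struct {
	Name      string
	Namespace string // optional in the manifest, so often empty
}

type Machine struct {
	Namespace string
	ConfigRef ObjectReference // stands in for machine.Spec.Bootstrap.ConfigRef
}

// buggyKey trusts the optional namespace on the reference. With the quick
// start YAML that namespace is empty, so the KubeadmConfig is looked up in
// the "" namespace and reported as not found.
func buggyKey(m Machine) (namespace, name string) {
	return m.ConfigRef.Namespace, m.ConfigRef.Name
}

// fixedKey uses the Machine's own namespace, since the KubeadmConfig lives
// alongside the Machine - this is what "really should be machine.Namespace" means.
func fixedKey(m Machine) (namespace, name string) {
	return m.Namespace, m.ConfigRef.Name
}

func main() {
	m := Machine{
		Namespace: "default",
		ConfigRef: ObjectReference{Name: "capi-quickstart-controlplane-0"}, // namespace omitted, as in the quick start
	}

	ns, name := buggyKey(m)
	fmt.Printf("buggy lookup: %q/%q\n", ns, name) // ""/"capi-quickstart-controlplane-0" -> not found

	ns, name = fixedKey(m)
	fmt.Printf("fixed lookup: %q/%q\n", ns, name) // "default"/"capi-quickstart-controlplane-0"
}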
Good to know. I can take a stab at fixing this in CAPV if that is alright. I can create an issue there and continue this discussion, as I would like to understand the difference as well. In the meanwhile, does temporarily fixing the docs with PR #2407 sound good?
I'd rather not, as that makes it so the QuickStart only works in the default namespace. Let's fix CAPV to resolve this issue.
Noted. https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/769 has been added to our milestone for the next RC release on Wed 26, and will be backported.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Closing given that the CAPV issue has been closed, feel free to reopen if necessary.
/close
@vincepri: Closing this issue.