Cluster-api: Creation of workload cluster machines stuck at provisioned status. Cloud-init preflight failing to fetch the ConfigMap

Created on 21 Oct 2019 · 22 comments · Source: kubernetes-sigs/cluster-api

/kind bug

What steps did you take and what happened:
Refiling https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/624 with the Kubeadm Bootstrap provider.

  • Successfully created management cluster.
  • Successfully created workload cluster control plane.
  • Not successful in creating workload cluster workers.
  • Workload cluster machines don't progress from the provisioned state to the running state when there are 3 replicas.
  • Sometimes one worker machine will progress to a running state, but the remaining two stay in a provisioned state.
  • No issues when starting with 1 replica and then scaling to 3. Have not tried 2 replicas or 4+.

What did you expect to happen:
All three machines to go to a running state

Anything else you would like to add:

# kubectl get machines -o wide
NAME                                                 PROVIDERID                                       PHASE         NODENAME
keithlee-capi-mgmt-cluster-controlplane-0            vsphere://42055c30-7f18-6d98-d8a5-cd00de95fd77   running       keithlee-capi-mgmt-cluster-controlplane-0
keithlee-workload-cluster-01-controlplane-0          vsphere://42052918-be7a-9e96-0f4a-555150de0f13   running       keithlee-workload-cluster-01-controlplane-0
keithlee-workload-cluster-01-md-0-759c657695-2899z   vsphere://4205ebe7-2636-d0a4-694e-08b19de37f98   provisioned
keithlee-workload-cluster-01-md-0-759c657695-97c7c   vsphere://420577c1-10a5-ea5f-8e94-5b21a318b0bf   provisioned
keithlee-workload-cluster-01-md-0-759c657695-p2vsg   vsphere://4205288a-5565-5a1f-82a9-d67e2d89235c   provisioned
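
For anyone triaging this, a small sketch for spotting machines stuck short of running. It assumes the two-column name/phase shape of the output above; the sample data is hard-coded here, but against a live cluster you would feed in something like `kubectl get machines --no-headers` instead:

```shell
# Sample name/phase pairs taken from the table above; in a live cluster,
# replace this with: machines=$(kubectl get machines --no-headers -o wide ...)
machines='keithlee-workload-cluster-01-md-0-759c657695-2899z provisioned
keithlee-workload-cluster-01-md-0-759c657695-97c7c provisioned
keithlee-workload-cluster-01-md-0-759c657695-p2vsg provisioned
keithlee-capi-mgmt-cluster-controlplane-0 running'

# Print the name of every machine whose phase is not "running".
stuck=$(printf '%s\n' "$machines" | awk '$2 != "running" {print $1}')
printf '%s\n' "$stuck"
```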

Environment:

  • capi manifest: v0.5.2-beta.0
  • clusterctl: v0.2.5
  • kind: v0.5.1
  • docker: 19.0.3
  • vSphere: 6.7U3
  • ova: ubuntu-1804-kube-v1.15.4.ova

capi.log
capv.log

cc @KeithRichardLee

Labels: area/bootstrap, kind/bug, priority/awaiting-more-evidence

All 22 comments

/area bootstrap

It sounds like this might be a kubeadm or apiserver issue at the core? (Not that we can't code around it - just trying to clarify.)

Yeah, it could be a few things. It could be an apiserver bug (probably not very likely, though). If so, the mitigation is probably to make sure your infra provider's machine concurrency level is 1, or maybe we add some delay somewhere. It could be a kubeadm bug (same mitigation re infra machine concurrency). If it's neither of those, then it's probably CABPK, in which case we'd correct it there. Definitely an interesting one to root cause!

> If so, the mitigation is probably to make sure your infra provider's machine concurrency level is 1,

To my knowledge CAPV is still pretty serial, taking action only when we receive a signal that there's bootstrap data to process. We aren't shuffling anything into goroutines, last I checked.

Yep. It's something I've considered implementing in CAPV, but so far I haven't seen a good reason for it. We'll probably add it sooner rather than later and just default the option to one.

I'm wondering if this could be related to the bootstrap token timing out. Which version of CABPK is this against?

CABPK v0.1.2+ contains the bootstrap token refresh code (https://github.com/kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm/pull/250, https://github.com/kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm/pull/267).
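
A quick way to check which CABPK version a management cluster is actually running is to read the controller image tag off the deployment. This is only a sketch: the namespace shown in the comment is an assumption that may differ per install, and the image reference below is illustrative (not the real CABPK image path), so the tag parsing is demonstrated against a hard-coded string:

```shell
# In a live cluster you would fetch the image reference first, e.g.:
#   image=$(kubectl -n cabpk-system get deploy -o jsonpath='{..image}')
# (namespace is an assumption; adjust to your install)
image="example.gcr.io/cabpk-controller:v0.1.0"  # illustrative image reference

# Everything after the last ':' in the image reference is the tag.
version="${image##*:}"
echo "$version"
# The bootstrap token refresh fix landed in v0.1.2, so anything older is suspect.
```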

Tagging @yastij as an assignee from CAPV's side since he was on the original issue.

/assign @yastij

From what @akutz saw, we're still on 0.1.0. I'll try bumping CABPK and see if I can reproduce.

/priority awaiting-more-evidence

Let me know if there is anything ye wish for me to test as I can consistently reproduce this.

@KeithRichardLee thanks! Can you change your CABPK deployment to v0.1.4 and test with a new cluster?
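
One way to do that bump without regenerating everything is to rewrite the image tag in the saved provider manifest and re-apply it. A minimal sketch, with the caveat that the image path and manifest filename below are illustrative placeholders, not the exact names from a real CAPV install:

```shell
# Rewrite the CABPK controller image tag in a saved manifest line.
# (image path is illustrative; match it against your actual manifest)
manifest='        image: example.gcr.io/cabpk-controller:v0.1.0'
updated=$(printf '%s\n' "$manifest" | sed 's|cabpk-controller:v[0-9.]*|cabpk-controller:v0.1.4|')
printf '%s\n' "$updated"
# Then re-apply against the management cluster, e.g.:
#   kubectl apply -f provider-components.yaml
```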

FWIW, a change just merged to CAPV master that includes CAPI v0.2.6 and CABPK v0.1.4. To generate manifests for that version you can use the following command:

docker run --rm \
  -v "$(pwd)":/out \
  gcr.io/cluster-api-provider-vsphere/ci/manifests:v0.5.2-beta.0-32-gb0cacda1 \
  --help

Additionally, CAPV v0.5.2-beta.1 is being tagged tomorrow morning and will also have this change, with CAPV v0.5.2 scheduled for release this Friday.

Giving this a spin now...

Please note you'll still need to follow the CAPV Getting Started guide and provide the env vars file. I removed it from the CLI example above because mounting it would prevent the online help from printing when no envvars.txt file is present in the working directory. The full example would be:

docker run --rm \
  -v "$(pwd)":/out \
  -v "$(pwd)/envvars.txt":/envvars.txt:ro \
  gcr.io/cluster-api-provider-vsphere/ci/manifests:v0.5.2-beta.0-32-gb0cacda1 \
  -c management-cluster

First run using v0.5.2-beta.0-32-gb0cacda1 was a success. All three machines progressed to a running state. Will do another two runs.

3 successful runs now complete!

Same here:

  • 1 -> 3 replicas completed
  • 3 replicas completed

Runs: 8

I think we can close this

/close

@detiber: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
