Cluster-api: Creation of workload cluster machines stuck at provisioned status. Cloud-init preflight failing to fetch the ConfigMap

Created on 21 Oct 2019 · 22 comments · Source: kubernetes-sigs/cluster-api

/kind bug

What steps did you take and what happened:
Refiling https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/624 with the Kubeadm Bootstrap provider.

  • Successfully created management cluster.
  • Successfully created workload cluster control plane.
  • Not successful in creating workload cluster workers.
  • Workload cluster machines don't progress from the provisioned state to the running state when there are 3 replicas.
  • Sometimes one worker machine will progress to a running state, but the remaining two stay in a provisioned state.
  • No issues when starting with 1 replica and then scaling to 3. Have not tried 2 replicas or 4+.

What did you expect to happen:
All three machines to go to a running state

Anything else you would like to add:

# kubectl get machines -o wide
NAME                                                 PROVIDERID                                       PHASE         NODENAME
keithlee-capi-mgmt-cluster-controlplane-0            vsphere://42055c30-7f18-6d98-d8a5-cd00de95fd77   running       keithlee-capi-mgmt-cluster-controlplane-0
keithlee-workload-cluster-01-controlplane-0          vsphere://42052918-be7a-9e96-0f4a-555150de0f13   running       keithlee-workload-cluster-01-controlplane-0
keithlee-workload-cluster-01-md-0-759c657695-2899z   vsphere://4205ebe7-2636-d0a4-694e-08b19de37f98   provisioned
keithlee-workload-cluster-01-md-0-759c657695-97c7c   vsphere://420577c1-10a5-ea5f-8e94-5b21a318b0bf   provisioned
keithlee-workload-cluster-01-md-0-759c657695-p2vsg   vsphere://4205288a-5565-5a1f-82a9-d67e2d89235c   provisioned
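
For anyone triaging this, a small sketch for spotting machines stuck short of running. It assumes the two-column name/phase shape of the output above; the sample data is hard-coded here, but against a live cluster you would feed in something like `kubectl get machines --no-headers` instead:

```shell
# Sample name/phase pairs taken from the table above; in a live cluster,
# replace this with: machines=$(kubectl get machines --no-headers -o wide ...)
machines='keithlee-workload-cluster-01-md-0-759c657695-2899z provisioned
keithlee-workload-cluster-01-md-0-759c657695-97c7c provisioned
keithlee-workload-cluster-01-md-0-759c657695-p2vsg provisioned
keithlee-capi-mgmt-cluster-controlplane-0 running'

# Print the name of every machine whose phase is not "running".
stuck=$(printf '%s\n' "$machines" | awk '$2 != "running" {print $1}')
printf '%s\n' "$stuck"
```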

Environment:

  • capi manifest: v0.5.2-beta.0
  • clusterctl: v0.2.5
  • kind: v0.5.1
  • docker: 19.0.3
  • vSphere: 6.7U3
  • ova: ubuntu-1804-kube-v1.15.4.ova

capi.log
capv.log

cc @KeithRichardLee

Labels: area/bootstrap, kind/bug, priority/awaiting-more-evidence

All 22 comments

/area bootstrap

It sounds like this might be a kubeadm or apiserver issue at the core? (Not that we can't code around it - just trying to clarify.)

Yeah, it could be a few things. It could be an apiserver bug (probably not very likely, though). If so, the mitigation is probably to make sure your infra provider's machine concurrency level is 1, or maybe we add some delay somewhere. It could be a kubeadm bug (same mitigation re infra machine concurrency). If it's neither of those, then it's probably CABPK, in which case we'd correct it there. Definitely an interesting one to root cause!

> If so, the mitigation is probably to make sure your infra provider's machine concurrency level is 1,

To my knowledge CAPV is still pretty serial, taking action only when we receive a signal that there's bootstrap data to process. We aren't shuffling anything into goroutines, last I checked.

Yep. It's something I've considered implementing in CAPV, but so far I haven't seen a good reason for it. We'll probably add it sooner rather than later and just default the option to one.

I'm wondering if this could be related to the bootstrap token timing out. Which version of CABPK is this against?

CABPK v0.1.2+ contains the bootstrap token refresh code (https://github.com/kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm/pull/250, https://github.com/kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm/pull/267).
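
A quick way to check which CABPK version a management cluster is actually running is to read the controller image tag off the deployment. This is only a sketch: the namespace shown in the comment is an assumption that may differ per install, and the image reference below is illustrative (not the real CABPK image path), so the tag parsing is demonstrated against a hard-coded string:

```shell
# In a live cluster you would fetch the image reference first, e.g.:
#   image=$(kubectl -n cabpk-system get deploy -o jsonpath='{..image}')
# (namespace is an assumption; adjust to your install)
image="example.gcr.io/cabpk-controller:v0.1.0"  # illustrative image reference

# Everything after the last ':' in the image reference is the tag.
version="${image##*:}"
echo "$version"
# The bootstrap token refresh fix landed in v0.1.2, so anything older is suspect.
```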

Tagging @yastij as an assignee from CAPV's side since he was on the original issue.

/assign @yastij

From what @akutz saw, we're still on 0.1.0. I'll try bumping CABPK and see if I can reproduce.

/priority awaiting-more-evidence

Let me know if there is anything ye wish for me to test as I can consistently reproduce this.

@KeithRichardLee thanks! Can you change your CABPK deployment to v0.1.4 and test with a new cluster?
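
One way to do that bump without regenerating everything is to rewrite the image tag in the saved provider manifest and re-apply it. A minimal sketch, with the caveat that the image path and manifest filename below are illustrative placeholders, not the exact names from a real CAPV install:

```shell
# Rewrite the CABPK controller image tag in a saved manifest line.
# (image path is illustrative; match it against your actual manifest)
manifest='        image: example.gcr.io/cabpk-controller:v0.1.0'
updated=$(printf '%s\n' "$manifest" | sed 's|cabpk-controller:v[0-9.]*|cabpk-controller:v0.1.4|')
printf '%s\n' "$updated"
# Then re-apply against the management cluster, e.g.:
#   kubectl apply -f provider-components.yaml
```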

FWIW, a change just merged to CAPV master that includes CAPI v0.2.6 and CABPK v0.1.4. To generate manifests for that version you can use the following command:

docker run --rm \
  -v "$(pwd)":/out \
  gcr.io/cluster-api-provider-vsphere/ci/manifests:v0.5.2-beta.0-32-gb0cacda1 \
  --help

Additionally, CAPV v0.5.2-beta.1 is being tagged tomorrow morning and will also have this change, with CAPV v0.5.2 scheduled for release this Friday.

Giving this a spin now...

Please note you'll still need to follow the CAPV Getting Started guide and provide the env vars file. I removed it from the CLI example above because mounting it would prevent the online help from printing when no envvars.txt file is present in the working directory. The full example would be:

docker run --rm \
  -v "$(pwd)":/out \
  -v "$(pwd)/envvars.txt":/envvars.txt:ro \
  gcr.io/cluster-api-provider-vsphere/ci/manifests:v0.5.2-beta.0-32-gb0cacda1 \
  -c management-cluster

First run using v0.5.2-beta.0-32-gb0cacda1 was a success. All three machines progressed to a running state. Will do another two runs.

3 successful runs now complete!

Same here:

  • 1 -> 3 replicas completed
  • 3 replicas completed

Runs: 8

I think we can close this

/close

@detiber: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
