/kind bug
What steps did you take and what happened:
Refiling https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/624 with the Kubeadm Bootstrap provider.
What did you expect to happen:
All three machines to go to a running state
Anything else you would like to add:
# kubectl get machines -o wide
NAME                                                 PROVIDERID                                       PHASE         NODENAME
keithlee-capi-mgmt-cluster-controlplane-0            vsphere://42055c30-7f18-6d98-d8a5-cd00de95fd77   running       keithlee-capi-mgmt-cluster-controlplane-0
keithlee-workload-cluster-01-controlplane-0          vsphere://42052918-be7a-9e96-0f4a-555150de0f13   running       keithlee-workload-cluster-01-controlplane-0
keithlee-workload-cluster-01-md-0-759c657695-2899z   vsphere://4205ebe7-2636-d0a4-694e-08b19de37f98   provisioned
keithlee-workload-cluster-01-md-0-759c657695-97c7c   vsphere://420577c1-10a5-ea5f-8e94-5b21a318b0bf   provisioned
keithlee-workload-cluster-01-md-0-759c657695-p2vsg   vsphere://4205288a-5565-5a1f-82a9-d67e2d89235c   provisioned
Environment:
cc @KeithRichardLee
/area bootstrap
It sounds like this might be a kubeadm or apiserver issue at the core? (Not that we can't code around it - just trying to clarify.)
Yeah, it could vary. It could be an apiserver bug (though that seems unlikely). If so, the mitigation is probably to make sure your infra provider's machine concurrency level is 1, or maybe we add some delay somewhere. It could be a kubeadm bug (same mitigation regarding infra machine concurrency). If it's neither of those, then it's probably CABPK, in which case we'd correct it there. Definitely an interesting one to root cause!
To my knowledge CAPV is still pretty serial, taking action only when we receive signal there's bootstrap data to process. We aren't shuffling anything into goroutines last I checked.
Gotcha. Some providers allow the user to control the number of simultaneous workers per controller. For example: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/1728e1d752e84ec1dbd3a525c436d0181588e1f9/main.go#L102-L106 and https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/1728e1d752e84ec1dbd3a525c436d0181588e1f9/main.go#L162.
Yep. It's something I've considered implementing in CAPV, but so far I haven't seen a good reason for it. We'll probably add it sooner than later and just default the option to one.
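For anyone following along, the worker-count knob those CAPA lines expose maps to controller-runtime's MaxConcurrentReconciles option. Here is a minimal sketch of that pattern; the flag name and reconciler are illustrative rather than CAPV's or CAPA's actual code, and the signatures assume a controller-runtime release from that era (pre-context Reconcile, channel-based Start):
package main

import (
	"flag"
	"os"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// noopReconciler stands in for a provider's machine reconciler.
type noopReconciler struct{ client.Client }

func (r *noopReconciler) Reconcile(req reconcile.Request) (reconcile.Result, error) {
	return reconcile.Result{}, nil
}

func main() {
	// Defaulting to 1 keeps machine reconciliation effectively serial.
	concurrency := flag.Int("machine-concurrency", 1,
		"number of machines to reconcile simultaneously")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// WithOptions is where the worker count actually takes effect. A real
	// provider would watch its infrastructure Machine type; Pod is used here
	// only so the sketch is self-contained.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: *concurrency}).
		Complete(&noopReconciler{Client: mgr.GetClient()}); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}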
I'm wondering if this could be related to the bootstrap token timing out. Which version of CABPK is this against?
CABPK v0.1.2+ contains the bootstrap token refresh code (https://github.com/kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm/pull/250, https://github.com/kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm/pull/267).
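For context, the refresh logic's job is essentially to keep pushing out the expiration timestamp on the bootstrap token secret in kube-system until the machine has actually joined; otherwise a node that boots slowly tries to join with a token that has already lapsed. A rough sketch of that idea (names, token ID, and client wiring are illustrative, not CABPK's actual code, and the Get/Update calls assume a pre-context client-go):
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// refreshBootstrapToken pushes out the expiration of a kubeadm bootstrap
// token so nodes that have not joined yet can still authenticate with it.
func refreshBootstrapToken(cs kubernetes.Interface, tokenID string, ttl time.Duration) error {
	name := "bootstrap-token-" + tokenID
	secret, err := cs.CoreV1().Secrets("kube-system").Get(name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Bootstrap token secrets store their expiry as an RFC3339 timestamp
	// under the "expiration" key.
	secret.Data["expiration"] = []byte(time.Now().UTC().Add(ttl).Format(time.RFC3339))
	_, err = cs.CoreV1().Secrets("kube-system").Update(secret)
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(config)

	// "abcdef" is a placeholder token ID; a real caller would read it from
	// the workload cluster's bootstrap data.
	if err := refreshBootstrapToken(cs, "abcdef", 15*time.Minute); err != nil {
		fmt.Println("refresh failed:", err)
	}
}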
Tagging @yastij as an assignee from CAPV's side since he was on the original issue.
/assign @yastij
From what @akutz saw, we're still on v0.1.0. I'll try bumping CABPK and see if I can reproduce.
/priority awaiting-more-evidence
Let me know if there is anything ye wish for me to test as I can consistently reproduce this.
@KeithRichardLee thanks! Can you change your CABPK deployment to v0.1.4 and test with a new cluster?
FWIW, a change just merged to CAPV master that includes CAPI v0.2.6 and CABPK v0.1.4. To generate manifests for that version you can use the following command:
docker run --rm \
-v "$(pwd)":/out \
gcr.io/cluster-api-provider-vsphere/ci/manifests:v0.5.2-beta.0-32-gb0cacda1 \
--help
Additionally, CAPV v0.5.2-beta.1 is being tagged tomorrow morning and will also have this change, with CAPV v0.5.2 scheduled for release this Friday.
Giving this a spin now...
Please note you'll still need to follow the CAPV Getting Started guide and provide the env vars file. I omitted it from the CLI example above because, without an envvars.txt file present in the working directory, including that mount would prevent the online help from printing. The full example would be:
docker run --rm \
-v "$(pwd)":/out \
-v "$(pwd)/envvars.txt":/envvars.txt:ro \
gcr.io/cluster-api-provider-vsphere/ci/manifests:v0.5.2-beta.0-32-gb0cacda1 \
-c management-cluster
First run using v0.5.2-beta.0-32-gb0cacda1 was a success. All three machines progressed to a running state. Will do another two runs.
3 successful runs now complete!
Same here:
Runs: 8
I think we can close this
/close
@detiber: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.