FEATURE REQUEST
kubeadm version: v1.17.*
While executing Cluster API tests, kubeadm join failures were observed in some cases.
xref https://github.com/kubernetes-sigs/cluster-api/issues/2769
I would like to be able to re-run kubeadm join, have the command complete the join process, and have this behavior guaranteed in future versions.
Most of the phases should already be idempotent. However, this behavior should be verified (especially for the add-new-etcd-member phase).
NB: it is not required that preflight checks be idempotent.
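For illustration, a re-run would ideally succeed with only the non-idempotent preflight checks skipped explicitly. A sketch (the endpoint, token, and hash are placeholders; the check name assumes a non-empty etcd data dir is what fails on the second pass):

```bash
# re-run the same join command, skipping only the preflight check that
# legitimately fails on a second pass (non-empty /var/lib/etcd)
sudo kubeadm join <control-plane-endpoint> \
  --control-plane \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --ignore-preflight-errors=DirAvailable--var-lib-etcd
```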
/assign
I've encountered this, and the only step that caused problems for me was etcd. I aborted a prior attempt while etcd was waiting to connect; etcd then connected anyway, and when I went to re-run kubeadm join I got:
```
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
...
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: etcdserver: Peer URLs already exists
To see the stack trace of this error execute with --v=5 or higher
Failed to join the kubernetes cluster.
```
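To confirm the stale registration, the member list can be inspected from a healthy control-plane node (a sketch assuming the default kubeadm etcd certificate paths and etcd's v3 API):

```bash
# list registered etcd members; the joining node's peer URL appearing here
# is what triggers "etcdserver: Peer URLs already exists"
kubectl -n kube-system exec etcd-<healthy-node-name> -- etcdctl \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  member list
```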
I've worked around this by manually running the final steps:
```bash
kubeadm join phase control-plane-join update-status --config kubeadm.conf
kubeadm join phase control-plane-join mark-control-plane --config kubeadm.conf
```
For context, here's how I ran into this: etcd failed on the host, and the host was automatically removed from the member list. I went to re-add the host by re-running kubeadm join, but forgot to delete etcd's data dir, so etcd got into a failure loop. I then quit kubeadm join to investigate why etcd startup was timing out, saw the notice to delete the data dir, and did so; etcd then joined the member list. When I re-ran kubeadm join I got this error.
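(For reference, the cleanup amounted to something like the following; /var/lib/etcd is kubeadm's default etcd data directory, though it can be overridden in the kubeadm config:)

```bash
# destructive: wipe the stale etcd data directory before re-joining; only
# safe once this node has been removed from the etcd member list
sudo rm -rf /var/lib/etcd
```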
hi, @MikaelSmith
> Peer URLs already exists
i think this error only happens if the new member already exists in the cluster.
> For context, here's how I ran into this: etcd failed on the host, and the host was automatically removed from the member list. I went to re-add the host by re-running kubeadm join, but forgot to delete etcd's data dir, so etcd got into a failure loop.
did you use --ignore-preflight-errors=all when calling kubeadm join?
normally kubeadm will fail if this directory is not empty.
> I then quit kubeadm join to investigate why etcd startup was timing out, saw the notice to delete the data dir, and did so; etcd then joined the member list.
i don't think etcd local storage is something we considered in terms of idempotency, so we might have to revise here.
> When I re-ran kubeadm join I got this error.
to clarify, you got the error from join even after clearing the data directory?
I did use --ignore-preflight-errors=all with the first call to `kubeadm join` (it's part of a script from https://kurl.sh, so I can go back to them about why they ignore them).
Yes, I got the error even after clearing the data directory, I think because the etcd pod still existed from the prior run and succeeded in joining the cluster before I re-ran kubeadm join.
> it's part of a script from https://kurl.sh, so I can go back to them about why they ignore them
i see that this is a core principle of the service:
```bash
curl https://kurl.sh/latest | sudo bash
```
but overall not a recommended security practice.
> Yes, I got the error even after clearing the data directory, I think because the etcd pod still existed from the prior run and succeeded in joining the cluster before I re-ran kubeadm join.
the idea of "join" for the control plane being idempotent is tricky. if you had done a kubeadm reset on that node, i assume the subsequent join would have worked.
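i.e. roughly (the join arguments below are placeholders for whatever was originally used):

```bash
# wipe this node's kubeadm-managed state, then join again from scratch
sudo kubeadm reset -f
sudo kubeadm join <control-plane-endpoint> --control-plane \
  --token <token> --discovery-token-ca-cert-hash sha256:<hash>
```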
@fabriziopandini
i don't think this is feasible for 1.19, since it's unclear to me how we are planning to reuse existing etcd member snapshots. also, preflight checks throw errors for a number of existing folders, so does this mean that to make the join action idempotent one has to skip these checks?
@neolit123 my idea was to simply check whether the etcd member already exists before adding it, so that if the command is executed again the operation becomes a NOP.
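Roughly, in terms of the etcd clientv3 API it would look something like this (a hedged sketch, not the actual PR code; the package and function names are illustrative, and the import path matches the etcd v3.4 client library):

```go
package etcdutil // illustrative package name

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// addMemberIfNotExists makes the "add etcd member" step a NOP when a
// member advertising the same peer URL is already registered.
func addMemberIfNotExists(ctx context.Context, cli *clientv3.Client, peerURL string) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range resp.Members {
		for _, u := range m.PeerURLs {
			if u == peerURL {
				// Already registered (e.g. by an aborted previous join):
				// skip MemberAdd so a re-run does not fail with
				// "etcdserver: Peer URLs already exists".
				return nil
			}
		}
	}
	_, err = cli.MemberAdd(ctx, []string{peerURL})
	return err
}
```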
WDYT about this solution?
ok, makes sense. i will rename this ticket slightly
/retitle kubeadm-join: don't re-add etcd member if it already exists
/lifecycle active
@fabriziopandini
PR is here:
https://github.com/kubernetes/kubernetes/pull/92118
i have not tested if this idea works...
if the member is in the cluster already, the new code should detect that. but my assumption here is that the existing member would be unhealthy, so i don't know if just creating this new server instance, without re-adding the member, would make the existing member healthy.
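one way to probe that from a healthy control-plane node (a sketch, assuming the standard kubeadm etcd certificate paths):

```bash
# check health of every registered member, including the re-joined one
kubectl -n kube-system exec etcd-<healthy-node-name> -- etcdctl \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  endpoint health --cluster
```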
if you know this works for sure, fine by me.
this was cherry-picked, but someone mentioned that there could be an issue WRT how we get the list of etcd members in kubeadm. so let's keep an eye on that, but track it in a new issue.
/close
@neolit123: Closing this issue.