etcd join race
I did find #1319 #2001 #1793
BUG REPORT
kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:18:19Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:19:15Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.4-1+d5ee6cddf7e896", GitCommit:"d5ee6cddf7e896fb8556cad24a610df657ecd824", GitTreeState:"clean", BuildDate:"2019-10-03T22:16:41Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Kernel (e.g. uname -a): This should happen in any env if our analysis is valid.
For setting up a 3 (or multi)-node cluster, we understand that the etcd learner feature is coming, but we can't refactor our code to adopt it for now, and we really need to do this concurrently. So we are wondering if any workaround/suggestion can be offered.
We use kubeadm to bootstrap a 3-control-plane cluster; however, we need to inject some customization along the way, so we are calling the join phases one by one instead of simply running kubeadm join.
This issue only happens when adding the 3rd member.
When we call this command on the 3rd node
kubeadm join phase control-plane-join etcd
Very rarely, we observed that the generated etcd manifest (/etc/kubernetes/manifests/etcd.yaml) has an incorrect --initial-cluster value.
Assuming etcd-0 is the first member, etcd-1 the second and etcd-2 the third, a correct --initial-cluster value for etcd-2 might look like this:
--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-1=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380
However, in this rare case, we are getting something like this
--initial-cluster=etcd-0=https://192.168.0.1:2380,etcd-2=https://192.168.0.2:2380,etcd-2=https://192.168.0.3:2380
Basically, the name of etcd-1 was incorrectly configured as etcd-2. This incorrect manifest causes the etcd container to fail to start and complain:
etcdmain: error validating peerURLs {"ClusterID":"31c63dd3d7c3da6a","Members":[{"ID":"1b98ed58f9be3e7d","RaftAttributes":{"PeerURLs":["https://192.168.0.2:2380"]},"Attributes":{"Name":"etcd-1","ClientURLs":["https://192.168.0.2:2379"]}},{"ID":"6d631ff1c84da117","RaftAttributes":{"PeerURLs":["https://192.168.0.3:2380"]},"Attributes":{"Name":"","ClientURLs":[]}},{"ID":"f0c11b3401371571","RaftAttributes":{"PeerURLs":["https://192.168.0.1:2380"]},"Attributes":{"Name":"etcd-0","ClientURLs":["https://192.168.0.1:2379"]}}],"RemovedMemberIDs":[]}: member count is unequal\n","stream":"stderr","time":"2020-01-08T17:27:52.63704563Z"}
We think this error occurs because the manifest's --initial-cluster has only 2 unique names while the etcd cluster actually has 3 members.
We spent some time tracking the code to see what could be the issue and we have a theory here.
kubeadm join phase control-plane-join etcd
Then the above command calls this:
It then calls etcdClient.AddMember()
func (c *Client) AddMember(name string, peerAddrs string)
Here name is the current master's Name and peerAddrs is the current master's peer URL.
Then in L290: resp, err = cli.MemberAdd(ctx, []string{peerAddrs})
it calls the real MemberAdd which will return a []Member that includes the currently-being-added one.
So the response of this MemberAdd() will have all previous members and current member.
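For illustration, here is a minimal sketch of that call using the etcd clientv3 API that kubeadm wraps (import paths differ between etcd versions; this is not the kubeadm source verbatim):

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/etcdserver/etcdserverpb"
)

// addMember registers peerAddrs with the cluster and returns the post-add
// member list. Members that have not started yet, including the one just
// added, come back with an empty Name.
func addMember(cli *clientv3.Client, peerAddrs string) ([]*etcdserverpb.Member, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	resp, err := cli.MemberAdd(ctx, []string{peerAddrs})
	if err != nil {
		return nil, err
	}
	// resp.Member is the member that was just added; resp.Members is the full list.
	return resp.Members, nil
}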
Once AddMember() receives the response
for _, m := range resp.Members {
// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
if m.Name == "" {
ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
} else {
ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}
}
Here resp is the response from MemberAdd() as described above, and this section inserts the given Name for members that do not have one. We think the expectation is that the currently-being-added member is the only one without a Name: the code loops over resp.Members, finds the member without a Name, and sets name as that member's Name.
But in our case resp.Members contained 3 members (because this happens on the 3rd member), and 2 of them somehow did not have a "Name": the 2nd member had just joined the etcd cluster, but its etcd container was still coming up. In that state, "etcdctl member list" returns something like:
1b98ed58f9be3e7d, started, etcd-0, https://20.20.0.37:2380, https://192.168.0.1:2379
6d631ff1c84da117, unstarted, , https://192.168.0.2:2380, <-- this is etcd-1, but not started yet so no Name
In this case, 2 out of the 3 Members returned do not have a Name, hence the above for loop assigned the 3rd member's Name (etcd-2) to both the 2nd and the 3rd Member.
We concluded that this issue only happens if the 3rd member runs MemberAdd while the 2nd member is not yet started, which makes it a race.
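To make the failure concrete, here is a small self-contained sketch (simplified types and example values only, not the kubeadm source) that runs the same loop against a member list shaped like the one described above:

package main

import "fmt"

// Member is a simplified stand-in for kubeadm's etcd Member type.
type Member struct {
	Name    string
	PeerURL string
}

func main() {
	// Simulated resp.Members: the 2nd member has been added but its etcd
	// container has not started yet (empty Name), and the 3rd member is the
	// one being added right now (also empty Name).
	respMembers := []Member{
		{Name: "etcd-0", PeerURL: "https://192.168.0.1:2380"},
		{Name: "", PeerURL: "https://192.168.0.2:2380"}, // etcd-1, unstarted
		{Name: "", PeerURL: "https://192.168.0.3:2380"}, // the joining member
	}
	name := "etcd-2" // name of the joining (3rd) member

	// same logic as the kubeadm loop above
	ret := []Member{}
	for _, m := range respMembers {
		if m.Name == "" {
			ret = append(ret, Member{Name: name, PeerURL: m.PeerURL})
		} else {
			ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURL})
		}
	}
	fmt.Println(ret)
	// prints both 192.168.0.2 and 192.168.0.3 with the name etcd-2,
	// i.e. the duplicated etcd-2 entry from the bad --initial-cluster above.
}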
For this ticket, what we expect is that the generated etcd manifest should have a correct --initial-cluster value.
Note that this happens really rarely; the frequency is probably 1 in 1000.
It's worth noting that the following is speculation:
In this case, 2 out of the 3 Members returned do not have a Name, hence the above for loop assigned the 3rd member's Name (etcd-2) to both the 2nd and the 3rd Member.
It's also worth noting that we're not sure what the contract is with etcd for the names of the active peers returned from the MemberAdd call.
I don't think we can perform the 3rd member add before the second member has started and synced, as the cluster wouldn't have quorum. I'd assume that means it has, by the time the 3rd member is added, updated the 2nd member's name. So either (assuming we're correct about the cause):
Finally, it's worth noting that we do not yet have verbose logging for the issue, and that the logging in kubeadm currently will not allow us to directly confirm this hypothesis as we never log the actual values returned by etcd at any level.
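For reference, a hypothetical logging addition (not present in kubeadm today; the verbosity level is an arbitrary choice) that would capture the raw MemberAdd response:

import "k8s.io/klog"

// inside AddMember, right after cli.MemberAdd succeeds:
for _, m := range resp.Members {
	klog.V(1).Infof("etcd MemberAdd response member: ID=%x Name=%q PeerURLs=%v", m.ID, m.Name, m.PeerURLs)
}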
thanks for logging the issue.
/kind bug
/priority important-longterm
/assign
@echu23 thanks for the detailed analysis
I don't have a simple way to validate your analysis, but your thesis seems reasonable to me.
I see a possible fix for this by changing the loop on resp.Members:
for _, m := range resp.Members {
// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
if m.Name == "" {
if m.PeerURLs[0] == peerAddrs {
ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
}
} else {
ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}
}
With this change we are going to discard any member that has an empty name and a PeerURL different from the peer URL of the joining node (so the 192.168.0.2:2380 entry in your example would be dropped).
If the resulting list has two members instead of three it should not be a problem for the new etcd node, because it will be informed about the third member by the raft protocol after joining.
If instead the resulting list has only one member this will be a problem, but according to your analysis this can't happen (and it seems reasonable to me because join starts only after the first control plane/etcd is up and running).
@echu23 @neolit123 opinions?
If we agree on the solution I can send a patch so we can check that this does not introduce a regression, but for the real parallel join test we should rely on @echu23's infrastructure.
can also be:
for _, m := range resp.Members {
// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
if m.Name == "" && m.PeerURLs[0] == peerAddrs {
ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
continue
}
ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}
In this case, 2 out of the 3 Members returned do not have a Name, hence the above for loop assigned the 3rd member's Name (etcd-2) to both the 2nd and the 3rd Member.
from my understanding @fabriziopandini proposal can fix this problem.
it will set only one of the names (instead of 2, which is the bug) and then hopefully the other call to etcdClient.AddMember() from the 3rd member will set its name too.
after this change if the code is still flaky for other reasons, we have the option to fully serialize the etcd member add with a lock managed by kubeadm.
the alternative is to just wait until we move to etcd 3.5 and see how stable the learners are.
but something to note about etcd learners is that they are currently limited to a maximum of 1, and it is not clear when multiple learners will be supported:
https://github.com/etcd-io/etcd/issues/11401
etcdadm is managing this slightly differently:
https://github.com/kubernetes-sigs/etcdadm/blob/master/cmd/join.go#L106-L135
post "add", it obtains the current member and other members from the response.
then appends to the initial cluster based on ID, which seems preferable.
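Roughly, that looks like the following sketch (paraphrased rather than copied from etcdadm): match the joining member by the ID etcd returns in resp.Member instead of relying on an empty Name, so other unstarted members cannot be mistaken for it.

// resp is the clientv3 MemberAddResponse.
addedID := resp.Member.ID

ret := []Member{}
for _, m := range resp.Members {
	if m.ID == addedID {
		// the member we just added has no name in etcd yet; use the one we were given
		ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
		continue
	}
	ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}

Note that other unstarted members would still come back unnamed, which is what the ID-as-name fallback discussed later in this thread addresses.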
Just another quick question to confirm the intention of the code. From this:
ret := []Member{}
for _, m := range resp.Members {
// fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
if m.Name == "" {
ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
} else {
ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
}
}
Does it expect that resp.Members must have one and only one member that does not have a Name (which should be the currently-being-added member)? Is this a true statement?
Is it even possible to have a resp.Members that contains more than one Member that does not have a Name?
I am asking because I am trying to reproduce this issue using kind, and I was not able to reproduce it.
Of course as I mentioned, this only happens maybe 1/1000. I will keep trying.
My reproduction flow using kind is this:
1. Create a kind cluster; kind-control-plane is always the first member.
2. Pick kind-control-plane2, or 3, or whichever node has been joined.
3. Remove /etc/kubernetes/manifests/etcd.yaml on it, which in turn will shut down the etcd container on this member.
4. Run etcdctl member remove and then etcdctl member add again; this will put that member back in the unstarted state.
5. Observe if the issue reproduces.
So 2 questions here:
1. Does it expect that resp.Members must have one and only one member that does not have a Name (which should be the currently-being-added member)? Is this a true statement?
2. Is it even possible to have a resp.Members that contains more than one Member that does not have a Name?
as you pointed out with the discoveries in the OP, it seems like more than one member without a name can occur in resp.Members.
Do you have any suggestion on how to reproduce this issue?
might be easier with kinder:
https://github.com/kubernetes/kubeadm/tree/master/kinder
build kinder using go > 1.13
go build
and call:
./kinder test workflow ./ci/workflows/regular-1.15.yaml --dry-run
this will give you a set of commands.
You can then reuse the kinder do join command, but note that kinder create cluster allows a custom number of CP and worker nodes; kinder do join docker execs kubeadm join on all remaining CP nodes.

Use kind with custom config file to deploy a 10-CP cluster.
you seem to have a setup to envy.
but IMO it is better to have an ODD number of etcd/CP nodes.
@neolit123 @echu23 if I got this thread right we agree on the proposed change? Can I send the patch (with the @neolit123 variant)?
So the proposed change is to discard the member if the member has no Name and its peerURLs do not match the current one, right? This means that if this issue happens, the --initial-cluster in etcd.yaml would have only 2 members: the first member and the current member. If this will still configure a valid etcd cluster, I'm totally fine with it as long as it fixes the issue and forms a healthy etcd cluster.
@fabriziopandini
if I got this thread right we agree on the proposed change?
the way etcdadm does the matching with member IDs, is overall better:
https://github.com/kubernetes-sigs/etcdadm/blob/master/cmd/join.go#L106-L135
but i don't see why matching peerAddrs shouldn't be fine too.
@echu23
So the proposed change is to discard the member if the member has no Name and its peerURLs do not match the current one, right?
yes.
This means that if this issue happens, the --initial-cluster in etcd.yaml would have only 2 members: the first member and the current member. If this will still configure a valid etcd cluster, I'm totally fine with it as long as it fixes the issue and forms a healthy etcd cluster.
if the 2nd and 3rd members join at the same time, MemberAdd on the second CP might return 3 members, 2 of which don't have a name (the 2nd and 3rd).
with the proposed patch the initial cluster on the second CP will end up only with the first and 2nd member.
but because this is racy, i think the same can happen for the MemberAdd on the 3rd CP node. if it also obtains 3 members, 2 of which don't have a name it will also end up with an initial cluster with 2 members.
so the etcd bootstrap can look like this:
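CP1 --initial-cluster: etcd-member1
CP2 --initial-cluster: etcd-member1, etcd-member2
CP3 --initial-cluster: etcd-member1, etcd-member3
i.e. CP2 and CP3 each end up missing the other concurrently-joining member.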
Ok, so this will form a healthy cluster, right? I'm totally fine with this.
i think it has side effects. i can try simulating this tomorrow.
@fabriziopandini have you experimented with this?
if the initial cluster does not matter, we can always do this in kubeadm:
CP1: etcd-member1
CP2: etcd-member1, etcd-member2
CP3: etcd-member1, etcd-member3
CPN: etcd-member1, etcd-memberN
yet something is telling me it's not right.
also i would like to bring something from the etcd docs:
If adding multiple members the best practice is to configure a single member at a time and verify it starts correctly before adding more new members.
https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/runtime-configuration.md
i.e. concurrent join is by design not supported...
Yeah, we are aware that concurrent etcd join is not recommended. However, if this issue is legit, it can also happen for a sequential join.
Imagine we are configuring a 3-node etcd cluster: the 1st and 2nd joins were successful, but when joining the 3rd, the 2nd etcd is somehow unstarted. In this case, should the 3rd join pass or fail?
To make an existing member unstarted, we can first shut down its etcd container, then remove that member and add it back.
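As a sketch of that repro step (example endpoints only; the TLS client configuration needed for a kubeadm-managed etcd is omitted for brevity), the remove-then-re-add can be driven with the etcd clientv3 API once the member's container is stopped:

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// peer URL of the member we want to leave in the "unstarted" state (example value)
	const target = "https://192.168.0.2:2380"

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://192.168.0.1:2379"}, // a healthy member's client URL (example value)
		DialTimeout: 20 * time.Second,
		// TLS: etcd client certs would be required here for a kubeadm-managed cluster
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	list, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range list.Members {
		if len(m.PeerURLs) > 0 && m.PeerURLs[0] == target {
			// remove the (stopped) member, then re-register its peer URL;
			// it now shows up in "member list" as unstarted, with an empty name
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				log.Fatal(err)
			}
			if _, err := cli.MemberAdd(ctx, []string{target}); err != nil {
				log.Fatal(err)
			}
			log.Printf("member %x removed and re-added as unstarted", m.ID)
		}
	}
}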
I am still not fully convinced that my theory is true because I can't reproduce it. But I also cannot find another theory...
Imagine we are configuring a 3-node etcd cluster: the 1st and 2nd joins were successful, but when joining the 3rd, the 2nd etcd is somehow unstarted. In this case, should the 3rd join pass or fail?
as the docs say:
verify it starts correctly before adding more new members
so if the 2nd fails the 3rd should not be added.
I am still not fully convinced that my theory is true because I can't reproduce it. But I also cannot find another theory...
i think it's valid.
will double check this tomorrow:
https://github.com/kubernetes/kubeadm/issues/2005#issuecomment-575953236
i experimented and observed the following:
joined sequentially two kubeadm CP nodes to an existing CP node, with a modified version of kubeadm.
the modified version simulated the case of two members not having names:
CP1: etcd-member1
CP2: etcd-member1, etcd-member2
CP3: etcd-member1, etcd-member3
it makes it so that CP3 only includes the 1st and 3rd member.
but the etcd Pod on CP3 fails with the following:
2020-01-19 18:13:45.785514 C | etcdmain: error validating peerURLs {ClusterID:5af0857ece1ce0e5 Members:[&{ID:2ef094c895837ef3 RaftAttributes:{PeerURLs:[https://172.17.0.3:2380]} Attributes:{Name: ClientURLs:[]}} &{ID:95047bc387b5a81a RaftAttributes:{PeerURLs:[https://172.17.0.4:2380]} Attributes:{Name:kinder-regular-control-plane-2 ClientURLs:[https://172.17.0.4:2379]}} &{ID:952f31ff200093ba RaftAttributes:{PeerURLs:[https://172.17.0.5:2380]} Attributes:{Name:kinder-regular-control-plane-1 ClientURLs:[https://172.17.0.5:2379]}}] RemovedMemberIDs:[]}: member count is unequal
so it has to include all members:
CP3: etcd-member1, etcd-member2, etcd-member3
i tried investigating what is considered a valid --initial-cluster, and thus i ended up with the following change:
ret := []Member{}
for _, m := range resp.Members {
	// the joining member: etcd has no name for it yet, so use the name passed by the caller
	if peerAddrs == m.PeerURLs[0] {
		ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
		continue
	}
	// every other member: use its hex-encoded ID as the name, so the entry is always unique
	ret = append(ret, Member{Name: strconv.FormatUint(m.ID, 16), PeerURL: m.PeerURLs[0]})
}
this ends doing the following:
$ docker exec kinder-regular-control-plane-1 cat /etc/kubernetes/manifests/etcd.yaml | grep initial-cluster
- --initial-cluster=kinder-regular-control-plane-1=https://172.17.0.5:2380
$ docker exec kinder-regular-control-plane-2 cat /etc/kubernetes/manifests/etcd.yaml | grep initial-cluster
- --initial-cluster=kinder-regular-control-plane-2=https://172.17.0.4:2380,952f31ff200093ba=https://172.17.0.5:2380
$ docker exec kinder-regular-control-plane-3 cat /etc/kubernetes/manifests/etcd.yaml | grep initial-cluster
- --initial-cluster=kinder-regular-control-plane-3=https://172.17.0.3:2380,7db52fe1fd991808=https://172.17.0.4:2380,952f31ff200093ba=https://172.17.0.5:2380
the list of members still respects the --name passed when each etcd instance is started:
etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/kubernetes/pki/etcd/ca.crt --cert-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key-file=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
4216c54c1f9153b4: name=kinder-regular-control-plane-3 peerURLs=https://172.17.0.3:2380 clientURLs=https://172.17.0.3:2379 isLeader=false
7db52fe1fd991808: name=kinder-regular-control-plane-2 peerURLs=https://172.17.0.4:2380 clientURLs=https://172.17.0.4:2379 isLeader=false
952f31ff200093ba: name=kinder-regular-control-plane-1 peerURLs=https://172.17.0.5:2380 clientURLs=https://172.17.0.5:2379 isLeader=true
so this is my proposal to work around the problem, but it introduces a minor change in the way kubeadm writes the etcd.yaml.
also i've only tested this with a total of 5 etcd members, 4 joining concurrently.
however on my setup i'm seeing other problems, which happen quite often:
"level":"warn","ts":"2020-01-19T20:20:33.300Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-875a285e-36a5-4221-a4f8-88d9a56ec293/172.17.0.5:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2020-01-19T20:20:36.533Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-875a285e-36a5-4221-a4f8-88d9a56ec293/172.17.0.5:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
etcdserver: re-configuration failed due to not enough started members
error creating local etcd static pod manifest file
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runEtcdPhase
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go:144
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:236
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
rpc error: code = Unavailable desc = etcdserver: leader changed
error downloading the secret
k8s.io/kubernetes/cmd/kubeadm/app/phases/copycerts.DownloadCerts
/home/lubo-it/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/copycerts/copycerts.go:227
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runControlPlanePrepareDownloadCertsPhaseLocal
the following patch makes the 4-member concurrent join more reliable:
diff --git a/cmd/kubeadm/app/util/etcd/etcd.go b/cmd/kubeadm/app/util/etcd/etcd.go
index 9d3c6be046b..0e5ad4434e8 100644
--- a/cmd/kubeadm/app/util/etcd/etcd.go
+++ b/cmd/kubeadm/app/util/etcd/etcd.go
@@ -38,11 +38,11 @@ import (
"k8s.io/kubernetes/cmd/kubeadm/app/util/config"
)
-const etcdTimeout = 2 * time.Second
+const etcdTimeout = 20 * time.Second
// Exponential backoff for etcd operations
var etcdBackoff = wait.Backoff{
- Steps: 9,
+ Steps: 16,
Duration: 50 * time.Millisecond,
Factor: 2.0,
Jitter: 0.1,
@@ -130,7 +130,7 @@ func NewFromCluster(client clientset.Interface, certificatesDir string) (*Client
// dialTimeout is the timeout for failing to establish a connection.
// It is set to 20 seconds as times shorter than that will cause TLS connections to fail
// on heavily loaded arm64 CPUs (issue #64649)
-const dialTimeout = 20 * time.Second
+const dialTimeout = 40 * time.Second
// Sync synchronizes client's endpoints with the known endpoints from the etcd membership.
func (c *Client) Sync() error {
@@ -303,12 +303,11 @@ func (c *Client) AddMember(name string, peerAddrs string) ([]Member, error) {
// Returns the updated list of etcd members
ret := []Member{}
for _, m := range resp.Members {
- // fixes the entry for the joining member (that doesn't have a name set in the initialCluster returned by etcd)
- if m.Name == "" {
- ret = append(ret, Member{Name: name, PeerURL: m.PeerURLs[0]})
- } else {
- ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
+ if peerAddrs == m.PeerURLs[0] {
+ ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
+ continue
}
+ ret = append(ret, Member{Name: strconv.FormatUint(m.ID, 16), PeerURL: m.PeerURLs[0]})
}
// Add the new member client address to the list of endpoints
Yes, this proposal is what we had thought of as well: there only need to be 3 unique Names. We are ok with this change if no regression is introduced. Thanks for the quick turnaround.
@neolit123 I'm ok as well, but If possible I would apply the random name only if the name is empty (so the created manifest remain unchanged for all the cases except the race condition discussed in this thread)
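In code, that variant might look roughly like this (a sketch of the shape being discussed here, not the merged patch verbatim):

ret := []Member{}
for _, m := range resp.Members {
	switch {
	case peerAddrs == m.PeerURLs[0]:
		// the member we are adding right now; etcd has no name for it yet
		ret = append(ret, Member{Name: name, PeerURL: peerAddrs})
	case m.Name == "":
		// a registered-but-unstarted member; fall back to its hex ID so the
		// entry still gets a unique name (only the race case is affected)
		ret = append(ret, Member{Name: strconv.FormatUint(m.ID, 16), PeerURL: m.PeerURLs[0]})
	default:
		// a started member keeps the name etcd reports, so the manifest is
		// unchanged in the normal (non-racy) case
		ret = append(ret, Member{Name: m.Name, PeerURL: m.PeerURLs[0]})
	}
}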
@fabriziopandini
this seems reasonable, given the potential rarity of the case, even if the "ID-names for all other members" approach is more deterministic.
it's unfortunate that given the nature of etcd bootstrap our etcd.yaml is overall non-deterministic WRT the --initial-cluster flag.
/lifecycle active
PR is here https://github.com/kubernetes/kubernetes/pull/87505