Charts: incubator/etcd scaling and recovery not working

Created on 18 Feb 2017 · 12 Comments · Source: helm/charts

The etcd cluster does not recover if a pod is deleted.
The issue seems to be with nodes rejoining under the same name.

Steps to reproduce:

  1. Deploy etcd from incubator (needs a manual change from PetSet to StatefulSet).
  2. Check that the cluster is healthy:
> $ kubectl exec -it factual-crocodile-etcd-0 etcdctl cluster-health
member 6a7fe15c528ef50d is healthy: got healthy result from http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
member 8167cc828aa0f298 is healthy: got healthy result from http://factual-crocodile-etcd-1.factual-crocodile-etcd:2379
member b3d79057b17efe5f is healthy: got healthy result from http://factual-crocodile-etcd-2.factual-crocodile-etcd:2379
cluster is healthy
  3. Delete any pod:
> $ kubectl delete pod factual-crocodile-etcd-0
pod "factual-crocodile-etcd-0" deleted
  4. Watch its logs after it is recreated:
> $ kubectl logs -f factual-crocodile-etcd-0
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
sh: al-crocodile-etcd-0: bad number
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-0.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-1.factual-crocodile-etcd to come up
Waiting for factual-crocodile-etcd-2.factual-crocodile-etcd to come up
2017-02-18 14:44:40.028499 I | etcdmain: etcd Version: 2.2.5
2017-02-18 14:44:40.028612 I | etcdmain: Git SHA: bc9ddf2
2017-02-18 14:44:40.028617 I | etcdmain: Go Version: go1.5.3
2017-02-18 14:44:40.028623 I | etcdmain: Go OS/Arch: linux/amd64
2017-02-18 14:44:40.028632 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-02-18 14:44:40.033004 I | etcdmain: listening for peers on http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380
2017-02-18 14:44:40.033062 I | etcdmain: listening for client requests on http://127.0.0.1:2379
2017-02-18 14:44:40.033838 I | etcdmain: listening for client requests on http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
2017-02-18 14:44:40.052563 I | netutil: resolving factual-crocodile-etcd-0.factual-crocodile-etcd:2380 to 10.244.2.125:2380
2017-02-18 14:44:40.053037 I | netutil: resolving factual-crocodile-etcd-0.factual-crocodile-etcd:2380 to 10.244.2.125:2380
2017-02-18 14:44:40.054740 I | etcdserver: name = factual-crocodile-etcd-0
2017-02-18 14:44:40.054756 I | etcdserver: data dir = /var/run/etcd/default.etcd
2017-02-18 14:44:40.054762 I | etcdserver: member dir = /var/run/etcd/default.etcd/member
2017-02-18 14:44:40.054766 I | etcdserver: heartbeat = 100ms
2017-02-18 14:44:40.054769 I | etcdserver: election = 1000ms
2017-02-18 14:44:40.054773 I | etcdserver: snapshot count = 10000
2017-02-18 14:44:40.054901 I | etcdserver: advertise client URLs = http://factual-crocodile-etcd-0.factual-crocodile-etcd:2379
2017-02-18 14:44:40.054911 I | etcdserver: initial advertise peer URLs = http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380
2017-02-18 14:44:40.054925 I | etcdserver: initial cluster = factual-crocodile-etcd-0=http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380,factual-crocodile-etcd-1=http://factual-crocodile-etcd-1.factual-crocodile-etcd:2380,factual-crocodile-etcd-2=http://factual-crocodile-etcd-2.factual-crocodile-etcd:2380
2017-02-18 14:44:40.084529 I | etcdserver: starting member 6a7fe15c528ef50d in cluster d141f24a7d5c19f9
2017-02-18 14:44:40.084623 I | raft: 6a7fe15c528ef50d became follower at term 0
2017-02-18 14:44:40.084638 I | raft: newRaft 6a7fe15c528ef50d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2017-02-18 14:44:40.084645 I | raft: 6a7fe15c528ef50d became follower at term 1
2017-02-18 14:44:40.100290 E | rafthttp: failed to dial 8167cc828aa0f298 on stream MsgApp v2 (the member has been permanently removed from the cluster)
2017-02-18 14:44:40.100367 E | rafthttp: failed to dial 8167cc828aa0f298 on stream Message (the member has been permanently removed from the cluster)
2017-02-18 14:44:40.101500 I | etcdserver: starting server... [version: 2.2.5, cluster version: to_be_decided]
2017-02-18 14:44:40.102363 E | etcdserver: the member has been permanently removed from the cluster
2017-02-18 14:44:40.102389 I | etcdserver: the data-dir used by this member must be removed.
2017-02-18 14:44:40.102800 E | rafthttp: failed to dial b3d79057b17efe5f on stream MsgApp v2 (net/http: request canceled while waiting for connection)
2017-02-18 14:44:40.103008 E | rafthttp: failed to dial b3d79057b17efe5f on stream Message (net/http: request canceled while waiting for connection)
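
The log above shows the underlying problem: the recreated pod regenerates the same member identity (6a7fe15c528ef50d), but that member has already been permanently removed from the cluster, so etcd refuses to serve and asks for its data dir to be removed. Below is a rough manual-recovery sketch, not the chart's own logic; it assumes the etcd v2 etcdctl shown in the log and the pod/service names of this example.

# 1. From a healthy member, check membership and re-register the node under
#    its original name and peer URL:
$ kubectl exec factual-crocodile-etcd-1 -- etcdctl member list
$ kubectl exec factual-crocodile-etcd-1 -- etcdctl member add factual-crocodile-etcd-0 http://factual-crocodile-etcd-0.factual-crocodile-etcd:2380

# 2. On the broken pod, wipe the stale data dir, which is exactly what
#    "the data-dir used by this member must be removed" is asking for:
$ kubectl exec factual-crocodile-etcd-0 -- rm -rf /var/run/etcd/default.etcd

# 3. Restart the pod. Whether it then joins cleanly depends on the chart's
#    start script using --initial-cluster-state=existing for a re-added
#    member, which is an assumption here:
$ kubectl delete pod factual-crocodile-etcd-0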

On the next restart, it shows only this:

> $ kubectl logs -f factual-crocodile-etcd-0
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory
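
The /var/run/etcd/member_id path in that error comes from the chart's launch script, which apparently caches the member's ID in that file and reads it back when re-joining; the recreated pod seems to have come up with a fresh volume, so the file is gone and the re-join path has nothing to work with. A quick diagnostic sketch, assuming a healthy replica writes that file after its first successful join (an assumption, not something confirmed here):

# compare a surviving pod with the broken one
$ kubectl exec factual-crocodile-etcd-1 -- cat /var/run/etcd/member_id   # should print that member's hex ID
$ kubectl exec factual-crocodile-etcd-0 -- ls /var/run/etcd              # member_id (and default.etcd) missing or freshly created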

Meanwhile, the logs of the other nodes show something like this:

2017-02-18 14:44:40.098380 W | rafthttp: rejected the stream from peer 6a7fe15c528ef50d since it was removed
2017-02-18 14:44:40.098469 W | rafthttp: rejected the stream from peer 6a7fe15c528ef50d since it was removed

Scaling down and up has a similar problem: after a scale-down it is not possible to scale back up, because the new pod is unable to rejoin the cluster.
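
For etcd itself, a node that is going away for good is supposed to be deregistered with etcdctl member remove before its pod disappears. A hedged manual sketch of a scale-down, reusing the release name and member IDs from the example above (the chart may already attempt some of this on its own):

$ kubectl exec factual-crocodile-etcd-0 -- etcdctl member list
# remove the highest-ordinal member first (factual-crocodile-etcd-2, ID b3d79057b17efe5f above)
$ kubectl exec factual-crocodile-etcd-0 -- etcdctl member remove b3d79057b17efe5f
# then drop the last pod by scaling the release down
$ helm upgrade factual-crocodile incubator/etcd --set replicas=2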


All 12 comments

Is this still an issue?

As far as I can see, no changes were made to the chart, so yes, it is.

I am experiencing a similar issue.

Re-joining etcd member
client: etcd cluster is unavailable or misconfigured

I have a 3-node etcd cluster on top of a 3-node GKE container cluster on preemptible nodes. By design I expect to lose etcd pods every now and then, have new ones spun up, and have the cluster recover.

It's not happening.

I don't know etcd well enough to understand and fix the problem. I would, however, like it to work.

I'm happy to help collect logs/data if someone with more etcd background is interested in taking a stab at a fix.

@lwolf @phyrwork @lachie83 This PR should fix that: https://github.com/kubernetes/charts/pull/2864

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

seems to always be an issue...

and it's still an issue :-(

Waiting for etcd-0.etcd to come up
Waiting for etcd-1.etcd to come up
Waiting for etcd-2.etcd to come up
ping: bad address 'etcd-2.etcd'
Waiting for etcd-2.etcd to come up
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

(Just copied from my cluster)

I get this in AWS, but not in GKE.

For anyone landing here looking for a workaround: in my case, all it took to get the pods to join the cluster was to delete all of them. When they restart, they join each other and form the cluster again.
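
For reference, that "delete all of them" workaround is just the following (pod names follow the example release at the top of this issue; adjust for your own, and note that depending on how the data dir is backed this may discard the cluster's data):

$ kubectl delete pod factual-crocodile-etcd-0 factual-crocodile-etcd-1 factual-crocodile-etcd-2
# the StatefulSet recreates the pods, and per the comment above they
# bootstrap a working cluster together again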

Workaround for this without losing your data or recreating your whole cluster:
Use helm to scale your cluster down by one node:
helm upgrade etcd incubator/etcd --set replicas=2
Wait a few minutes and all nodes will do a rolling restart.
Then scale it back up and voilà :)
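
For completeness, the scale-back-up step, assuming the original size was 3 replicas as in the examples above, followed by the same health check used at the top of this issue (pod name here matches the logs a few comments up; adjust for your release):

# scale back to the original replica count
helm upgrade etcd incubator/etcd --set replicas=3
# verify from any member pod once the rollout settles
kubectl exec etcd-0 etcdctl cluster-health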
