Is this a request for help?:
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
Version of Helm and Kubernetes:
Helm 2.8.1
Kubernetes: 1.9.3
Which chart:
Patroni 0.8.0
What happened:
For development purposes I'm running etcd on spot instances in AWS. That sometimes results in nodes being killed and restarted.
Once etcd dies, it fails to rejoin the cluster. It appears that there is state stored on the persistent volume that doesn't match after the reboot.
The logs show:
```
Re-joining etcd member
client: etcd cluster is unavailable or misconfigured
```
This means my Postgres servers remain in replica state, with none of them writable.
What you expected to happen:
The etcd member should rejoin the cluster and the Postgres database should remain writable.
How to reproduce it (as minimally and precisely as possible):
Run kubectl delete pod on one or more of the etcd pods.
I've seen the etcd servers cycle in different patterns in this case, so I'm not sure whether one or two etcd members were down at a time, but at least one always remained running.
Anything else we need to know:
This leaves the cluster in a difficult state. What seems to fix the problem is to scale the StatefulSet down to 0, remove the persistent volumes, and then scale the StatefulSet back up, as sketched below.
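For reference, the recovery looks roughly like this. The StatefulSet name and the PVC label selector below are assumptions based on my release; adjust them for yours.

```shell
# Scale the etcd StatefulSet to zero so no member is holding the stale state
kubectl scale statefulset patroni-etcd --replicas=0

# Delete the PVCs backing the etcd data directories
# (the label selector is an assumption; match whatever labels your release uses)
kubectl delete pvc -l app=etcd,release=patroni

# Scale back up so the members bootstrap a fresh cluster
kubectl scale statefulset patroni-etcd --replicas=3
```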
I'm running into the same issue; it's reasonably easy to replicate. The chart was deployed on a Kubernetes cluster in Microsoft Azure (not ACS/AKS) with:
```shell
helm install --name patroni incubator/patroni \
  --set credentials.superuser="$(< /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c32)" \
  --set credentials.admin="$(< /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c32)" \
  --set credentials.standby="$(< /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c32)" \
  --set persistentVolume.size=20Gi
```
Once the deployment is complete, the following pods are visible:
```
patroni-0        1/1   Running   0   6m
patroni-1        1/1   Running   0   5m
patroni-2        1/1   Running   0   3m
patroni-3        1/1   Running   0   2m
patroni-4        1/1   Running   0   1m
patroni-etcd-0   1/1   Running   0   6m
patroni-etcd-1   1/1   Running   0   5m
patroni-etcd-2   1/1   Running   0   3m
```
Deleting patroni-etcd-0 results in the following log output and a crash loop. The cluster remains up during this time but is in a semi-degraded state.
```
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory
```
In order to get etcd back to normal we need to scale it back to zero replicas, delete its PVCs and then scale back up. During this time traffic is denied to Postgres.
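For what it's worth, a quick way to see what survived on the volume is to list the data path on an affected member (catching the crash-looping pod between restarts, or checking a still-running one). The pod name and path below just follow the output above:

```shell
# Inspect what is left of the etcd state on the volume; in our case
# default.etcd survives the restart but member_id does not.
kubectl exec patroni-etcd-0 -- ls -la /var/run/etcd
```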
I'll spend a bit more time hacking on this to see if I can come up with a solution.
After digging deeper: Patroni uses the etcd Helm chart to launch etcd, so this should really be filed as a bug against the etcd chart rather than Patroni.
@daryllstrauss correct, the issue seems to be with
https://github.com/kubernetes/charts/blob/10ec54a4bcb1c9f5e81fd763a41ce5660c089acf/incubator/etcd/templates/statefulset.yaml#L104-L106
It seems that if the pod is deleted, /var/run/etcd/default.etcd still exists in some cases, even though the preStop lifecycle hook calls rm -rf /var/run/etcd/*.
If I were to guess, etcd writes something back to /var/run/etcd/default.etcd when it terminates, regardless of its data having already been deleted. I'll keep hacking away and put in a pull request when I have a viable solution, if that's agreeable? Seems like a simple fix.
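To make the idea concrete, here is roughly the shape of guard I have in mind for the container's startup script. Only the paths and the "Re-joining etcd member" message come from the logs above; the rest is a sketch, not the chart's actual code:

```shell
# Sketch only: take the re-join path only when both the data directory and the
# member_id file survived the restart; otherwise clean up and join as a new
# member so a missing /var/run/etcd/member_id doesn't cause a crash loop.
if [ -e /var/run/etcd/default.etcd ] && [ -f /var/run/etcd/member_id ]; then
    echo "Re-joining etcd member"
    member_id=$(cat /var/run/etcd/member_id)
    # ...update this member's peer URL and start etcd on the existing data dir
else
    # ...remove any stale data dir and add this pod to the cluster as a new member
    rm -rf /var/run/etcd/default.etcd
fi
```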
The above pull request may also fix #4883, but I haven't tested it.
FYI it still happens (frequently!)
Try using a persistentVolume; that should solve the problem.
@duckzhouya it happens even when we enable persistentVolume: https://github.com/hmcts/cnp-flux-config/blob/master/k8s/osba/etcd.yaml
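For anyone else trying this, enabling persistence on the etcd dependency looks roughly like the following. The value names are assumptions based on the incubator/etcd chart (and on the dependency being exposed under an etcd key in the Patroni chart), so check the chart's values.yaml before relying on them:

```shell
# Assumed value names; verify against the etcd chart's values.yaml
helm upgrade patroni incubator/patroni \
  --set etcd.persistentVolume.enabled=true \
  --set etcd.persistentVolume.storage=8Gi
```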