OpenShift installs successfully on CentOS 7 but then stops responding after some time.
Checking the etcd logs by running
/usr/local/bin/master-logs etcd etcd
reveals etcd shutting down over and over with the following message:
pkg/osutil: received terminated signal, shutting down...
@ajarv thanks for the report. To understand what happened we need to see the log files; can you attach them to the issue as a file? Also, if possible it would be good to see the metrics for the failed node. To obtain the metrics you just need to perform a GET against the /metrics endpoint, something like:
curl -s --key /etc/etcd/peer.key --cert /etc/etcd/peer.crt --cacert /etc/etcd/ca.crt https://$IP:2379/metrics > etcd_metrics.log
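If it helps triage, the first things worth checking in that dump are the disk latency histograms (metric names as of etcd 3.2 are etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds), so something like:
grep -E 'etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration_seconds' etcd_metrics.log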
The problem is that etcd itself is shutting down.
Logs:
2018-12-07 15:44:04.120240 I | etcdmain: Git SHA: 1674e682f
2018-12-07 15:44:04.120243 I | etcdmain: Go Version: go1.8.7
2018-12-07 15:44:04.120246 I | etcdmain: Go OS/Arch: linux/amd64
2018-12-07 15:44:04.120249 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2018-12-07 15:44:04.120279 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-12-07 15:44:04.120302 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-12-07 15:44:04.120927 I | embed: listening for peers on https://192.168.0.170:2380
2018-12-07 15:44:04.120967 I | embed: listening for client requests on 192.168.0.170:2379
2018-12-07 15:44:04.124065 I | etcdserver: name = deepak
2018-12-07 15:44:04.124078 I | etcdserver: data dir = /var/lib/etcd/
2018-12-07 15:44:04.124081 I | etcdserver: member dir = /var/lib/etcd/member
2018-12-07 15:44:04.124084 I | etcdserver: heartbeat = 1000ms
2018-12-07 15:44:04.124088 I | etcdserver: election = 5000ms
2018-12-07 15:44:04.124091 I | etcdserver: snapshot count = 100000
2018-12-07 15:44:04.124107 I | etcdserver: advertise client URLs = https://192.168.0.170:2379
2018-12-07 15:44:04.143725 I | etcdserver: restarting member fb72e5827ecf6f9b in cluster 2ec241dd1570096a at commit index 5269
2018-12-07 15:44:04.143974 I | raft: fb72e5827ecf6f9b became follower at term 34
2018-12-07 15:44:04.143991 I | raft: newRaft fb72e5827ecf6f9b [peers: [], term: 34, commit: 5269, applied: 0, lastindex: 5269, lastterm: 34]
2018-12-07 15:44:04.153232 W | auth: simple token is not cryptographically signed
2018-12-07 15:44:04.157302 I | etcdserver: starting server... [version: 3.2.22, cluster version: to_be_decided]
2018-12-07 15:44:04.157812 I | etcdserver/membership: added member fb72e5827ecf6f9b [https://192.168.0.170:2380] to cluster 2ec241dd1570096a
2018-12-07 15:44:04.157893 N | etcdserver/membership: set the initial cluster version to 3.2
2018-12-07 15:44:04.157938 I | etcdserver/api: enabled capabilities for version 3.2
2018-12-07 15:44:04.157993 I | embed: ClientTLS: cert = /etc/etcd/server.crt, key = /etc/etcd/server.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-12-07 15:44:12.144479 I | raft: fb72e5827ecf6f9b is starting a new election at term 34
2018-12-07 15:44:12.144558 I | raft: fb72e5827ecf6f9b became candidate at term 35
2018-12-07 15:44:12.144573 I | raft: fb72e5827ecf6f9b received MsgVoteResp from fb72e5827ecf6f9b at term 35
2018-12-07 15:44:12.144584 I | raft: fb72e5827ecf6f9b became leader at term 35
2018-12-07 15:44:12.144591 I | raft: raft.node: fb72e5827ecf6f9b elected leader fb72e5827ecf6f9b at term 35
2018-12-07 15:44:12.151792 I | etcdserver: published {Name:deepak ClientURLs:[https://192.168.0.170:2379]} to cluster 2ec241dd1570096a
2018-12-07 15:44:12.151803 I | embed: ready to serve client requests
2018-12-07 15:44:12.152019 I | embed: serving client requests on 192.168.0.170:2379
2018-12-07 15:45:13.287671 N | pkg/osutil: received terminated signal, shutting down...
WARNING: 2018/12/07 15:45:13 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 192.168.0.170:2379: getsockopt: connection refused"; Reconnecting to {192.168.0.170:2379 0 <nil>}
2018-12-07 15:45:13.787860 I | etcdserver: skipped leadership transfer for single member cluster
WARNING: 2018/12/07 15:45:13 grpc: addrConn.transportMonitor exits due to: context canceled
@ajarv I don't think this is an etcd issue per se, but more OpenShift. It looks like etcd starts fine but is then killed. Basically, I can reproduce the above by issuing master-restart etcd while tailing the logs with master-logs etcd etcd -f, so for some reason this is in a crash loop. Maybe the liveness probe is messed up?
/etc/origin/node/pods/etcd.yaml
Can you issue the liveness probe command against etcd manually? If that is the cause, etcd should run for about 45 seconds before being killed off.
master-restart etcd
etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health
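To watch it continuously rather than one shot at a time, a simple loop (same certs and $IP as above) works:
while true; do echo $(date); etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health; sleep 5; done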
Sam, thanks a lot for providing useful commands.
Very likely an issue with the liveness probe.
etcd does respond with a healthy cluster status for a while, then crashes, and then becomes live again:
Fri Dec 7 13:13:38 EST 2018
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.0.170:2379: getsockopt: connection refused
error #0: dial tcp 192.168.0.170:2379: getsockopt: connection refused
[root@deepak bin]# echo $(date) ; etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health
Fri Dec 7 13:13:39 EST 2018
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.0.170:2379 exceeded header timeout
error #0: client: endpoint https://192.168.0.170:2379 exceeded header timeout
[root@deepak bin]# echo $(date) ; etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health
Fri Dec 7 13:13:43 EST 2018
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.0.170:2379 exceeded header timeout
error #0: client: endpoint https://192.168.0.170:2379 exceeded header timeout
[root@deepak bin]# echo $(date) ; etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health
Fri Dec 7 13:13:46 EST 2018
member fb72e5827ecf6f9b is healthy: got healthy result from https://192.168.0.170:2379
cluster is healthy
Let me tweak the liveness probe.
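For reference, the knobs live in the livenessProbe stanza of the static pod manifest (/etc/origin/node/pods/etcd.yaml). A rough sketch of the relevant fields, with illustrative values rather than the shipped defaults:
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://192.168.0.170:2379 cluster-health
  initialDelaySeconds: 45   # how long etcd gets before the first probe
  timeoutSeconds: 10        # the probe must answer within this window
  periodSeconds: 10         # probe interval
  failureThreshold: 3       # consecutive failures before the kubelet kills the container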
After some testing, my issue was related to https://github.com/openshift/origin/issues/21609 (or https://bugzilla.redhat.com/show_bug.cgi?id=1655214) and a specific docker version. Replacing the existing version of docker with docker-1.13.1-75.git8633870.el7.centos seems to have fixed the issue.
I'm having the same problem. I could make the cluster semi-stable with some tweaking of the liveness probe configuration, but it's still restarting every few hours, ending with a kill signal:
pkg/osutil: received terminated signal, shutting down...
Also, in the describe output for the etcd pod, the last state looks like this:
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 10 Dec 2018 06:47:01 +0000
Finished: Mon, 10 Dec 2018 07:54:20 +0000
Despite that, when I run the liveness command it always returns exit code 0 and the cluster is healthy.
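To figure out what is actually sending the signal, I'm watching Docker events on the master around a restart, e.g.:
docker events --filter event=kill --filter event=die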
If the liveness probe does not cause the termination, then what does, and why? These are really OpenShift/k8s questions. My guess is the fundamental issue is deploying OpenShift on storage which cannot handle the I/O requirements of etcd. As the etcd/probe configurations appear static, they would need fine tuning based on your environment to work properly. In short, you can't expect etcd to perform as expected on a busy VM backed by an HDD or even a marginal SSD. The data-dir should be on dedicated storage if using an HDD, or at least a very fast SSD if shared.
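As a sanity check, a common way to test whether the data-dir storage keeps up with etcd's fsync pattern is an fio run like the one below (assuming fio is installed; it writes ~22MB of 2300-byte synchronously flushed blocks, roughly etcd's WAL pattern, into the data dir's device). The 99th-percentile fdatasync latency should ideally stay under about 10ms:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-disk-check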
I want to help but this probably should move over to https://github.com/openshift/origin.
If the liveness probe does not cause the termination, then what does, and why?
I think it is the liveness probe that is causing the termination, since changing liveness probe parameters like timeout and period changes the timing of the termination cycle.
I want to help but this probably should move over to https://github.com/openshift/origin.
I'm thinking the same.
This bug report may be related to this issue: https://github.com/openshift/origin/issues/21609
Thanks @vahid-ashrafian. OK, this is the same as what @stewartshea reported, so we have a common thread to the issue. I think we can close this as a docker issue rather than an etcd one. Since we have referenced the issue here, folks should be able to find the solution. We can revisit/reopen if necessary.
Thanks, the issue was indeed with the docker version. I downgraded to docker-1.13.1-75.git8633870.el7.centos and all is working well.
Steps:
I am using the installer at https://github.com/gshipley/installcentos.git to install OpenShift on CentOS.
...installcentos]$ sudo ansible-playbook -i inventory.ini openshift-ansible/playbooks/adhoc/uninstall.yml
yum remove -y docker docker-client docker-common
Then, in install-openshift.sh, pin the docker version:
#yum install -y wget git zile nano net-tools docker-1.13.1\
yum install -y wget git zile nano net-tools 2:docker-1.13.1-75.git8633870.el7.centos.x86_64 \
    bind-utils iptables-services \
    bridge-utils bash-completion \
installcentos]$ ./install-openshift.sh
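To keep a later yum update from pulling a newer (broken) docker build back in, the version can optionally be pinned (assuming the yum-plugin-versionlock package is available from your repos):
yum install -y yum-plugin-versionlock
yum versionlock docker-1.13.1-75.git8633870.el7.centos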
For some reason /etc/resolv.conf may not get updated with the content below; if that happens, modify it manually:
search cluster.local
nameserver <IP of the Host machine i.e. the machine where openshift is running>
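To verify the node actually resolves cluster names through that nameserver (dig comes from bind-utils, installed above; the standard kubernetes service name is used here just as an example):
dig @<IP of the Host machine> kubernetes.default.svc.cluster.local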