OpenShift installs successfully on CentOS 7 but then stops responding after some time.
Checking the etcd logs by running
/usr/local/bin/master-logs etcd etcd
reveals etcd shutting down over and over with the following message:
pkg/osutil: received terminated signal, shutting down...
@ajarv thanks for the report. To understand what happened we need to see the log files; can you attach them to the issue as a file? Also, if possible it would be good to see the metrics for the failed node. To obtain the metrics you just need to perform a GET against the /metrics endpoint, something like:
curl -s --key /etc/etcd/peer.key --cert /etc/etcd/peer.crt --cacert /etc/etcd/ca.crt https://$IP:2379/metrics > etcd_metrics.log
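If it helps triage, the first things worth checking in that dump are the disk latency histograms (metric names as of etcd 3.2 are etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds), so something like:
grep -E 'etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration_seconds' etcd_metrics.log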
The problem is that etcd itself is shutting down.
Logs:
2018-12-07 15:44:04.120240 I | etcdmain: Git SHA: 1674e682f
2018-12-07 15:44:04.120243 I | etcdmain: Go Version: go1.8.7
2018-12-07 15:44:04.120246 I | etcdmain: Go OS/Arch: linux/amd64
2018-12-07 15:44:04.120249 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2018-12-07 15:44:04.120279 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-12-07 15:44:04.120302 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-12-07 15:44:04.120927 I | embed: listening for peers on https://192.168.0.170:2380
2018-12-07 15:44:04.120967 I | embed: listening for client requests on 192.168.0.170:2379
2018-12-07 15:44:04.124065 I | etcdserver: name = deepak
2018-12-07 15:44:04.124078 I | etcdserver: data dir = /var/lib/etcd/
2018-12-07 15:44:04.124081 I | etcdserver: member dir = /var/lib/etcd/member
2018-12-07 15:44:04.124084 I | etcdserver: heartbeat = 1000ms
2018-12-07 15:44:04.124088 I | etcdserver: election = 5000ms
2018-12-07 15:44:04.124091 I | etcdserver: snapshot count = 100000
2018-12-07 15:44:04.124107 I | etcdserver: advertise client URLs = https://192.168.0.170:2379
2018-12-07 15:44:04.143725 I | etcdserver: restarting member fb72e5827ecf6f9b in cluster 2ec241dd1570096a at commit index 5269
2018-12-07 15:44:04.143974 I | raft: fb72e5827ecf6f9b became follower at term 34
2018-12-07 15:44:04.143991 I | raft: newRaft fb72e5827ecf6f9b [peers: [], term: 34, commit: 5269, applied: 0, lastindex: 5269, lastterm: 34]
2018-12-07 15:44:04.153232 W | auth: simple token is not cryptographically signed
2018-12-07 15:44:04.157302 I | etcdserver: starting server... [version: 3.2.22, cluster version: to_be_decided]
2018-12-07 15:44:04.157812 I | etcdserver/membership: added member fb72e5827ecf6f9b [https://192.168.0.170:2380] to cluster 2ec241dd1570096a
2018-12-07 15:44:04.157893 N | etcdserver/membership: set the initial cluster version to 3.2
2018-12-07 15:44:04.157938 I | etcdserver/api: enabled capabilities for version 3.2
2018-12-07 15:44:04.157993 I | embed: ClientTLS: cert = /etc/etcd/server.crt, key = /etc/etcd/server.key, ca = , trusted-ca = /etc/etcd/ca.crt, client-cert-auth = true
2018-12-07 15:44:12.144479 I | raft: fb72e5827ecf6f9b is starting a new election at term 34
2018-12-07 15:44:12.144558 I | raft: fb72e5827ecf6f9b became candidate at term 35
2018-12-07 15:44:12.144573 I | raft: fb72e5827ecf6f9b received MsgVoteResp from fb72e5827ecf6f9b at term 35
2018-12-07 15:44:12.144584 I | raft: fb72e5827ecf6f9b became leader at term 35
2018-12-07 15:44:12.144591 I | raft: raft.node: fb72e5827ecf6f9b elected leader fb72e5827ecf6f9b at term 35
2018-12-07 15:44:12.151792 I | etcdserver: published {Name:deepak ClientURLs:[https://192.168.0.170:2379]} to cluster 2ec241dd1570096a
2018-12-07 15:44:12.151803 I | embed: ready to serve client requests
2018-12-07 15:44:12.152019 I | embed: serving client requests on 192.168.0.170:2379
2018-12-07 15:45:13.287671 N | pkg/osutil: received terminated signal, shutting down...
WARNING: 2018/12/07 15:45:13 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 192.168.0.170:2379: getsockopt: connection refused"; Reconnecting to {192.168.0.170:2379 0 <nil>}
2018-12-07 15:45:13.787860 I | etcdserver: skipped leadership transfer for single member cluster
WARNING: 2018/12/07 15:45:13 grpc: addrConn.transportMonitor exits due to: context canceled
@ajarv I don't think this is an etcd issue per se, but more OpenShift. It looks like etcd starts fine but is then killed. Basically, I can reproduce the above by issuing master-restart etcd while tailing the logs with master-logs etcd etcd -f, so for some reason this is in a crash loop. Maybe the liveness probe is messed up?
/etc/origin/node/pods/etcd.yaml
Can you issue the liveness probe command against etcd manually? If that is the cause, etcd should run for about 45 seconds before being killed off.
master-restart etcd
etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health
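To watch it continuously rather than one shot at a time, a simple loop (same certs and $IP as above) works:
while true; do echo $(date); etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health; sleep 5; done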
Sam, thanks a lot for providing useful commands.
Very likely an issue with the liveness probe.
etcd does respond with a healthy cluster status for a while, then crashes, and then becomes live again:
Fri Dec 7 13:13:38 EST 2018
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.0.170:2379: getsockopt: connection refused
error #0: dial tcp 192.168.0.170:2379: getsockopt: connection refused
[root@deepak bin]# echo $(date) ; etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health
Fri Dec 7 13:13:39 EST 2018
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.0.170:2379 exceeded header timeout
error #0: client: endpoint https://192.168.0.170:2379 exceeded header timeout
[root@deepak bin]# echo $(date) ; etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health
Fri Dec 7 13:13:43 EST 2018
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.0.170:2379 exceeded header timeout
error #0: client: endpoint https://192.168.0.170:2379 exceeded header timeout
[root@deepak bin]# echo $(date) ; etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://$IP:2379 cluster-health
Fri Dec 7 13:13:46 EST 2018
member fb72e5827ecf6f9b is healthy: got healthy result from https://192.168.0.170:2379
cluster is healthy
Let me tweak the liveness probe.
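For reference, the knobs live in the livenessProbe stanza of the static pod manifest (/etc/origin/node/pods/etcd.yaml). A rough sketch of the relevant fields, with illustrative values rather than the shipped defaults:
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://192.168.0.170:2379 cluster-health
  initialDelaySeconds: 45   # how long etcd gets before the first probe
  timeoutSeconds: 10        # the probe must answer within this window
  periodSeconds: 10         # probe interval
  failureThreshold: 3       # consecutive failures before the kubelet kills the container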
After some testing, my issue was related to https://github.com/openshift/origin/issues/21609 (or https://bugzilla.redhat.com/show_bug.cgi?id=1655214) and a specific docker version. Replacing the existing version of docker with docker-1.13.1-75.git8633870.el7.centos seems to have fixed the issue.
I'm having the same problem. I could make the cluster semi-stable with some tweaking of the liveness probe configuration, but it's still restarting every few hours, ending with a kill signal:
pkg/osutil: received terminated signal, shutting down...
Also, in the describe output for the etcd pod, the last state looks like this:
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 10 Dec 2018 06:47:01 +0000
Finished: Mon, 10 Dec 2018 07:54:20 +0000
Despite that, when I run the liveness command it always returns exit code 0 and the cluster is healthy.
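To figure out what is actually sending the signal, I'm watching Docker events on the master around a restart, e.g.:
docker events --filter event=kill --filter event=die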
If the liveness probe does not cause the termination, then what does, and why? These are really OpenShift/k8s questions. My guess is the fundamental issue is deploying OpenShift on storage which cannot handle the I/O requirements of etcd. As the etcd/probe configurations appear static, they would need fine tuning based on your environment to work properly. In short, you can't expect etcd to perform as expected on a busy VM backed by an HDD or even a marginal SSD. The data-dir should be on dedicated storage if using an HDD, or at least a very fast SSD if shared.
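As a sanity check, a common way to test whether the data-dir storage keeps up with etcd's fsync pattern is an fio run like the one below (assuming fio is installed; it writes ~22MB of 2300-byte synchronously flushed blocks, roughly etcd's WAL pattern, into the data dir's device). The 99th-percentile fdatasync latency should ideally stay under about 10ms:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-disk-check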
I want to help but this probably should move over to https://github.com/openshift/origin.
If the liveness probe does not cause the termination, then what does, and why?
I think it is the liveness probe that is causing the termination, since changing liveness probe parameters like timeout and period changes the timing of the termination cycle.
I want to help but this probably should move over to https://github.com/openshift/origin.
I'm thinking the same.
This bug report may be related to this issue: https://github.com/openshift/origin/issues/21609
Thanks @vahid-ashrafian. OK, this is the same as what @stewartshea reported, so we have a common thread to the issue. I think we can close this as a docker issue rather than an etcd one. Since we have referenced the issue here, folks should be able to find the solution. We can revisit/reopen if necessary.
Thanks, the issue was indeed with the docker version. I downgraded to docker-1.13.1-75.git8633870.el7.centos and all is working well.
Steps:
I am using the installer at https://github.com/gshipley/installcentos.git to install OpenShift on CentOS.
...installcentos]$ sudo ansible-playbook -i inventory.ini openshift-ansible/playbooks/adhoc/uninstall.yml
yum remove -y docker docker-client docker-common
Then, in install-openshift.sh, pin the docker version:
#yum install -y wget git zile nano net-tools docker-1.13.1\
yum install -y wget git zile nano net-tools 2:docker-1.13.1-75.git8633870.el7.centos.x86_64 \
    bind-utils iptables-services \
    bridge-utils bash-completion \
installcentos]$ ./install-openshift.sh
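To keep a later yum update from pulling a newer (broken) docker build back in, the version can optionally be pinned (assuming the yum-plugin-versionlock package is available from your repos):
yum install -y yum-plugin-versionlock
yum versionlock docker-1.13.1-75.git8633870.el7.centos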
For some reason /etc/resolv.conf may not get updated with the content below; if that happens, modify it manually:
search cluster.local
nameserver <IP of the Host machine i.e. the machine where openshift is running>
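To verify the node actually resolves cluster names through that nameserver (dig comes from bind-utils, installed above; the standard kubernetes service name is used here just as an example):
dig @<IP of the Host machine> kubernetes.default.svc.cluster.local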