RKE: etcd component status returns unhealthy

Created on 24 Oct 2018 · 6 Comments · Source: rancher/rke

RKE version:

Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Docker version: (docker version,docker info preferred)

Containers: 34
 Running: 31
 Paused: 0
 Stopped: 3
Images: 23
Server Version: 17.03.2-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 190
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1067-aws
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.858 GiB
Name: ip-172-31-5-156
ID: EVYS:MBKF:63RZ:GDEU:MFSU:LLMN:NHLZ:N3YP:SPT2:COMZ:5JT7:H5UC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

ubuntu@ip-172-31-5-156:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
AWS

cluster.yml file:

nodes:
  - address: ec2-18-981-98-xxx.us-east-2.compute.amazonaws.com
    internal_address: 172.81.1.86
    user: ubuntu
    role: [controlplane,worker,etcd]
    ssh_key_path: rke-keypair.pem
  - address: ec2-18-191-161-xxx.us-east-2.compute.amazonaws.com
    internal_address: 172.31.3.176
    user: ubuntu
    role: [controlplane,worker,etcd]
    ssh_key_path: rke-keypair.pem
  - address: ec2-18-123-22-xxx.us-east-2.compute.amazonaws.com
    internal_address: 172.21.0.123
    user: ubuntu
    role: [controlplane,worker,etcd]
    ssh_key_path: rke-keypair.pem

cloud_provider:
    name: aws

Steps to Reproduce:

Run rke up --config rancher-cluster.yml
Get component status with kubectl get cs

Results:
etcd component's status returns as unhealthy.

➜  kubectl get cs
NAME                 STATUS      MESSAGE                                                                 ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-2               Healthy     {"health": "true"}
etcd-1               Healthy     {"health": "true"}
etcd-0               Unhealthy   Get https://172.31.5.156:2379/health: net/http: TLS handshake timeout

What's more interesting is that if I run kubectl get cs again, etcd-1 returns as unhealthy instead:

➜  kubectl get cs
NAME                 STATUS      MESSAGE                                                                 ERROR
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-2               Healthy     {"health": "true"}
etcd-1               Unhealthy   Get https://172.31.5.156:2379/health: net/http: TLS handshake timeout

But etcd-2 never returns unhealthy.
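A minimal sketch, assuming Python 3, of how the two outputs above could be compared. It extracts the endpoint from each unhealthy row, which makes it easier to see that the same address (172.31.5.156:2379) fails in both runs even though the etcd-N name changes. The function name and regular expressions are illustrative and not part of kubectl or RKE.

```python
import re

# One table row: NAME, STATUS, then the free-form MESSAGE column.
ROW = re.compile(r"^(?P<name>\S+)\s+(?P<status>Healthy|Unhealthy|Unknown)\s*(?P<message>.*)$")
# Scheme, host and port of the first URL found in the message.
URL = re.compile(r"https?://[^/\s]+")


def unhealthy_endpoints(text):
    """Map each non-healthy component name to the endpoint in its message (or None)."""
    result = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match or match.group("status") == "Healthy":
            continue  # header row, prompt line, or a healthy component
        url = URL.search(match.group("message"))
        result[match.group("name")] = url.group(0) if url else None
    return result


FIRST_RUN = """\
NAME                 STATUS      MESSAGE                                                                 ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-2               Healthy     {"health": "true"}
etcd-1               Healthy     {"health": "true"}
etcd-0               Unhealthy   Get https://172.31.5.156:2379/health: net/http: TLS handshake timeout
"""

SECOND_RUN = """\
NAME                 STATUS      MESSAGE                                                                 ERROR
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-2               Healthy     {"health": "true"}
etcd-1               Unhealthy   Get https://172.31.5.156:2379/health: net/http: TLS handshake timeout
"""

print(unhealthy_endpoints(FIRST_RUN))   # {'etcd-0': 'https://172.31.5.156:2379'}
print(unhealthy_endpoints(SECOND_RUN))  # {'etcd-1': 'https://172.31.5.156:2379'}
```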

I'd appreciate it if anyone could help me understand what is going on here.

On inspection of etcd logs, here's what I found:

2018-10-24 07:55:00.821282 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[5cd651bd2233737e]=b7c2d12a1db23c20, local=b80c2a51ca9c90a4)
2018-10-24 07:55:00.912345 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[d6b104e2cd4233e7]=b7c2d12a1db23c20, local=b80c2a51ca9c90a4)
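A minimal sketch, assuming Python 3 and the log line format shown above, of how the cluster IDs could be pulled out of the etcd log. In the two lines quoted here, both peers report the same cluster ID while the local member reports a different one, which suggests the local member is the one that does not belong to the peers' cluster. The function name is illustrative.

```python
import re

# Matches the "cluster ID mismatch" detail as it appears in the rafthttp log lines above.
MISMATCH = re.compile(
    r"cluster ID mismatch: peer\[(?P<peer>[0-9a-f]+)\]=(?P<peer_cluster>[0-9a-f]+), "
    r"local=(?P<local_cluster>[0-9a-f]+)"
)


def cluster_id_mismatches(log_text):
    """Return (peer_member_id, peer_cluster_id, local_cluster_id) for each mismatch line."""
    return [
        (m["peer"], m["peer_cluster"], m["local_cluster"])
        for m in MISMATCH.finditer(log_text)
    ]


LOG = """\
2018-10-24 07:55:00.821282 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[5cd651bd2233737e]=b7c2d12a1db23c20, local=b80c2a51ca9c90a4)
2018-10-24 07:55:00.912345 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[d6b104e2cd4233e7]=b7c2d12a1db23c20, local=b80c2a51ca9c90a4)
"""

rows = cluster_id_mismatches(LOG)
peer_clusters = {row[1] for row in rows}
local_clusters = {row[2] for row in rows}
print(peer_clusters)   # {'b7c2d12a1db23c20'}
print(local_clusters)  # {'b80c2a51ca9c90a4'}
```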


ishantanu

Most helpful comment

@superseb I can reproduce this. When a controlplane node is added to an existing cluster, the new node won't join the existing etcd cluster: its etcd member is created with a new cluster ID because of the --initial-cluster-state=new flag.

EDIT: This only occurs when you first remove a controlplane node from a cluster and then re-add it without removing the /var/lib/etcd data directory. I resolved it by removing the /var/lib/etcd directory and then re-adding the node to the cluster.
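A minimal sketch, assuming Python 3 is available on the node and that /var/lib/etcd is the data directory as named in this comment, of a pre-check that could be run before re-adding a node. The helper name is hypothetical and not part of RKE. It only reports whether leftover data is present and deletes nothing.

```python
from pathlib import Path


def has_leftover_etcd_data(data_dir="/var/lib/etcd"):
    """Return True if data_dir exists and contains at least one entry.

    Reading the directory may require root privileges on a real node.
    """
    path = Path(data_dir)
    return path.is_dir() and any(path.iterdir())
```

If this returns True on a node that was previously removed from the cluster, the old etcd state described in this comment is still on disk.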

piwi91 on 11 Nov 2019

👍4

All 6 comments

@iamShantanu101 are there specific steps to reproduce the issue? If so, could you please provide them?

alena1108 on 30 Nov 2018

So here's what I did:

Create an RKE cluster of 3 nodes with all nodes acting as worker,etcd,controlplane.
Remove the cluster with rke remove.
Replace a node in rancher-cluster.yml config file
Run rke up --config rancher-cluster.yml

To explain further: I have hit this issue twice so far. From my inspection:

The nodes that are being reused to create the new RKE cluster are causing the issue.
The etcd container on these nodes was in a Restarting state.
Looking at the logs, I saw: etcdmain: member 32d3fdedbe9b75fb has already been bootstrapped
This means etcd somehow recognizes the current node as already being a member (this might be related to cleaning up the etcd data dir, but I'm not sure).

And here's what I did to resolve it (a bit hacky and might not be the right solution, but it worked):

Change the --initial-cluster-state=new argument to --initial-cluster-state=existing on the nodes where the etcd component is failing.
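A minimal sketch, assuming Python 3, of the substitution described above applied to an etcd argument list. It does not show where RKE stores these arguments, which this thread does not state. The --name value in the usage note is a hypothetical placeholder, and only the flag=value form is handled.

```python
FLAG = "--initial-cluster-state="


def set_initial_cluster_state(args, state):
    """Return a copy of args with the --initial-cluster-state value replaced.

    etcd accepts "new" or "existing" for this flag. Arguments passed as two
    separate items ("--initial-cluster-state", "new") are not handled here.
    """
    if state not in ("new", "existing"):
        raise ValueError("state must be 'new' or 'existing'")
    return [FLAG + state if arg.startswith(FLAG) else arg for arg in args]
```

For example, set_initial_cluster_state(["--name=etcd-example", "--initial-cluster-state=new"], "existing") returns the same list with the second item changed to --initial-cluster-state=existing.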

ishantanu on 30 Nov 2018

I'm experiencing the same, or very similar, issue.

And here's what I did to resolve it (a bit hacky and might not be the right solution, but it worked):
Change --initial-cluster-state=new argument to --initial-cluster-state=existing on the nodes where etcd component is failing.

Where do you add these flags?

magick93 on 3 Jan 2019

No manual intervention should be needed to set up an X-node etcd cluster. Let me know if this still reproduces on the latest release.

superseb on 13 Mar 2019

Issue: etcd-0 component status returns unhealthy on a Rancher cluster.

I followed these steps:

kubectl get cs

NAME                 AGE
controller-manager
scheduler
etcd-0
etcd-2
etcd-1

kubectl describe cs etcd-0

https://xxx.xx.xx.181:2379/health: dial tcp xxx.xx.xx:2379: connect: connection refused

Connect to the node at IP xxx.xx.xx.181 and list its containers:
sudo docker ps

189f88bb2b0f rancher/coreos-etcd:v3.3.10-rancher1 "/usr/local/bin/etcd…" 10 months ago Up About a minute etcd

Check the container status:
sudo docker inspect 189f88bb2b0f

If the container has exited, start it, then check again:
kubectl describe cs etcd-0

Name: etcd-0
Namespace:
Labels:
Annotations:
API Version: v1
Conditions:
  Message: {"health":"true"}
  Status: True
  Type: Healthy
Kind: ComponentStatus
Metadata:
  Creation Timestamp:
  Self Link: /api/v1/componentstatuses/etcd-0
Events:

It works fine...
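The manual steps above can be sketched as follows, assuming Python 3.7+ on the node, the docker CLI on the PATH, and a container named etcd as shown in the docker ps output. The function names are illustrative. The run parameter exists only so the commands can be swapped out for testing, and the source commands use sudo, so the script may need to run as root.

```python
import subprocess


def container_status(name, run=subprocess.run):
    """Return the container state reported by docker inspect, e.g. "running" or "exited"."""
    result = run(
        ["docker", "inspect", "--format", "{{.State.Status}}", name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def ensure_running(name="etcd", run=subprocess.run):
    """Start the container if it is not running, then return its new status."""
    status = container_status(name, run)
    if status == "running":
        return status
    run(["docker", "start", name], check=True)
    return container_status(name, run)
```

Calling ensure_running("etcd") on the affected node mirrors the inspect, start, and re-check sequence described in this comment. It does not address why the container exited in the first place.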