RKE version:
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Docker version: (docker version,docker info preferred)
Containers: 34
Running: 31
Paused: 0
Stopped: 3
Images: 23
Server Version: 17.03.2-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 190
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-1067-aws
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.858 GiB
Name: ip-172-31-5-156
ID: EVYS:MBKF:63RZ:GDEU:MFSU:LLMN:NHLZ:N3YP:SPT2:COMZ:5JT7:H5UC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
ubuntu@ip-172-31-5-156:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
AWS
cluster.yml file:
nodes:
  - address: ec2-18-981-98-xxx.us-east-2.compute.amazonaws.com
    internal_address: 172.81.1.86
    user: ubuntu
    role: [controlplane,worker,etcd]
    ssh_key_path: rke-keypair.pem
  - address: ec2-18-191-161-xxx.us-east-2.compute.amazonaws.com
    internal_address: 172.31.3.176
    user: ubuntu
    role: [controlplane,worker,etcd]
    ssh_key_path: rke-keypair.pem
  - address: ec2-18-123-22-xxx.us-east-2.compute.amazonaws.com
    internal_address: 172.21.0.123
    user: ubuntu
    role: [controlplane,worker,etcd]
    ssh_key_path: rke-keypair.pem
cloud_provider:
  name: aws
Steps to Reproduce:
rke up --config rancher-cluster.yml
kubectl get cs
Results:
etcd component's status returns as unhealthy.
➜ kubectl get cs
NAME                 STATUS      MESSAGE              ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-2               Healthy     {"health": "true"}
etcd-1               Healthy     {"health": "true"}
etcd-0               Unhealthy   Get https://172.31.5.156:2379/health: net/http: TLS handshake timeout
What's more interesting is that if I run kubectl get cs again, etcd-1 is reported as unhealthy instead:
➜ kubectl get cs
NAME                 STATUS      MESSAGE              ERROR
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-2               Healthy     {"health": "true"}
etcd-1               Unhealthy   Get https://172.31.5.156:2379/health: net/http: TLS handshake timeout
But etcd-2 never returns unhealthy.
I'd appreciate if anyone can help me to understand what is going on with this.
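To compare repeated runs like the two above, it can help to reduce each `kubectl get cs` output to just the failing rows. The following is a minimal Python sketch, assuming the whitespace-separated column layout shown above; the function name `unhealthy_components` is hypothetical and not part of kubectl or RKE.

```python
def unhealthy_components(get_cs_output):
    """Return {component_name: message} for rows whose STATUS is not 'Healthy'.

    Assumes the `kubectl get cs` layout quoted in this thread:
    NAME, STATUS, then free-form message text.
    """
    result = {}
    for line in get_cs_output.strip().splitlines():
        parts = line.split(None, 2)  # name, status, rest of the line
        if len(parts) < 2 or parts[0] == "NAME":
            continue  # skip the header and anything that is not a data row
        name, status = parts[0], parts[1]
        message = parts[2] if len(parts) > 2 else ""
        if status != "Healthy":
            result[name] = message
    return result


if __name__ == "__main__":
    sample = """\
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-2 Healthy {"health": "true"}
etcd-1 Healthy {"health": "true"}
etcd-0 Unhealthy Get https://172.31.5.156:2379/health: net/http: TLS handshake timeout
"""
    for name, message in unhealthy_components(sample).items():
        print(name, "->", message)
```

Applied to both outputs quoted above, the failing row names the same endpoint, https://172.31.5.156:2379, even though the row label changes from etcd-0 to etcd-1.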
On inspection of etcd logs, here's what I found:
2018-10-24 07:55:00.821282 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[5cd651bd2233737e]=b7c2d12a1db23c20, local=b80c2a51ca9c90a4)
2018-10-24 07:55:00.912345 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[d6b104e2cd4233e7]=b7c2d12a1db23c20, local=b80c2a51ca9c90a4)
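Both log lines above report peers whose cluster ID differs from the local one. A minimal Python sketch for extracting those IDs from etcd log text follows, assuming the exact message format quoted above (other etcd versions may word it differently); `find_cluster_id_mismatches` is a hypothetical helper name, not an etcd or RKE API.

```python
import re

# Pattern for the rafthttp message quoted in this thread (assumed format).
MISMATCH_RE = re.compile(
    r"cluster ID mismatch: peer\[(?P<peer>[0-9a-f]+)\]=(?P<remote>[0-9a-f]+), "
    r"local=(?P<local>[0-9a-f]+)"
)


def find_cluster_id_mismatches(log_text):
    """Return a list of (peer_id, peer_cluster_id, local_cluster_id) tuples."""
    return [
        (m.group("peer"), m.group("remote"), m.group("local"))
        for m in MISMATCH_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "2018-10-24 07:55:00.821282 E | rafthttp: request sent was ignored "
        "(cluster ID mismatch: peer[5cd651bd2233737e]=b7c2d12a1db23c20, "
        "local=b80c2a51ca9c90a4)\n"
    )
    for peer, remote, local in find_cluster_id_mismatches(sample):
        print(f"peer {peer}: peer cluster {remote} != local cluster {local}")
```

In the two lines quoted above, both peers report cluster b7c2d12a1db23c20 while the local member reports b80c2a51ca9c90a4, which suggests the local member is the one that bootstrapped a separate cluster.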
@iamShantanu101 are there any finite steps to reproduce the issue? If yes, could you please provide them?
So here's what I did:
worker,etcd,controlplane.
rke remove.
rancher-cluster.yml config file
rke up --config rancher-cluster.yml
To explain further, I have hit this issue twice so far. From my inspection:
The etcd container on these nodes was in a Restarting state.
etcdmain: member 32d3fdedbe9b75fb has already been bootstrapped
And here's what I did to resolve it (a bit hacky and might not be the right solution, but it worked):
Change --initial-cluster-state=new argument to --initial-cluster-state=existing on the nodes where etcd component is failing.
I'm experiencing the same, or very similar, issue.
And here's what I did to resolve it (a bit hacky and might not be the right solution, but it worked):
Change --initial-cluster-state=new argument to --initial-cluster-state=existing on the nodes where etcd component is failing.
Where do you add these flags?
There should be no manual intervention needed to set up an X-node etcd cluster. Let me know if this still reproduces on the latest release.
@superseb I can reproduce this. When a controlplane node is added to an existing cluster, the new node won't join the existing etcd cluster: because of the --initial-cluster-state=new flag, its etcd member is created with a new cluster ID.
EDIT: This only occurs when you first remove a controlplane node from a cluster and then re-add it without removing the /var/lib/etcd data directory. I resolved it by removing the /var/lib/etcd directory and then re-adding the node in the cluster.
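Since the failure described above depends on leftover data in /var/lib/etcd, a quick check before re-adding a node can flag that condition. This is a minimal Python sketch under that assumption; `has_leftover_etcd_data` is a hypothetical helper, not part of RKE, and it only reports the condition without deleting anything.

```python
import os


def has_leftover_etcd_data(data_dir="/var/lib/etcd"):
    """Return True if data_dir exists and contains at least one entry.

    Reading /var/lib/etcd may need elevated privileges; a PermissionError
    is deliberately not swallowed so the caller notices.
    """
    try:
        with os.scandir(data_dir) as entries:
            return any(True for _ in entries)
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    if has_leftover_etcd_data():
        print("Leftover etcd data found; clear it before re-adding this node.")
    else:
        print("No leftover etcd data found.")
```

Run on the node that was removed, before running rke up again.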
I followed these steps:
kubectl get cs
NAME AGE
controller-manager
scheduler
etcd-0
etcd-2
kubectl describe cs etcd-0
Connect to the node with IP xxx.xx.xx.181:
sudo docker ps
189f88bb2b0f rancher/coreos-etcd:v3.3.10-rancher1 "/usr/local/bin/etcd…" 10 months ago Up About a minute etcd
Check the container status:
sudo docker inspect 189f88bb2b0f
If the container status is exited, start the container and check again:
kubectl describe cs etcd-0
Name: etcd-0
Namespace:
Labels:
Annotations:
API Version: v1
Conditions:
Message: {"health":"true"}
Status: True
Type: Healthy
Kind: ComponentStatus
Metadata:
Creation Timestamp:
Self Link: /api/v1/componentstatuses/etcd-0
Events:
It works fine...
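The container check in the steps above can be scripted by reading the JSON that `docker inspect` prints (an array of objects, each with a `State.Status` field). A minimal Python sketch, assuming that output has been captured as a string; `container_states` is a hypothetical helper name.

```python
import json


def container_states(inspect_json):
    """Map container name -> State.Status from `docker inspect` JSON output."""
    states = {}
    for entry in json.loads(inspect_json):
        # Docker reports names with a leading slash, e.g. "/etcd".
        name = entry.get("Name", "").lstrip("/")
        states[name] = entry.get("State", {}).get("Status", "unknown")
    return states


if __name__ == "__main__":
    # Trimmed, illustrative sample; real output has many more fields.
    sample = '[{"Name": "/etcd", "State": {"Status": "exited"}}]'
    for name, status in container_states(sample).items():
        if status != "running":
            print(f"{name} is {status}; consider starting it and re-checking")
```

This only reports the state; whether restarting the container is the right fix depends on why it stopped.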