Getting "Failed to get /health for host - remote error: tls: bad certificate" when trying to upgrade an existing cluster. No modifications to the certificates have been made.
RKE version:
rke version v0.2.1
Docker version:
Client:
Version: 18.06.3-ce
API version: 1.38
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:27:18 2019
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.3-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:26:20 2019
OS/Arch: linux/amd64
Experimental: false
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
16.04.4 LTS (Xenial Xerus) 4.4.0-116-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
ESXi Virtual Machine
cluster.yml file:
nodes:
  - address: 10.10.7.121
    user: daniel
    role: [controlplane,worker,etcd]
  - address: 10.10.7.122
    user: daniel
    role: [controlplane,worker,etcd]
  - address: 10.10.7.123
    user: daniel
    role: [controlplane,worker,etcd]
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
Steps to Reproduce:
./rke -d up
Results:
...
DEBU[0028] [remove/rke-log-linker] Container doesn't exist on host [10.10.7.123]
DEBU[0028] [etcd] Checking image [rancher/rke-tools:v0.1.27] on host [10.10.7.123]
DEBU[0028] Checking if image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123]
DEBU[0028] Image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123]
DEBU[0028] [etcd] No pull necessary, image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123]
INFO[0029] [etcd] Successfully started [rke-log-linker] container on host [10.10.7.123]
DEBU[0029] [remove/rke-log-linker] Checking if container is running on host [10.10.7.123]
DEBU[0029] [remove/rke-log-linker] Removing container on host [10.10.7.123]
INFO[0029] [remove/rke-log-linker] Successfully removed container on host [10.10.7.123]
DEBU[0029] [etcd] Successfully created log link for Container [etcd] on host [10.10.7.123]
INFO[0029] [etcd] Successfully started etcd plane.. Checking etcd cluster health
DEBU[0029] [etcd] Check etcd cluster health
DEBU[0029] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate
DEBU[0034] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate
DEBU[0039] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate
DEBU[0044] [etcd] Check etcd cluster health
DEBU[0045] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate
DEBU[0050] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate
DEBU[0055] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate
DEBU[0060] [etcd] Check etcd cluster health
DEBU[0060] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate
DEBU[0065] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate
DEBU[0070] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate
FATA[0075] [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy
Seeing same behavior on the following:
Docker version:
```
Client:
Version: 18.09.3
API version: 1.39
Go version: go1.10.8
Git commit: 774a1f4
Built: Thu Feb 28 06:33:21 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.3
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 774a1f4
Built: Thu Feb 28 06:02:24 2019
OS/Arch: linux/amd64
Experimental: true
```
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Red Hat Enterprise Linux Server release 7.6 (Maipo) 3.10.0-957.10.1.el7.x86_64
I am seeing the same error as well.
Same here. It broke my cluster.
I downgraded rke to 0.1.17 and re-ran rke up on my config with a working kube_config file in place. It works again.
Edit:
Tested with an experimental cluster:
Even a newly created cluster (rke 0.2.1 + k8s 13.5) fails with the error tls: bad certificate when the same cluster.yml is run a second time against the same cluster with rke up --config cluster.yml.
Docker: 18.06 on RHEL 7.5
Exactly the same problem with our RKE setup.
Same here. We had to go back to 0.1.17 to upgrade.
A user has also reported this with RKE v0.2.0 in case #3824, although I have been unable to reproduce at this time.
Per a conversation with @Oats87, I have now reproduced a cause for this error when attempting to upgrade a cluster with RKE v0.2.0 or v0.2.1.
If the kube_config_<file>.yml file is absent from the local directory when you run rke up, RKE treats the cluster as new rather than as a legacy cluster. This results in the fatal error [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy, with debug messages of the form Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate.
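Since the failure described above hinges on the kubeconfig being absent from the local directory, a small pre-flight check before running rke up can catch it early. The following is only a sketch: `expected_kubeconfig` and `preflight_kubeconfig` are hypothetical helper names, not part of RKE, and the script assumes the kubeconfig is named by prefixing the cluster config's file name with `kube_config_` and sits in the same directory (which matches the kube_config_cluster.yml example for cluster.yml).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (not part of RKE).
# Assumption: the kubeconfig lives next to the cluster config and is named
# kube_config_<cluster config file name>, e.g. kube_config_cluster.yml.

# Print the kubeconfig path expected for a given cluster config path.
expected_kubeconfig() {
  local config="$1"
  local dir base
  dir="$(dirname -- "$config")"
  base="$(basename -- "$config")"
  printf '%s/kube_config_%s\n' "$dir" "$base"
}

# Return 0 if the expected kubeconfig exists, 1 (with a warning) otherwise.
preflight_kubeconfig() {
  local kubeconfig
  kubeconfig="$(expected_kubeconfig "$1")"
  if [ -f "$kubeconfig" ]; then
    echo "found $kubeconfig"
    return 0
  fi
  echo "missing $kubeconfig: do not run rke 0.2.x against a legacy cluster yet" >&2
  return 1
}
```

One way to use it would be `preflight_kubeconfig cluster.yml && ./rke -d up`, so that the upgrade only starts when the kubeconfig is present.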
Reproducer
1- rke up using RKE v0.1.7
2- Remove the kube_config_<file>.yml file
3- rke -d up using RKE v0.2.0 or v0.2.1
4- [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy error with /health: remote error: tls: bad certificate messages.
Workaround
Upon encountering this issue as a result of the missing kube_config_<file>.yml during upgrade, the following workaround can be used:
# Remove your `<file>.rkestate` file
# Log into all of your control plane nodes and run:
rm -f /etc/kubernetes/ssl/kube-service-account-token-key.pem
rm -f /etc/kubernetes/ssl/kube-service-account-token.pem
cp /etc/kubernetes/ssl/kube-apiserver-key.pem /etc/kubernetes/ssl/kube-service-account-token-key.pem
cp /etc/kubernetes/ssl/kube-apiserver.pem /etc/kubernetes/ssl/kube-service-account-token.pem
# Run an `rke up` with RKE 0.1.17
# Run an `rke up` with RKE 0.2.0/0.2.1
@axeal the workaround is missing the additional step of "Remove your kube_config_<file>.yml file" at the beginning, so that when you run the rke up with 0.1.x RKE re-generates a valid kube_config_<file>.yml
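The per-node certificate step of the workaround can be wrapped in a small function so it is harder to run half of it by accident. This is a sketch only: `reset_sa_token_certs` is a hypothetical helper name, the file names are the ones listed in the workaround, and the directory defaults to /etc/kubernetes/ssl as in the commands above. It does not remove the .rkestate or kube_config files and does not run rke.

```shell
#!/usr/bin/env bash
# Sketch of the per-node certificate step from the workaround in this thread.
# reset_sa_token_certs is a hypothetical helper, not an RKE command.
# Run it on each control plane node as a user that can write to the directory.

reset_sa_token_certs() {
  local ssl_dir="${1:-/etc/kubernetes/ssl}"
  local f

  # Refuse to change anything unless both source files exist.
  for f in kube-apiserver-key.pem kube-apiserver.pem; do
    if [ ! -f "$ssl_dir/$f" ]; then
      echo "missing $ssl_dir/$f, nothing changed" >&2
      return 1
    fi
  done

  # Remove the service account token key pair ...
  rm -f "$ssl_dir/kube-service-account-token-key.pem" \
        "$ssl_dir/kube-service-account-token.pem"

  # ... and replace it with copies of the kube-apiserver key pair.
  cp "$ssl_dir/kube-apiserver-key.pem" "$ssl_dir/kube-service-account-token-key.pem"
  cp "$ssl_dir/kube-apiserver.pem" "$ssl_dir/kube-service-account-token.pem"
}
```

Following the order given in the thread, this would be called (for example as `reset_sa_token_certs /etc/kubernetes/ssl`) after removing the local kube_config_<file>.yml and <file>.rkestate files, and before running rke up with RKE 0.1.17 and then with RKE 0.2.0/0.2.1.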
Upgrading from RKE 0.1.16 to RKE 0.2.1 #3824 initially failed possibly due to cluster.yaml name change.
On consecutive attempts of creating a new cluster with RKE 0.1.16 followed by upgrading it with 0.2.1 we had the same error mentioned. Once we removed the
The PR should prevent this behavior by checking whether the kubeconfig is missing and whether the cluster is a legacy cluster. If the kubeconfig is missing and the cluster turns out to be a legacy cluster, RKE will fail with the following error:
This is a legacy cluster with no kube config, aborting upgrade. Please re-run rke up with rke 0.1.x to retrieve correct state
We didn't allow rke 0.2 to handle this situation by fetching state from the nodes because that would open up a lot of unnecessary edge cases to deal with.
QA verification steps:
1- create RKE cluster with version 0.1.x
2- rename kube_config_cluster.yml to kube_config_cluster.yml.old
3- run RKE version 0.2.x on this cluster
Expected Result
RKE should fail with the error above. To restore the cluster state, run rke 0.1.x again and it will recreate the kubeconfig successfully.
Can be tested with rancher/rancher:v2.2.3-rc2
➤ Jack Luo commented:
Wait for an RKE release to include the fix (https://github.com/rancher/rke/commit/7a0406c44fac163139b2dab22a4f4d47a96e4b10)
Can be tested with standalone rke v0.2.3-rc1
rke v0.2.3-rc1 does not include the fix https://github.com/rancher/rke/commit/7a0406c44fac163139b2dab22a4f4d47a96e4b10
Wait for a new RKE release to validate the fix.
cc @alena1108
@jiaqiluo can be tested with v0.2.3-rc2
The bug fix is validated on rke v0.2.3-rc2
Following the steps from the above comment (https://github.com/rancher/rke/issues/1244#issuecomment-485991428),
I see the following error message, as expected:
FATA[0006] This is a legacy cluster with no kube config, aborting upgrade. Please re-run rke up with rke 0.1.x to retrieve correct state
I got the same error with v0.2.4
I have the same error with the latest rke v0.2.4.
I have an HA cluster and everything works great, but when I try to add a new node with rke up --update-only it fails with "remote error: tls: bad certificate".
I have the same error with the latest rke v0.2.4: when I try to add a new node with rke up --update-only, it fails with "Failed to get /health for host : Get https://xxx.xxx.xx.xx:2379/health: remote error: tls: bad certificate".
I got the same error with v0.3.2 when trying to add a new node with rke up --update-only.
message:
...
INFO[0015] [etcd] Successfully started etcd plane.. Checking etcd cluster health
DEBU[0015] [etcd] Check etcd cluster health
DEBU[0015] Failed to get /health for host [192.168.9.14]: Get https://192.168.9.14:2379/health: remote error: tls: bad certificate
DEBU[0020] Failed to get /health for host [192.168.9.14]: Get https://192.168.9.14:2379/health: remote error: tls: bad certificate
DEBU[0025] Failed to get /health for host [192.168.9.14]: Get https://192.168.9.14:2379/health: remote error: tls: bad certificate
FATA[0030] [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy
We're having the same error as @MartinYangTW. Any advice on how to handle this error?
I'm having the same error with rke version v1.1.5-rc5 on Ubuntu 18.04 x86_64, Docker version 19.03.6, build 369ce74a3c.
I had the same issue using version v1.0.4 but my problem was solved by @axeal's answer. However, after deleting my .rkestate I got this error and had to recreate it. This script might be handy if someone needs to recreate it.