Rke: Failed to get /health for host - remote error: tls: bad certificate

Created on 1 Apr 2019 · 23 Comments · Source: rancher/rke

Getting "Failed to get /health for host - remote error: tls: bad certificate" when trying to upgrade an existing cluster. No modifications have been made to the certificates.

RKE version:
rke version v0.2.1

Docker version:

Client:
 Version:           18.06.3-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        d7080c1
 Built:             Wed Feb 20 02:27:18 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.3-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       d7080c1
  Built:            Wed Feb 20 02:26:20 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
16.04.4 LTS (Xenial Xerus) 4.4.0-116-generic

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
ESXi Virtual Machine

cluster.yml file:

nodes:
  - address: 10.10.7.121
    user: daniel
    role: [controlplane,worker,etcd]
  - address: 10.10.7.122
    user: daniel
    role: [controlplane,worker,etcd]
  - address: 10.10.7.123
    user: daniel
    role: [controlplane,worker,etcd]

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h

Steps to Reproduce:
./rke -d up

Results:
...
DEBU[0028] [remove/rke-log-linker] Container doesn't exist on host [10.10.7.123]
DEBU[0028] [etcd] Checking image [rancher/rke-tools:v0.1.27] on host [10.10.7.123]
DEBU[0028] Checking if image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123]
DEBU[0028] Image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123]
DEBU[0028] [etcd] No pull necessary, image [rancher/rke-tools:v0.1.27] exists on host [10.10.7.123]
INFO[0029] [etcd] Successfully started [rke-log-linker] container on host [10.10.7.123]
DEBU[0029] [remove/rke-log-linker] Checking if container is running on host [10.10.7.123]
DEBU[0029] [remove/rke-log-linker] Removing container on host [10.10.7.123]
INFO[0029] [remove/rke-log-linker] Successfully removed container on host [10.10.7.123]
DEBU[0029] [etcd] Successfully created log link for Container [etcd] on host [10.10.7.123]
INFO[0029] [etcd] Successfully started etcd plane.. Checking etcd cluster health
DEBU[0029] [etcd] Check etcd cluster health
DEBU[0029] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate
DEBU[0034] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate
DEBU[0039] Failed to get /health for host [10.10.7.121]: Get https://10.10.7.121:2379/health: remote error: tls: bad certificate
DEBU[0044] [etcd] Check etcd cluster health
DEBU[0045] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate
DEBU[0050] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate
DEBU[0055] Failed to get /health for host [10.10.7.122]: Get https://10.10.7.122:2379/health: remote error: tls: bad certificate
DEBU[0060] [etcd] Check etcd cluster health
DEBU[0060] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate
DEBU[0065] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate
DEBU[0070] Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate
FATA[0075] [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy
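
When the debug output is long, it can help to summarize which etcd hosts failed the /health probe and how often. The following is a minimal sketch that assumes the log message format shown in this report; the function name `failed_health_hosts` is hypothetical and not part of RKE.

```shell
# Summarize failed etcd /health probes from `rke -d up` output.
# Assumes the message format shown in this issue; the helper name is hypothetical.
failed_health_hosts() {
  # Reads log text on stdin and prints "<count> <host>" for each failing host.
  # grep -o also copes with several log entries joined on one line.
  grep -o 'Failed to get /health for host \[[^]]*]' \
    | sed 's/.*\[\(.*\)]/\1/' \
    | sort | uniq -c | awk '{print $1, $2}'
}
```

Possible usage, assuming the output was saved to a file named rke-debug.log: `failed_health_hosts < rke-debug.log`. If every etcd host appears in the result, the problem is more likely cluster-wide (for example mismatched certificates) than a single broken node.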

kind/bug

danielbjornadal

All 23 comments

Seeing same behavior on the following:

Docker version:
```
Client:
Version: 18.09.3
API version: 1.39
Go version: go1.10.8
Git commit: 774a1f4
Built: Thu Feb 28 06:33:21 2019
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 18.09.3
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 774a1f4
Built: Thu Feb 28 06:02:24 2019
OS/Arch: linux/amd64
Experimental: true
```

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Red Hat Enterprise Linux Server release 7.6 (Maipo) 3.10.0-957.10.1.el7.x86_64

scottdoane on 1 Apr 2019

I am seeing the same error as well.

tlvenn on 2 Apr 2019

Same here. This broke my cluster.
I downgraded rke to 0.1.17 and re-ran rke up on my config with a previously working kube_config in place. It works again.

Edit:
Tested with an experimental cluster:
Even a newly generated cluster (rke 0.2.1 + k8s 1.13.5) fails with the error tls: bad certificate when the same cluster.yml is run a second time against the same cluster with rke up --config cluster.yml.

Docker: 18.06 on RHEL 7.5

ChrisHaPunkt on 2 Apr 2019

Exactly the same problem with our RKE setup.

mruepp on 4 Apr 2019

Same here. We had to go back to 0.1.17 to upgrade.

remche on 8 Apr 2019

A user has also reported this with RKE v0.2.0 in case #3824, although I have been unable to reproduce at this time.

axeal on 9 Apr 2019

Per a conversation with @Oats87, I have now reproduced a cause of this error when attempting to upgrade a cluster with RKE v0.2.0 or v0.2.1.

If the kube_config_<file>.yml file is absent from the local directory when you perform rke up, RKE treats the cluster as a new cluster rather than a legacy one. This results in the fatal error [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy, with debug messages of the form Failed to get /health for host [10.10.7.123]: Get https://10.10.7.123:2379/health: remote error: tls: bad certificate.
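
A quick pre-flight check along these lines could catch the missing file before an upgrade is attempted. This is only a sketch: the function names are hypothetical, and it assumes the kubeconfig sits next to the cluster config and is named kube_config_<config file name> (for example kube_config_cluster.yml for cluster.yml).

```shell
# Hypothetical pre-flight check to run before `rke up` with RKE 0.2.x.
# Assumption: the kubeconfig lives beside the cluster config and is named
# kube_config_<config file name>, e.g. kube_config_cluster.yml.
kubeconfig_for() {
  # Print the kubeconfig path expected for the given cluster config path.
  config=$1
  printf '%s/kube_config_%s\n' "$(dirname "$config")" "$(basename "$config")"
}

preflight_check() {
  config=${1:-cluster.yml}
  kubeconfig=$(kubeconfig_for "$config")
  if [ ! -f "$kubeconfig" ]; then
    echo "missing $kubeconfig: an existing cluster would be treated as new" >&2
    return 1
  fi
  echo "found $kubeconfig"
}
```

Possible usage: `preflight_check cluster.yml && ./rke up --config cluster.yml`, so that rke only runs when the kubeconfig is present.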

Reproducer

1- Instantiate a simple single node cluster with rke up using RKE v0.1.7
2- Remove the kube_config_<file>.yml file
3- Attempt to upgrade cluster via rke -d up using RKE v0.2.0 or v0.2.1
4- Observe [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy error with /health: remote error: tls: bad certificate messages.

Workaround
Upon encountering this issue as a result of the missing kube_config_<file>.yml during upgrade, the following workaround can be used:

# Remove your `<file>.rkestate` file

# Log into all of your control plane nodes and run:

rm -f /etc/kubernetes/ssl/kube-service-account-token-key.pem
rm -f /etc/kubernetes/ssl/kube-service-account-token.pem
cp /etc/kubernetes/ssl/kube-apiserver-key.pem /etc/kubernetes/ssl/kube-service-account-token-key.pem
cp /etc/kubernetes/ssl/kube-apiserver.pem /etc/kubernetes/ssl/kube-service-account-token.pem

# Run an `rke up` with RKE 0.1.17

# Run an `rke up` with RKE 0.2.0/0.2.1

axeal on 10 Apr 2019

@axeal the workaround is missing the additional step of "Remove your kube_config_<file>.yml file" at the beginning, so that when you run rke up with 0.1.x, RKE regenerates a valid kube_config_<file>.yml.
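
Putting the thread together, the order is: remove kube_config_<file>.yml and <file>.rkestate locally, replace the service account token key files on every control plane node, run rke up with RKE 0.1.17, then run rke up with RKE 0.2.0/0.2.1. The node-side step can be wrapped in a small function, sketched below. The function name and the optional directory argument are hypothetical, and the .bak backup copies are an addition for safety that the original workaround does not include.

```shell
# Node-side part of the workaround, run on each control plane node.
# Hypothetical helper; the *.bak backups are NOT part of the original workaround.
reset_service_account_token_key() {
  ssl_dir=${1:-/etc/kubernetes/ssl}
  # Refuse to touch anything unless both source files are present.
  for suffix in -key.pem .pem; do
    if [ ! -f "$ssl_dir/kube-apiserver$suffix" ]; then
      echo "missing $ssl_dir/kube-apiserver$suffix" >&2
      return 1
    fi
  done
  for suffix in -key.pem .pem; do
    src="$ssl_dir/kube-apiserver$suffix"
    dst="$ssl_dir/kube-service-account-token$suffix"
    # Keep a copy of the file being replaced, if there is one.
    if [ -f "$dst" ]; then
      cp -p "$dst" "$dst.bak"
    fi
    rm -f "$dst"
    cp "$src" "$dst" || return 1
  done
}
```

Calling `reset_service_account_token_key` with no argument operates on /etc/kubernetes/ssl, as in the commands above. If extra files in that directory are a concern, move the .bak copies elsewhere once the cluster is healthy again.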

Oats87 on 10 Apr 2019

Upgrading from RKE 0.1.16 to RKE 0.2.1 #3824 initially failed, possibly due to the cluster.yaml name change.
On consecutive attempts at creating a new cluster with RKE 0.1.16 and then upgrading it with 0.2.1, we hit the same error mentioned above. Once we removed the .rkestate file as suggested by @axeal, the issue was resolved.

segeva on 11 Apr 2019

The PR should prevent this behavior by checking whether the kubeconfig is missing and whether the cluster is a legacy cluster. If the kubeconfig is missing and the cluster turns out to be a legacy cluster, RKE will fail with the following error:

This is a legacy cluster with no kube config, aborting upgrade. Please re-run rke up with rke 0.1.x to retrieve correct state

We didn't allow rke 0.2 to handle this situation by fetching state from nodes because it will open up a lot of unnecessary edge cases to deal with.
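
The decision described above can be sketched as a small guard. This is an illustration only and not the actual RKE code (RKE is written in Go); the function name is hypothetical, and since how RKE detects a legacy cluster is not described here, that result is simply passed in as an argument.

```shell
# Illustrative guard mirroring the behavior described in this comment.
# Not the real RKE implementation; legacy detection is an assumed input.
upgrade_guard() {
  kubeconfig=$1   # path to kube_config_<file>.yml
  is_legacy=$2    # "true" if the cluster was created by rke 0.1.x
  if [ ! -f "$kubeconfig" ] && [ "$is_legacy" = "true" ]; then
    echo "This is a legacy cluster with no kube config, aborting upgrade. Please re-run rke up with rke 0.1.x to retrieve correct state" >&2
    return 1
  fi
  return 0
}
```

Only the combination of a missing kubeconfig and a legacy cluster aborts; a new cluster without a kubeconfig, or a legacy cluster with one, proceeds.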

QA verification steps:

1- create RKE cluster with version 0.1.x
2- rename kube_config_cluster.yml to kube_config_cluster.yml.old
3- run RKE version 0.2.x on this cluster

Expected Result

RKE should fail with the error above. To restore the cluster state, run rke 0.1.x again and it will recreate the kubeconfig successfully.

galal-hussein on 24 Apr 2019

Can be tested with rancher/rancher:v2.2.3-rc2

alena1108 on 26 Apr 2019

➤ Jack Luo commented:

Wait for an RKE release to include the fix (https://github.com/rancher/rke/commit/7a0406c44fac163139b2dab22a4f4d47a96e4b10)

jira-sync-svc on 26 Apr 2019

Can be tested with standalone rke v0.2.3-rc1

alena1108 on 27 Apr 2019

rke v0.2.3-rc1 does not include the fix https://github.com/rancher/rke/commit/7a0406c44fac163139b2dab22a4f4d47a96e4b10

Wait for a new RKE release to validate the fix.

cc @alena1108

jiaqiluo on 29 Apr 2019

@jiaqiluo can be tested with v0.2.3-rc2

alena1108 on 4 May 2019

The bug fix is validated on rke v0.2.3-rc2

Following the steps from the above comment (https://github.com/rancher/rke/issues/1244#issuecomment-485991428):

1- create RKE cluster with version 0.1.x
2- rename kube_config_cluster.yml to kube_config_cluster.yml.old
3- run RKE version 0.2.x on this cluster

see the following error message as expected:

FATA[0006] This is a legacy cluster with no kube config, aborting upgrade. Please re-run rke up with rke 0.1.x to retrieve correct state

jiaqiluo on 4 May 2019

I got the same error with v0.2.4

wusphinx on 27 Jun 2019

I have the same error with the latest rke v0.2.4.
I have an HA cluster and everything works great, but when I try to add a new node with rke up --update-only, it fails with "remote error: tls: bad certificate".

branislav-brujic on 2 Jul 2019

I have the same error with the latest rke v0.2.4. When I try to add a new node with rke up --update-only, it fails with "Failed to get /health for host : Get https://xxx.xxx.xx.xx:2379/health: remote error: tls: bad certificate"

getefuxing on 6 Aug 2019

I got the same error with v0.3.2 when trying to add a new node with rke up --update-only.
Message:
...
INFO[0015] [etcd] Successfully started etcd plane.. Checking etcd cluster health
DEBU[0015] [etcd] Check etcd cluster health
DEBU[0015] Failed to get /health for host [192.168.9.14]: Get https://192.168.9.14:2379/health: remote error: tls: bad certificate
DEBU[0020] Failed to get /health for host [192.168.9.14]: Get https://192.168.9.14:2379/health: remote error: tls: bad certificate
DEBU[0025] Failed to get /health for host [192.168.9.14]: Get https://192.168.9.14:2379/health: remote error: tls: bad certificate
FATA[0030] [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy

MartinYangTW on 30 Oct 2019

We're having the same error as @MartinYangTW. Any advice on how to handle this error?

Retrospector on 11 Nov 2019

I'm having the same error with rke version v1.1.5-rc5 on Ubuntu 18.04 x86_64, Docker version 19.03.6, build 369ce74a3c.

arkanmgerges on 13 Aug 2020

I had the same issue using version v1.0.4 but my problem was solved by @axeal's answer. However, after deleting my .rkestate I got this error and had to recreate it. This script might be handy if someone needs to recreate it.