Openshift-ansible: etcd started with error: transport: remote error: tls: bad certificate; please retry

Created on 14 Nov 2017  ·  11 Comments  ·  Source: openshift/openshift-ansible

Description

I tried to install OpenShift Origin using the openshift-ansible playbook from the release-3.6 branch. During the installation, the step that enables and starts origin-master failed with the following error:

Failure summary:


  1. Hosts:    master.example.com
     Play:     Configure masters
     Task:     restart master
     Message:  Unable to restart service origin-master: Job for origin-master.service failed because a timeout was exceeded. See "systemctl status origin-master.service" and "journalctl -xe" for details.

I checked the service log of origin-master and found many tls: handshake timeout errors.

I then checked the etcd logs with journalctl -u etcd -lf and got:

-- Logs begin at Sun 2017-10-15 05:53:50 CST. --
Nov 14 14:52:10 master.example.com etcd[31068]: 785ad3e4f1b9ce8b became leader at term 2
Nov 14 14:52:10 master.example.com etcd[31068]: raft.node: 785ad3e4f1b9ce8b elected leader 785ad3e4f1b9ce8b at term 2
Nov 14 14:52:10 master.example.com etcd[31068]: setting up the initial cluster version to 3.2
Nov 14 14:52:10 master.example.com etcd[31068]: published {Name:master.example.com ClientURLs:[https://192.168.123.155:2379]} to cluster 74619f9d53805edf
Nov 14 14:52:10 master.example.com etcd[31068]: ready to serve client requests
Nov 14 14:52:10 master.example.com systemd[1]: Started Etcd Server.
Nov 14 14:52:10 master.example.com etcd[31068]: set the initial cluster version to 3.2
Nov 14 14:52:10 master.example.com etcd[31068]: enabled capabilities for version 3.2
Nov 14 14:52:10 master.example.com etcd[31068]: serving client requests on 192.168.123.155:2379
Nov 14 14:52:10 master.example.com etcd[31068]: Failed to dial 192.168.123.155:2379: connection error: desc = "transport: remote error: tls: bad certificate"; please retry.
Version

  • Your ansible version per ansible --version
ansible 2.4.0.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Aug  4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]

If you're operating from a git clone:

  • The output of git describe
openshift-ansible-3.6.173.0.75-1
Steps To Reproduce
  1. Run ansible-playbook /usr/share/openshift-ansible/playbooks/byo/config.yml

Additional Information

#
# ansible hosts file 
#
# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes
etcd
lb

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=root

# If ansible_ssh_user is not root, ansible_become must be set to true
#ansible_become=true

openshift_deployment_type=origin

# Specify the generic release of OpenShift to install. This is used mainly just during installation, after which we
# rely on the version running on the first master. Works best for containerized installs where we can usually
# use this to lookup the latest exact version of the container images, which is the tag actually used to configure
# the cluster. For RPM installations we just verify the version detected in your configured repos matches this
# release.
openshift_release=v3.6.1

# Specify an exact container image tag to install or configure.
# WARNING: This value will be used for all hosts in containerized environments, even those that have another version installed.
# This could potentially trigger an upgrade and downtime, so be careful with modifying this value after the cluster is set up.
openshift_image_tag=v3.6.1

# Specify an exact rpm version to install or configure.
# WARNING: This value will be used for all hosts in RPM based environments, even those that have another version installed.
# This could potentially trigger an upgrade and downtime, so be careful with modifying this value after the cluster is set up.
openshift_pkg_version=-3.6.1

# uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]

#openshift_repos_enable_testing=true
openshift_disable_check=disk_availability,docker_storage
#docker_selinux_enabled=false
#openshift_docker_options=" --log-driver=journald --storage-driver=overlay "

# Alternate image format string, useful if you've got your own registry mirror
# Configure this setting just on node or master
#oreg_url_master=example.com/openshift3/ose-${component}:${version}
#oreg_url_node=example.com/openshift3/ose-${component}:${version}
# For setting the configuration globally
oreg_url=registry.example.com:30000/openshift/origin-${component}:${version}
# If oreg_url points to a registry other than registry.access.redhat.com we can
# modify image streams to point at that registry by setting the following to true
openshift_examples_modify_imagestreams=true
openshift_docker_additional_registries=registry.example.com:30000
openshift_docker_insecure_registries=registry.example.com:30000

openshift_hosted_manage_registry=false

# OpenShift Router Options
# Router selector (optional)
# Router will only be created if nodes matching this label are present.
# Default value: 'region=infra'
openshift_hosted_router_selector='region=infra,router=true'

# default subdomain to use for exposed routes
openshift_master_default_subdomain=app.example.com

# host group for masters
[masters]
master.example.com

# host group for etcd
[etcd]
master.example.com

# Load balancers
[lb]
lb.example.com

# host group for nodes, includes region info
[nodes]
master.example.com openshift_schedulable=true openshift_node_labels="{'region': 'infra', 'router': 'true'}"
node01.example.com openshift_schedulable=true openshift_node_labels="{'region': 'infra', 'router': 'true'}"
node02.example.com openshift_schedulable=true openshift_node_labels="{'region': 'infra', 'router': 'true'}"

lifecycle/rotten

All 11 comments

More logs from journalctl as below:

[root@master ~]# journalctl -xe -u etcd -l
Nov 14 16:35:13 master.example.com etcd[92962]: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = , trusted-ca = /etc/etcd/ca.crt, cl
Nov 14 16:35:13 master.example.com etcd[92962]: listening for peers on https://192.168.123.155:2380
Nov 14 16:35:13 master.example.com etcd[92962]: listening for client requests on 192.168.123.155:2379
Nov 14 16:35:13 master.example.com etcd[92962]: name = master.example.com
Nov 14 16:35:13 master.example.com etcd[92962]: data dir = /var/lib/etcd/
Nov 14 16:35:13 master.example.com etcd[92962]: member dir = /var/lib/etcd/member
Nov 14 16:35:13 master.example.com etcd[92962]: heartbeat = 500ms
Nov 14 16:35:13 master.example.com etcd[92962]: election = 2500ms
Nov 14 16:35:13 master.example.com etcd[92962]: snapshot count = 100000
Nov 14 16:35:13 master.example.com etcd[92962]: advertise client URLs = https://192.168.123.155:2379
Nov 14 16:35:13 master.example.com etcd[92962]: initial advertise peer URLs = https://192.168.123.155:2380
Nov 14 16:35:13 master.example.com etcd[92962]: initial cluster = master.example.com=https://192.168.123.155:2380
Nov 14 16:35:13 master.example.com etcd[92962]: starting member 785ad3e4f1b9ce8b in cluster 74619f9d53805edf
Nov 14 16:35:13 master.example.com etcd[92962]: 785ad3e4f1b9ce8b became follower at term 0
Nov 14 16:35:13 master.example.com etcd[92962]: newRaft 785ad3e4f1b9ce8b [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
Nov 14 16:35:13 master.example.com etcd[92962]: 785ad3e4f1b9ce8b became follower at term 1
Nov 14 16:35:13 master.example.com etcd[92962]: simple token is not cryptographically signed
Nov 14 16:35:13 master.example.com etcd[92962]: starting server... [version: 3.2.7, cluster version: to_be_decided]
Nov 14 16:35:13 master.example.com etcd[92962]: ClientTLS: cert = /etc/etcd/server.crt, key = /etc/etcd/server.key, ca = , trusted-ca = /etc/etcd/ca.c
Nov 14 16:35:13 master.example.com etcd[92962]: added member 785ad3e4f1b9ce8b [https://192.168.123.155:2380] to cluster 74619f9d53805edf
Nov 14 16:35:15 master.example.com etcd[92962]: 785ad3e4f1b9ce8b is starting a new election at term 1
Nov 14 16:35:15 master.example.com etcd[92962]: 785ad3e4f1b9ce8b became candidate at term 2
Nov 14 16:35:15 master.example.com etcd[92962]: 785ad3e4f1b9ce8b received MsgVoteResp from 785ad3e4f1b9ce8b at term 2
Nov 14 16:35:15 master.example.com etcd[92962]: 785ad3e4f1b9ce8b became leader at term 2
Nov 14 16:35:15 master.example.com etcd[92962]: raft.node: 785ad3e4f1b9ce8b elected leader 785ad3e4f1b9ce8b at term 2
Nov 14 16:35:15 master.example.com etcd[92962]: setting up the initial cluster version to 3.2
Nov 14 16:35:15 master.example.com etcd[92962]: set the initial cluster version to 3.2
Nov 14 16:35:15 master.example.com etcd[92962]: enabled capabilities for version 3.2
Nov 14 16:35:15 master.example.com etcd[92962]: published {Name:master.example.com ClientURLs:[https://192.168.123.155:2379]} to cluster 74619f
Nov 14 16:35:15 master.example.com etcd[92962]: ready to serve client requests
Nov 14 16:35:15 master.example.com etcd[92962]: serving client requests on 192.168.123.155:2379
Nov 14 16:35:15 master.example.com systemd[1]: Started Etcd Server.
-- Subject: Unit etcd.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd.service has finished starting up.
--
-- The start-up result is done.
Nov 14 16:35:15 master.example.com etcd[92962]: Failed to dial 192.168.123.155:2379: connection error: desc = "transport: remote error: tls: bad certi

I am having the same issue. etcd is unable to start and the OpenShift installation is halted with the message:

Unable to restart service origin-master-api: Job for origin-master-api.service failed because the control process exited with error code. See "systemctl status origin-master-api.service" and "journalctl -xe" for details.

Log outputs (hostnames and IPs have been redacted):

$ systemctl status -l origin-master-api.service
...
Nov 27 11:37:04 master.example.com openshift[16145]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [AAAA::BBBB::CCCC:DDDD:EEEE:FFFF]:2379: getsockopt: connection refused"; Reconnecting to {master.example.com:2379 <nil>}
...
$ journalctl -u etcd -lf
...
Nov 27 08:25:18 master.example.com systemd[1]: Started Etcd Server.
Nov 27 08:25:18 master.example.com etcd[3090]: ready to serve client requests
Nov 27 08:25:18 master.example.com etcd[3090]: serving client requests on 192.168.123.123:2379
Nov 27 08:25:19 master.example.com etcd[3090]: Failed to dial 192.168.123.123:2379: connection error: desc = "transport: remote error: tls: bad certificate"; please retry.
...

I think that the common denominator among the people who are encountering this issue is that they all have a single etcd node instead of a cluster.

I have tried with the latest master branch and also with the release-3.7 branch and got the same error. I have also tried redeploying the certificates.

The certificates seem valid to me and include all the relevant IP addresses and hosts. I will post them just in case it helps.

I will try to check whether the certificate logic differs between a single-host and a multi-host setup, but I am not very experienced with Ansible.

This is my inventory file for reference:

[OSEv3:children]
masters
nodes

[OSEv3:vars]
ansible_ssh_user=ansible
ansible_become=true
dynamic_volumes_check=False
openshift_metrics_cassandra_storage_type=dynamic
openshift_logging_storage_kind=dynamic
openshift_deployment_type=origin
openshift_disable_check=memory_availability,disk_availability

openshift_master_identity_providers=[{'name': 'google', 'challenge': 'false', 'login': 'true', 'mappingMethod': 'claim', 'kind': 'GoogleIdentityProvider', 'clientID': 'xxxxxxxxx', 'clientSecret': 'xxxxxxx', 'hostedDomain': 'example.com'}]

openshift_master_named_certificates=[{"certfile": "some.cert", "keyfile": "some.key", "names": ["external.example.com", "*.external.example.com"]}]
openshift_master_overwrite_named_certificates=true

openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=s3
openshift_hosted_registry_storage_s3_accesskey=xxxxxxxxxx
openshift_hosted_registry_storage_s3_secretkey=xxxxxxxxxx
openshift_hosted_registry_storage_s3_bucket=xxxxxxxxxx
openshift_hosted_registry_storage_s3_region=us-west-1

[masters]
master.example.com openshift_public_hostname=external.example.com openshift_ip=192.168.111.444

[etcd]
master.example.com openshift_public_hostname=external.example.com openshift_ip=192.168.111.444

[nodes]
master.example.com openshift_public_hostname=external.example.com openshift_node_labels="{'region': 'infra', 'zone': 'west'}" openshift_ip=192.168.111.444
node1.example.com openshift_node_labels="{'region': 'primary', 'zone': 'west'}" openshift_ip=192.168.111.333
node2.example.com openshift_node_labels="{'region': 'primary', 'zone': 'west'}" openshift_ip=192.168.111.222
$ openssl x509 -in /etc/origin/master/etcd.server.crt -text -noout

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 6 (0x6)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=openshift-signer@1511772597
        Validity
            Not Before: Nov 27 08:49:59 2017 GMT
            Not After : Nov 27 08:50:00 2019 GMT
        Subject: CN=172.30.0.1
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:e0:61:3b:c0:41:ed:0e:58:00:f4:99:ef:20:35:
                    5d:c3:f8:d7:81:70:9d:c0:46:0f:6e:f0:9e:f9:35:
                    92:67:b6:b5:93:7f:8e:8b:fd:19:43:fd:74:a8:85:
                    fb:96:4e:6f:5c:ba:b1:47:0e:88:39:17:f4:77:2f:
                    2b:98:57:c0:fa:cd:94:52:33:ec:d5:da:c1:6a:e7:
                    ed:54:6f:65:84:46:33:8e:67:b9:29:e4:63:b9:c2:
                    b1:7d:37:ce:4b:fb:ee:df:77:b1:f7:61:ba:4f:cb:
                    29:07:95:fb:73:e0:fe:28:28:85:a3:c1:c8:ef:17:
                    4d:52:f9:5c:a0:21:c8:ad:c3:fa:52:8f:91:db:15:
                    a6:66:b0:10:94:37:f3:ae:44:5b:b1:95:19:73:67:
                    d0:60:1a:d7:75:e7:db:de:9c:57:5d:52:b1:ad:f1:
                    18:1b:e0:4d:a4:ee:22:6f:b5:69:8c:91:a2:e8:9a:
                    f6:5a:d6:da:fe:a1:69:d3:29:fc:be:ce:98:ce:2f:
                    9c:46:99:65:c3:83:b5:72:be:3c:1a:83:ce:18:c6:
                    dd:63:09:aa:d2:8e:68:4e:30:7c:84:87:70:e9:8d:
                    f8:49:a3:80:69:ea:92:24:40:31:f4:42:8c:ef:11:
                    82:ac:86:47:f3:1b:13:07:67:2e:22:65:67:7e:a6:
                    f3:5d
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Alternative Name:
                DNS:external.example.com, DNS:master.example.com, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, DNS:openshift, DNS:openshift.default, DNS:openshift.default.svc, DNS:openshift.default.svc.cluster.local, DNS:172.30.0.1, DNS:192.168.xxx.xx, DNS:74.207.xxx.x, IP Address:172.30.0.1, IP Address:192.168.xxx.xx, IP Address:74.207.xxx.x
    Signature Algorithm: sha256WithRSAEncryption
         4c:18:c7:15:27:3a:d8:d2:8f:b2:6f:f2:d1:27:91:1c:25:bc:
         a8:21:41:df:72:46:3c:00:c8:96:36:8b:70:77:db:f5:c9:27:
         98:57:d3:73:a9:af:23:23:26:29:b3:64:25:67:3a:f5:44:0c:
         8a:34:f6:79:ee:e4:c1:51:77:27:ed:c0:86:c6:e8:06:2a:08:
         a3:3a:9f:5a:22:1d:a1:55:81:c6:cd:76:98:e9:ed:cf:35:a5:
         7a:69:38:f6:ce:4e:e3:79:dc:8f:22:ee:62:25:e2:34:7d:26:
         33:2e:23:f0:1d:9c:e4:c2:95:84:39:85:54:0e:dd:ff:1c:62:
         51:d6:98:2a:0c:fe:8b:c5:01:b3:f6:2c:1f:51:6b:06:f9:23:
         86:fc:fe:85:e3:51:8c:99:5b:71:c9:a8:ee:15:f0:90:61:a4:
         a4:89:f2:cd:7f:49:db:e6:d0:8c:e6:d7:96:cb:d5:80:56:8a:
         43:7c:4b:57:8d:62:39:9f:d2:fa:fa:64:94:a3:14:fa:41:5a:
         23:55:4f:85:25:e2:ed:97:49:2b:e7:ae:f2:e7:84:91:f8:d0:
         6e:bb:6a:7d:1e:c1:6f:9b:df:6d:3e:9f:75:9e:d7:c6:2c:ee:
         bb:ac:f7:5b:74:85:e0:94:e6:f2:a5:fa:e1:51:7b:ef:a0:c6:
         71:89:bb:4f
$ openssl x509 -in /etc/origin/master/master.etcd-client.crt  -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 3 (0x3)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=etcd-signer@1511770342
        Validity
            Not Before: Nov 27 08:23:53 2017 GMT
            Not After : Nov 26 08:23:53 2022 GMT
        Subject: CN=master.example.com
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:ae:18:45:dc:62:6f:53:d2:9a:3b:80:bc:a9:97:
                    13:5f:d6:c6:18:44:f1:7e:29:16:f3:72:1f:83:4a:
                    0c:d7:6c:0d:b8:9c:cb:a8:03:cc:ab:0c:93:19:87:
                    c7:d3:25:9d:46:60:34:04:60:fc:d5:de:de:a3:43:
                    ff:db:67:d8:2e:6d:c4:89:7a:c6:84:f1:26:27:eb:
                    8c:6c:4f:42:52:99:d0:9e:98:f2:b9:c0:4d:e0:2d:
                    98:0a:8e:70:6c:a2:f7:40:92:4d:ee:c7:af:fe:3a:
                    65:d8:97:e5:b8:b5:92:b8:87:aa:9c:0f:0b:b8:9c:
                    25:47:d4:e7:8c:ff:3c:36:f2:0f:fd:1a:7f:17:75:
                    f1:eb:e0:03:3e:9f:4b:c1:8b:93:1f:85:b5:d7:6b:
                    de:df:7a:87:d4:fc:15:53:72:52:51:53:c2:98:ee:
                    85:05:91:6d:59:1f:0c:bb:4d:e9:1a:c9:a3:c0:a4:
                    04:12:1c:d5:c9:99:ce:8b:bf:41:88:ca:a2:d8:bb:
                    4f:26:11:98:a7:b8:e1:d3:45:07:2c:65:35:0f:94:
                    6d:af:dd:e9:b0:49:22:34:26:5d:0a:29:1a:33:00:
                    5a:2c:f7:74:d2:20:f9:fa:e5:d4:a2:ce:c7:94:4e:
                    da:1d:d6:36:d6:65:6e:ee:71:6c:f0:fd:ad:8d:14:
                    95:1f
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Authority Key Identifier:
                keyid:D5:EF:AA:49:64:1C:65:27:43:15:EE:97:21:D0:2E:83:05:FA:A3:D9
                DirName:/CN=etcd-signer@1511770342
                serial:B9:BA:70:10:23:D5:EA:0D

            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Extended Key Usage:
                TLS Web Client Authentication
            X509v3 Key Usage:
                Digital Signature, Key Encipherment
            X509v3 Subject Key Identifier:
                B7:BB:15:D0:68:AB:73:DD:77:D1:7B:3D:39:02:BD:98:B0:FE:FC:D3
            X509v3 Subject Alternative Name:
                IP Address:192.168.xxx.xx, DNS:master.example.com
    Signature Algorithm: sha256WithRSAEncryption
         2b:08:53:a1:ff:06:0b:d5:17:9f:89:75:1d:95:eb:2f:16:8e:
         1b:52:f6:a8:0d:1e:f6:2f:82:01:59:85:9b:61:da:4b:85:78:
         66:49:5a:98:a1:b8:4e:fc:dd:d4:18:a0:52:bc:44:34:30:8e:
         64:b9:22:a7:d0:57:69:2f:ba:1b:d1:00:b4:a8:9b:0f:0e:dd:
         ac:31:b4:de:c4:a3:3c:0c:86:98:07:e8:2f:6f:21:4a:96:e5:
         c3:c1:a1:4a:27:e2:4d:07:89:60:6d:c0:ae:b9:85:a1:63:0a:
         fb:5d:47:5e:0c:39:d1:97:4d:76:a3:1c:cf:95:38:8c:cb:05:
         17:a0:d5:6f:9b:e9:93:74:56:89:f1:d6:b4:82:40:a8:d0:1e:
         55:3c:dc:7c:dd:87:03:61:28:f0:0e:95:9d:55:a2:53:d5:af:
         2c:a7:2d:f4:f2:4f:0f:97:78:3d:98:4b:9b:d5:8b:44:fd:59:
         eb:ad:a9:8c:c3:62:c1:44:48:0b:98:1f:28:fe:f4:b6:03:7a:
         08:9e:ec:bf:f3:3c:fc:9f:2a:8c:ef:c8:ac:b6:4a:94:8c:c8:
         9d:9e:68:51:0e:82:60:a4:92:3c:5c:52:b5:e0:e7:fa:9f:cd:
         9f:97:0d:5b:ba:08:d1:38:23:e6:8f:16:1c:50:55:67:bc:b3:
         8a:64:7d:a9:4f:7b:e5:55:5c:7f:6b:50:55:35:86:f3:7c:5f:
         78:b2:f0:94:5e:21:73:32:97:8a:68:0d:1c:2c:54:79:c8:fa:
         0f:34:e7:72:7d:0b:8f:d9:5f:70:02:2f:fa:11:43:d9:3e:44:
         f7:0a:99:73:0f:1e:9e:44:9a:67:1f:97:51:16:be:38:21:61:
         2a:8a:86:e1:e1:fc:f4:29:9e:35:9c:af:7e:1c:0b:fb:9d:1f:
         bb:d2:c1:0d:46:32:48:15:fe:f8:38:27:5f:e2:4c:d7:34:ae:
         66:22:6f:d4:bd:e3:3d:da:5f:22:67:80:f5:2d:d9:d7:d4:64:
         b3:00:c9:29:09:41:60:d8:bc:ef:22:72:8d:a5:5b:38:55:f0:
         19:e2:bb:a8:5a:ae:c0:0d:c2:3e:03:c8:2e:9e:df:2c:28:0d:
         37:b8:28:e2:9a:30:b8:66:14:2c:c1:ee:fd:de:bc:5e:2c:d7:
         3d:e6:fe:02:07:8c:1f:b7:a8:53:b6:48:d0:ea:06:ea:30:3e:
         1e:13:c8:1d:3b:7a:73:e4:d0:15:40:5e:be:d8:94:44:c1:4d:
         5b:a2:f2:de:9b:b4:96:3d:95:e9:8a:14:6f:3e:e5:73:52:be:
         3d:0a:0e:fb:73:00:24:9b:26:69:d5:12:29:e9:71:09:05:49:
         73:39:3a:c8:0d:be:5f:1f
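
Checking whether a given name or IP is in a certificate's Subject Alternative Name list is easier with a small filter than by reading the full dump. The following is a minimal sketch, not part of openshift-ansible; the function names san_entries and has_san are made up for illustration, and it assumes the SAN list sits on the single line after the "X509v3 Subject Alternative Name" header, as it does in the dumps above.

```shell
# Hypothetical helpers; the names are illustrative, not part of any tool.

# Print one SAN entry per line from `openssl x509 -text -noout` output on stdin.
san_entries() {
  awk '
    found {
      gsub(/^[ \t]+/, "")                 # drop the indentation
      n = split($0, parts, ", ")          # entries are separated by ", "
      for (i = 1; i <= n; i++) print parts[i]
      exit
    }
    /X509v3 Subject Alternative Name/ { found = 1 }
  '
}

# Succeed if the certificate text on stdin lists the given entry exactly,
# for example "DNS:master.example.com" or "IP Address:192.168.123.155".
has_san() {
  san_entries | grep -Fxq -- "$1"
}
```

Usage would look like `openssl x509 -in /etc/etcd/server.crt -text -noout | has_san "IP Address:192.168.123.155" && echo present`, where the path and address are taken from the etcd logs earlier in this thread and need to be adapted to the host being checked.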

Let me know if I can provide any more information to help! I will try and do some more investigation.

This also seems similar to, if not the same as, this issue: https://github.com/openshift/openshift-ansible/issues/6087

I have done some more investigation and found that:

  1. If I start the etcd server manually with:
sudo /usr/bin/etcd --name=master.example.com --data-dir=/var/lib/etcd/ --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt  --listen-client-urls=https://192.168.xxx.xx:2379 --advertise-client-urls=https://192.168.xxx.xx:2379 --debug

I don't get a cert error when the server starts.

  2. Even with the cert error the etcd server is still in the running/active state. It does not crash.
  3. When I manually do a health check with
etcdctl -C https://${ETCD_CA_HOST}:2379 \
           --ca-file=/etc/etcd/ca.crt  \
           --cert-file=/etc/etcd/peer.crt \
           --key-file=/etc/etcd/peer.key \
           cluster-health

If ETCD_CA_HOST is localhost or the hostname of my master, the connection fails.
If ETCD_CA_HOST is the openshift_ip of my master, it works, whether or not the certificate error is present.
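
The comparison above can be repeated for several candidate hosts in one pass. This is only a sketch that wraps the same etcdctl invocation in a loop; the function name check_etcd_endpoints and the ETCDCTL override variable are assumptions introduced here, and the certificate paths are the ones used in the command above.

```shell
# ETCDCTL can be overridden, for example to point at a specific binary.
ETCDCTL="${ETCDCTL:-etcdctl}"

# Run the cluster-health check against each host given as an argument
# and print "<host> ok" or "<host> failed" depending on the exit status.
check_etcd_endpoints() {
  for host in "$@"; do
    if "$ETCDCTL" -C "https://${host}:2379" \
         --ca-file=/etc/etcd/ca.crt \
         --cert-file=/etc/etcd/peer.crt \
         --key-file=/etc/etcd/peer.key \
         cluster-health >/dev/null 2>&1; then
      echo "${host} ok"
    else
      echo "${host} failed"
    fi
  done
}
```

For example, `check_etcd_endpoints localhost master.example.com 192.168.123.155` would show at a glance which names reach the listener. A "failed" line only says the check did not succeed; the reason (refused connection, certificate mismatch, timeout) still has to be read from the etcdctl output by rerunning that one command without the redirection.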

Therefore I think the following is happening: the etcd daemon is set up to listen on the private openshift_ip, but the origin-master daemon tries to connect to etcd using the master's public hostname. That hostname resolves to a different IP than openshift_ip, so the etcd daemon refuses the connection.
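
One way to test this theory is to compare what the master's hostname resolves to with the address etcd listens on. The sketch below is an illustration under assumptions: mismatched_addrs is a made-up helper name, and it only compares address strings, so it approximates what the origin-master process sees rather than reproducing its resolver.

```shell
# Read one resolved address per line on stdin and print those that differ
# from the address etcd listens on (first argument). The exit status is 0
# only when every address matches.
mismatched_addrs() {
  awk -v want="$1" '
    NF && $1 != want { print $1; bad = 1 }
    END { exit bad + 0 }
  '
}
```

On a glibc-based host the input can come from getent, for example `getent ahosts master.example.com | awk '{print $1}' | sort -u | mismatched_addrs 192.168.123.155`. Any address it prints, including an IPv6 one, is a candidate for the connection that etcd never sees.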

@eliu Does your master hostname resolve to an IPv6 address by any chance? Another theory I have is that certain services are listening on the master's IPv4 address but the hostname resolves to an IPv6 address, so the connection gets refused. I should be able to test this theory soon.

I managed to solve my problem by doing the following:

  1. I removed AAAA/IPv6 records from my DNS records.
  2. I created DNS entries for the internal IPs in my network and used the openshift_hostname to specify them in the inventory file.
master.example.com openshift_ip=192.168.xxx.xx openshift_hostname=master.internal.example.com

This ensured that when the origin-master-api.service tried to connect to the etcd daemon, the hostname resolved to the private IP that the etcd daemon was listening on, not the public IP of the master.

It seems that the openshift_ip parameter is ignored by the master-api when resolving services.

This has not fixed the "transport: remote error: tls: bad certificate" message, but the etcd daemon seems to work regardless, and the installation completed successfully.

@NoxHarmonium Nope, I use IPv4 only. I recently found that the main cause is not this etcd bad certificate issue but a port conflict between the master and the lb: I had set up the master and the lb on the same server node.
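
The comment does not say which port conflicted. As a hedged sketch for spotting this kind of clash, the filter below lists TCP listeners on a chosen port from `ss -tlnp` output; listeners_on_port is a made-up helper name, and using 8443 in the example assumes the conflict was on the master API port.

```shell
# Read `ss -tlnp` output on stdin and print every LISTEN line whose local
# port (the part of column 4 after the last colon) equals the first argument.
listeners_on_port() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] == port) print
    }
  '
}
```

Running `ss -tlnp | listeners_on_port 8443` as root on the shared node before starting origin-master would show whether another process, such as the load balancer, already holds the port.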

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

