Kubespray: etcd cluster is unavailable or misconfigured: connection refused

Created on 13 May 2018  ·  65 Comments  ·  Source: kubernetes-sigs/kubespray

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG

Environment:

  • Cloud provider or hardware configuration:
    VMware Fusion
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 3.10.0-862.2.3.el7.x86_64 x86_64
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Version of Ansible (ansible --version):
    ansible 2.5.2

Kubespray version (commit) (git rev-parse --short HEAD):
REL v2.5.0
commit: 02cd5418
Network plugin used:
default: calico

Copy of your inventory file:

[all]
node1    ansible_host=192.168.140.191 ip=192.168.140.191
node2    ansible_host=192.168.140.192 ip=192.168.140.192
node3    ansible_host=192.168.140.193 ip=192.168.140.193

[kube-master]
node1
node2

[kube-node]
node1
node2
node3

[etcd]
node1
node2
node3

[k8s-cluster:children]
kube-node
kube-master

[calico-rr]

[vault]
node1
node2
node3

Command used to invoke ansible:
ansible-playbook --flush-cache -u myuser -b -i inventory/mycluster/hosts.ini cluster.yml

Output of ansible run:

Errors along the lines of:
fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://192.168.140.191:2379,https://192.168.140.192:2379,https://192.168.140.193:2379 member list | grep -q 192.168.140.191", "delta": "0:00:00.020942", "end": "2018-05-13 18:28:37.103184", "msg": "non-zero return code", "rc": 1, "start": "2018-05-13 18:28:37.082242", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.140.191:2379: getsockopt: connection refused\n; error #1: dial tcp 192.168.140.192:2379: getsockopt: no route to host\n; error #2: dial tcp 192.168.140.193:2379: getsockopt: no route to host", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.140.191:2379: getsockopt: connection refused", "; error #1: dial tcp 192.168.140.192:2379: getsockopt: no route to host", "; error #2: dial tcp 192.168.140.193:2379: getsockopt: no route to host"], "stdout": "", "stdout_lines": []}

and:

fatal: [node2]: FAILED! => {"attempts": 10, "changed": false, "content": "", "msg": "Status code was -1 and not [200]: Request failed: <urlopen error ('_ssl.c:563: The handshake operation timed out',)>", "redirected": false, "status": -1, "url": "https://192.168.140.192:2379/health"}

Anything else we need to know:

Firewall disabled, ssh access + root priv work for ansible, sudo swapoff -a


Most helpful comment

Run on master nodes:

firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload

Run on all nodes:

firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp

By the way, SELinux works fine; I did not have to make any adjustments or disable it.

All 65 comments

It's fixed in this PR #2577

I am facing the same issue:

TASK [etcd : Configure | Check if etcd cluster is healthy] ********************************
Thursday 17 May 2018 13:53:52 +0000 (0:00:00.508) 0:05:09.972 ***
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node2]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.027081", "end": "2018-05-17 13:54:35.677364", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:29.650283", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node3]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.021069", "end": "2018-05-17 13:54:51.668894", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:45.647825", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.035036", "end": "2018-05-17 13:54:52.431413", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:46.396377", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *********************************************
to retry, use: --limit @/etc/ansible/roles/kubespray/cluster.retry

PLAY RECAP *************************************************
localhost : ok=2 changed=0 unreachable=0 failed=0
node1 : ok=177 changed=11 unreachable=0 failed=1
node2 : ok=173 changed=11 unreachable=0 failed=1
node3 : ok=173 changed=11 unreachable=0 failed=1
node4 : ok=152 changed=9 unreachable=0 failed=0
node5 : ok=149 changed=9 unreachable=0 failed=0
node6 : ok=149 changed=9 unreachable=0 failed=0
node7 : ok=149 changed=9 unreachable=0 failed=0
node8 : ok=149 changed=9 unreachable=0 failed=0
node9 : ok=149 changed=9 unreachable=0 failed=0

Thursday 17 May 2018 13:54:52 +0000 (0:01:00.404) 0:06:10.376 **

etcd : Configure | Check if etcd cluster is healthy -------------------------------------------------------------------------------------------------- 60.40s
gather facts from all instances ---------------------------------------------------------------------------------------------------------------------- 18.74s
kubernetes/preinstall : Install packages requirements ------------------------------------------------------------------------------------------------ 17.78s
kubernetes/preinstall : install growpart ------------------------------------------------------------------------------------------------------------- 14.13s
download : Download items ----------------------------------------------------------------------------------------------------------------------------- 8.41s
download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 7.29s
etcd : Configure | Check if etcd cluster is healthy --------------------------------------------------------------------------------------------------- 6.78s
download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.79s
download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.70s
download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.63s
download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.61s
download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.60s
download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.46s
docker : Ensure old versions of Docker are not installed. | RedHat ------------------------------------------------------------------------------------ 5.24s
etcd : Configure | Check if etcd-events cluster is healthy -------------------------------------------------------------------------------------------- 4.71s
kubernetes/preinstall : Hosts | populate inventory into hosts file ------------------------------------------------------------------------------------ 4.21s
download : container_download | Download containers if pull is required or told to always pull (all nodes) -------------------------------------------- 3.82s
docker : ensure docker packages are installed --------------------------------------------------------------------------------------------------------- 3.64s
kubernetes/preinstall : Update package management cache (YUM) - Redhat -------------------------------------------------------------------------------- 3.29s
kubernetes/preinstall : Create kubernetes directories ------------------------------------------------------------------------------------------------- 2.82s

RHEL 7.2 ansible node
ansible==2.4.2.0
RHEL 7.5 Master and agent node

Same issue with Ubuntu 16.04.4
Ansible 2.5.1

Same issue using Vagrant with default configuration on current master

Vagrant 2.1.1
ansible 2.5.3

Had exactly same issue here,
using Centos7 3.10.0-693.21.1.el7.x86_64
ansible 2.5.4

Well, your cluster is actually up and running; it's just the health check that is failing. Here is my workaround.

You can manually check your cluster by providing the cert and key values.

SSH to one of the nodes and:

etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health

You should get a successful check for all cluster members.

Just remove or comment out these tasks:

Configure | Check if etcd cluster is healthy
Configure | Check if etcd-events cluster is healthy

The same health check later fails in the playbook at Calico | wait for etcd.

You can also check that by doing

curl --cert /etc/ssl/etcd/ssl/member-node1.pem --key /etc/ssl/etcd/ssl/member-node1-key.pem https://127.0.0.1:2379/health

So also remove or comment out the playbook task:

Calico | wait for etcd

Hope this gets fixed soon; I wasted a lot of time figuring this out.

The checks:

Configure | Check if etcd cluster is healthy
Configure | Check if etcd-events cluster is healthy

should not fail if the cluster is healthy and the certificates are present. Removing the checks is not a solution at all.

After investigating this, the only way I could replicate the issue in my case was with incorrect no_proxy env settings and an http_proxy var in /etc/environment.

I removed http_proxy from /etc/environment and fixed the no_proxy environment.

example:

no_proxy: "localhost,127.0.0.1,.local.domain,10.3.0.1,10.3.0.2,10.3.0.4,10.3.0.5,10.3.0.6" # no_proxy for subnets is ignored

You must have all your host IPs in no_proxy when using a proxy.

This was my case; I don't know if it is yours.
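As an editor's illustration (not part of the original comment), one way to avoid missing a host is to build the no_proxy value straight from the inventory. The temp file path and the awk one-liner below are assumptions based on the hosts.ini format shown earlier in this thread:

```shell
# Sketch: derive no_proxy from the inventory's ip= entries, since
# subnet/CIDR entries in no_proxy are reportedly ignored.
cat > /tmp/hosts.ini <<'EOF'
[all]
node1    ansible_host=192.168.140.191 ip=192.168.140.191
node2    ansible_host=192.168.140.192 ip=192.168.140.192
node3    ansible_host=192.168.140.193 ip=192.168.140.193
EOF

# Extract every ip=... value and join them with commas.
ips=$(awk -F'ip=' '/ip=/{print $2}' /tmp/hosts.ini | paste -sd, -)
echo "no_proxy=localhost,127.0.0.1,$ips"
```

This prints a no_proxy line covering localhost plus every inventory host, which you can then place in /etc/environment.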

There is also a strange empty when: condition here that I removed in my tests:

https://github.com/kubernetes-incubator/kubespray/blob/master/roles/etcd/tasks/main.yml#L9

- include_tasks: "gen_certs_{{ cert_management }}.yml"
  when:
  tags:
    - etcd-secrets

Just to update: the above error is probably due to a firewalld issue.
On a dev environment, just stop and disable the firewalld service.
In production, open all the relevant ports (2379, 2380, etc.).

running on Centos 7 Linux 4.17.3-1.el7.elrepo.x86_64 x86_64

I am having the same issue. Disabled firewall and it does not help. Running on CentOS 7

Actually, with firewalld disabled it seems to be starting to work. Does anyone know the full list of ports I need to open?

Run on master nodes:

firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload

Run on all nodes:

firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp

By the way, SELinux works fine; I did not have to make any adjustments or disable it.
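After opening the ports, a small reachability probe can confirm a port is actually open from a peer. This is an editor's sketch, not from the original comment; it uses bash's /dev/tcp and, since no etcd is available here, is demonstrated against a throwaway local listener on made-up ports 12379/12380 (real etcd uses 2379/2380):

```shell
# probe HOST PORT -> prints "HOST:PORT open" or "HOST:PORT closed".
probe() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null \
    && echo "$1:$2 open" || echo "$1:$2 closed"
}

# Stand-in for an etcd endpoint: a local HTTP listener on 12379.
python3 -m http.server 12379 --bind 127.0.0.1 >/dev/null 2>&1 &
srv=$!
sleep 1

probe 127.0.0.1 12379
probe 127.0.0.1 12380

kill $srv 2>/dev/null
```

In a real check you would run probe against each node's 2379 (client) and 2380 (peer) ports from another node; "closed" on any of them points at the firewall or security group rather than etcd itself.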

Same issue here
With Centos 7

fatal: [node3]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.5.70:2379,https://192.168.5.71:2379,https://192.168.5.72:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.239063", "end": "2018-07-16 15:27:23.188711", "msg": "non-zero return code", "rc": 1, "start": "2018-07-16 15:27:20.949648", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate has expired or is not yet valid\n; error #1: x509: certificate has expired or is not yet valid\n; error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout\n\nerror #0: x509: certificate has expired or is not yet valid\nerror #1: x509: certificate has expired or is not yet valid\nerror #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate has expired or is not yet valid", "; error #1: x509: certificate has expired or is not yet valid", "; error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout", "", "error #0: x509: certificate has expired or is not yet valid", "error #1: x509: certificate has expired or is not yet valid", "error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

my configuration

> [all]
> node1    ansible_host=192.168.5.70 ip=192.168.5.70
> node2    ansible_host=192.168.5.71 ip=192.168.5.71
> node3    ansible_host=192.168.5.72 ip=192.168.5.72
>
> [kube-master]
> node1
>
> [kube-node]
> node2
> node3
>
> [etcd]
> node1
> node2
> node3
>
> [k8s-cluster:children]
> kube-node
> kube-master
>
> [calico-rr]
>
> [vault]
> node1
> node2
> node3

I think I might have the same issue and can't figure out why.
I get both the connection error and a complaint about the CA certs being self-signed.

There is a task:

- name: Gen_certs | update ca-certificates (Debian/Ubuntu/Container Linux by CoreOS)
  command: update-ca-certificates
  when: etcd_ca_cert.changed

Not sure it's working as expected. It succeeds, but there is no update-ca-certificates script on my installation (CoreOS7).

So I'm also stuck waiting for the etcd health status to check out OK.

I will try the workaround of disabling the check task for now. I noticed that update-ca-certificates is part of the overlay filesystem of the etcd container. Should that task really be run on the node?

This issue was stopping deployment in 2.6.0, but with 2.7.0 my Ubuntu 18.04 cluster gets deployed. However, there is still an etcd health check failing (it is ignored). As per @ArieLevs, I can confirm that running the etcdctl check with the certs on the command line works. I think the root cause of this error is NOT a firewall issue (although that has the same symptoms); it is a self-signed cert error. If you run etcdctl in debug mode without the certs, it complains: error #0: remote error: tls: bad certificate

The offending check is in file kubespray/roles/etcd/tasks/configure.yml as follows:

- name: Configure | Check if etcd cluster is healthy
  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} cluster-health | grep -q 'cluster is healthy'"
  register: etcd_cluster_is_healthy
  until: etcd_cluster_is_healthy.rc == 0
  retries: 4
  delay: "{{ retry_stagger | random + 3 }}"
  ignore_errors: false
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"

I believe the environment variables here are not respected by etcdctl. A better way to do this (and one that works) is in the Calico configuration, where the certs are passed in explicitly, as follows:

- name: Calico | wait for etcd
  uri:
    url: "{{ etcd_access_addresses.split(',') | first }}/health"
    validate_certs: no
    client_cert: "{{ etcd_cert_dir }}/node-{{ inventory_hostname }}.pem"
    client_key: "{{ etcd_cert_dir }}/node-{{ inventory_hostname }}-key.pem"
  register: result
  until: result.status == 200 or result.status == 401
  retries: 10
  delay: 5
  run_once: true

Alternatively, if you want to feed the CLI arguments to the shell task:

- name: Configure | Check if etcd cluster is healthy
  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} --cert-file {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key-file {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem cluster-health | grep -q 'cluster is healthy'"
  register: etcd_cluster_is_healthy
  ignore_errors: true
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"

I'm having the same issue with default vars using vagrant.

I've tried to verify etcd cluster health with the admin/member certificates and still get the request-exceeded error.
Is there any progress on this?

Same issue and behavior with Ubuntu 16.04.
Ansible version 2.6.6

$ etcdctl --debug cluster-health
Cluster-Endpoints: http://127.0.0.1:4001, http://127.0.0.1:2379
cURL Command: curl -X GET http://127.0.0.1:4001/v2/members
cURL Command: curl -X GET http://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
; error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout

error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout

I have disabled ufw but no luck. As mentioned above, when I edit my inventory.cfg or hosts.ini file to use only one etcd node, it does not work either.

My understanding from this weird behavior and from debugging it is the following:

  • This can happen when the etcd node addresses ('endpoints') are not published or are incorrect (curling the endpoints gives the same issue).
  • Basically, this could happen because etcd2 runs inside the docker container without explicitly exposing its ports (4001, 2380, and 2379) to the host OS, which is normal since the container is started with the --net host option so that it runs on the host network.
  • When you stop etcd as a service or rm the running etcd container, after some seconds the etcd cluster reports healthy and available, but then it goes back to the same issue.
  • Try to refrain from using the --no-sync option. Example: etcdctl --no-sync --endpoint http://ip:2379 set /hello world.
  • Kill the etcd process and try to run it manually using etcd2 (get the token from https://discovery.etcd.io/new?size=1):
kill -9 "$(ps aux | grep etcd | grep -v grep | sed 's/^[^ ][^ ]*[ ][ ]*\([0-9][0-9]*\).*$/\1/g')"
etcd2 --name infra1 --initial-advertise-peer-urls http://10.0.0.101:2380 \
  --listen-peer-urls http://IP:2380 \
  --listen-client-urls http://IP:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://IP:2379 \
  --discovery https://discovery.etcd.io/<token>

etcdctl --debug cluster-health
  • I also tried enabling the firewall again and accepting traffic on the etcd ports:
    iptables -I INPUT -p tcp -m tcp --dport 2379 -j ACCEPT && iptables -I INPUT -p tcp -m tcp --dport 2380 -j ACCEPT

Finally, I had to do this: delete the old etcd docker image and gcr.io/google_containers/cluster-proportional-autoscaler-amd64 (to prevent k8s from pulling the old etcd image back), then run the docker image manually, including the SSL certificate paths, and change the etcd container's behavior so that it runs without the --net host option, gets an IP from the docker0 interface, and exposes the needed ports.

docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
 --name etcd quay.io/coreos/etcd:v2.3.8 \
 -name etcd0 \
 -advertise-client-urls http://IP:2379,http://IP:4001 \
 -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
 -initial-advertise-peer-urls http://IP:2380 \
 -listen-peer-urls http://0.0.0.0:2380 \
 -initial-cluster-token etcd-cluster-1 \
 -initial-cluster etcd0=http://IP1:2380,etcd1=http://IP2:2380,etcd2=http://IP3:2380 \
 -initial-cluster-state new

I am still looking for any workaround/solution for this.

I have seen the very same error :-( It seems that the IP used by etcd is not always appropriate. My deployment was on OpenStack + CoreOS with 1 master and 2 nodes (a pretty plain and basic setup). I found that when the nodes were exposed on public IPs (all 3 nodes had a floating_ip_address associated) while also having internal IPs (from the internal subnet/LAN), etcd was configured to use the external/floating IP. Unfortunately, that IP is not present on the hosts, and it does not even make sense to use an IP behind the router for an etcd cluster (of 1 node).

The above failed every time. When I switched the setup to 1 bastion, 1 master, and 2 nodes (with neither master nor nodes having a floating IP associated), then after a little fiddling with inventory/sample/no-floating.yml, moving it into the correct inventory/$CLUSTER/ directory, and running both terraform and ansible from the root of the kubespray git repo... magic happened and the cluster was up and running without any further issue.

To conclude, it would be nice to have an automated test for an OpenStack deployment with a working setup. Even the howto guide should be slightly updated to reflect the actual steps to be done.

To-Do: fix the deployment to work with DNS, w/o bastion and name-based certs (FreeIPA cert-monger would be nice?)

Eventually, I could create a pull request?

@PexMor hitting the same issue, OpenStack + Ubuntu, please take a look at #2606, those changes were approved but not merged

I also have the etcd health task failing. But what is weird: if I run the task manually (after the playbook is done), it works perfectly (setting the env vars and calling cluster-health).

As @ArieLevs said etcd seems healthy.

In my case I ran this successfully only after checking out the release-2.8 branch instead of using master.
I used the defaults and modified only hosts.ini.

Note that the configuration was exactly the same when I tried release-2.8 and master.

Errors that disappeared:

fatal: [k8s-1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://172.17.8.101:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:00.009745", "end": "2019-02-02 16:15:22.366223", "msg": "non-zero return code", "rc": 1, "start": "2019-02-02 16:15:22.356478", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused\n\nerror #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused", "", "error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring

The error above was ignored and the build continued.

And at the end:

fatal: [k8s-1]: FAILED! => {"msg": "The conditional check 'kube_token_auth' failed. The error was: error while evaluating conditional (kube_token_auth): 'kube_token_auth' is undefined\n\nThe error appears to have been in '/Users/music/Documents/git/kubespray/roles/kubernetes/tokens/tasks/check-tokens.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: \"Check_tokens | check if the tokens have already been generated on first master\"\n  ^ here\n"}

Ansible 2.6.0
Vagrant 2.0.4
VirtualBox 5.2.26
Ubuntu 18.04

Ran Kubespray from Mac OS El Capitan

> I have seen the very same error :-( It seems that the IP to be used by the etcd is not always appropriate for it. My deployment was on OpenStack+CoreOS using 1 master 2 nodes (pretty plain and basic setup). I have found that while having those exposed to public IP (all 3 nodes have had a floating_ip_address associated) and at the same time having internal IPs (from the internal subnet/lan) then the etcd is configured to use external/floating IP. Unfortunately, such IP is not present at the hosts and does not even make any sense to have such IP behind the router for etcd cluster (of 1 node).

This happened for me, but opening the security group to allow traffic to port 2379 from "everywhere" (laziness) on the master made it possible for it to connect to itself via the floating IP, and the playbook could complete.

Seems to me that the solution is to either not use the floating IP or make sure that the security group allows access to it.

I ran into the same problem when trying to run kubespray against 3 bare metal Centos 7.6 servers.

It turns out that I had not set up the bare metal servers properly: the system time was not correct across the three machines. So kubespray was generating certificates whose start time was later than the system time on 2 of my 3 machines.

I solved this by installing chronyd and starting it on each machine to set the correct time. I could also have installed ntpd.
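To illustrate that failure mode (an editor's sketch, not from the original comment): generate a throwaway self-signed cert and compare its notBefore timestamp against the local clock. On a node whose clock lags the machine that issued the certs, the start date lands in the future and TLS validation fails with "not yet valid":

```shell
# Create a throwaway self-signed cert, then check whether its
# notBefore date is in the future relative to the local clock.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=etcd-test" \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" 2>/dev/null

not_before=$(openssl x509 -noout -startdate -in "$tmp/cert.pem" | cut -d= -f2)
start_epoch=$(date -d "$not_before" +%s)   # GNU date
now_epoch=$(date +%s)

if [ "$start_epoch" -le "$now_epoch" ]; then
  echo "clock OK: cert notBefore is not in the future"
else
  echo "clock SKEW: cert notBefore is ahead of system time"
fi
rm -rf "$tmp"
```

The same openssl x509 -noout -dates check can be pointed at the certs kubespray generated under /etc/ssl/etcd/ssl/ on each node to confirm whether clock skew is the culprit.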

I hit the same issue: etcd runs only on the master/controller node and is not running on the other nodes. There is no firewall issue; it is not even running. RHEL 7.5 in AWS with no firewalld/iptables.

fatal: [machine-01]: FAILED! => {
"attempts": 4,
"changed": false,
"cmd": "/usr/local/bin/etcdctl --endpoints=https://10.14.5.141:2379,https://10.14.6.49:2379,https://10.14.7.118:2379 cluster-health | grep -q 'cluster is healthy'",
"delta": "0:00:02.018172",
"end": "2019-02-15 17:22:52.241082",
"invocation": {
"module_args": {
"_raw_params": "/usr/local/bin/etcdctl --endpoints=https://10.14.5.141:2379,https://10.14.6.49:2379,https://10.14.7.118:2379 cluster-health | grep -q 'cluster is healthy'",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": null,
"removes": null,
"stdin": null,
"warn": true
}
},
"msg": "non-zero return code",
"rc": 1,
"start": "2019-02-15 17:22:50.222910",
"stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout\n; error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused\n; error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused\n\nerror #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout\nerror #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused\nerror #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused",
"stderr_lines": [
"Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout",
"; error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused",
"; error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused",
"",
"error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout",
"error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused",
"error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused"
],
"stdout": "",
"stdout_lines": []
}

Hey guys,

A possible workaround for this issue is to flush iptables (`iptables -F`); this worked for me.

setup:
CentOS Linux release 7.6.1810 (Core)
kubespray commit: a8dd69cf (git rev-parse --short HEAD)
cni: canal

I am having the same issue with the latest release (v2.9.0) on Ubuntu 16.04, with the firewall disabled on my machine. Did anyone resolve this issue?

Have you tried flushing your iptables?

Hi vterry, yes I have flushed the iptables and I am still seeing these errors in the following 2 places:

TASK [etcd : Configure | Check if member is in etcd cluster] *****
Wednesday 10 April 2019 14:32:04 -0400 (0:00:00.118) 0:02:17.215 *
fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,https://192.168.19.249:2379 member list | grep -q 192.168.19.247", "delta": "0:00:00.029426", "end": "2019-04-10 14:32:04.720756", "msg": "non-zero return code", "rc": 1, "start": "2019-04-10 14:32:04.691330", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

After the above error it continues the playbook, but it fails at this point:

TASK [etcd : Join Member | Add member to etcd cluster] *******
Wednesday 10 April 2019 14:32:07 -0400 (0:00:00.200) 0:02:20.150 *
FAILED - RETRYING: Join Member | Add member to etcd cluster (4 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (3 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (2 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 4, "changed": true, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,https://192.168.19.249:2379 member add etcd1 https://192.168.19.247:2380", "delta": "0:00:02.045849", "end": "2019-04-10 14:32:38.279139", "msg": "non-zero return code", "rc": 1, "start": "2019-04-10 14:32:36.233290", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.19.250:4001: getsockopt: connection refused\n; error #1: client: etcd member https://192.168.19.248:2379 has no leader\n; error #2: dial tcp 192.168.19.250:2379: getsockopt: connection refused\n; error #3: client: etcd member https://192.168.19.249:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.19.250:4001: getsockopt: connection refused", "; error #1: client: etcd member https://192.168.19.248:2379 has no leader", "; error #2: dial tcp 192.168.19.250:2379: getsockopt: connection refused", "; error #3: client: etcd member https://192.168.19.249:2379 has no leader"], "stdout": "", "stdout_lines": []}
And the playbook just stopped after this error.
I am not sure how to debug this issue further :(

Can you share your hosts.ini and your all.yml?

Hi vterry,
I am pasting my hosts.ini, inventory.ini and all.yml, as it does not allow me to attach the files. If you can share your email I can also send them as attachments.

I am using inventory.ini because if I use the hosts.ini I get a parse error:
-2.9.0/kubespray-2.9.0/inventory/mycluster/hosts.ini:4: Expected key=value host variable assignment, got: 192.168.19.247

File hosts.ini
all:
  hosts:
    node1:
      access_ip: 192.168.19.247
      ip: 192.168.19.247
      ansible_host: 192.168.19.247
    node2:
      access_ip: 192.168.19.248
      ip: 192.168.19.248
      ansible_host: 192.168.19.248
    node3:
      access_ip: 192.168.19.249
      ip: 192.168.19.249
      ansible_host: 192.168.19.249
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node3:
        node1:
        node2:
    etcd:
      hosts:
        node3:
        node1:
        node2:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

File inventory.ini

## Configure 'ip' variable to bind kubernetes services on a

## different ip than the default iface

## We should set etcd_member_name for etcd cluster. The node that is not a etcd member do not need to set the value, or can set the empty string value.

[all]
node1 ansible_host=192.168.19.247 ip=192.168.19.247 etcd_member_name=etcd1
node2 ansible_host=192.168.19.248 ip=192.168.19.248 etcd_member_name=etcd2
node3 ansible_host=192.168.19.249 ip=192.168.19.249 etcd_member_name=etcd3

# node4 ansible_host=95.54.0.15 # ip=10.3.0.4 etcd_member_name=etcd4
# node5 ansible_host=95.54.0.16 # ip=10.3.0.5 etcd_member_name=etcd5
# node6 ansible_host=95.54.0.17 # ip=10.3.0.6 etcd_member_name=etcd6

## configure a bastion host if your nodes are not directly reachable
# bastion ansible_host=x.x.x.x ansible_user=some_user

[kube-master]
node1
node2

[etcd]
node1
node2
node3

[kube-node]
node2
node3

[k8s-cluster:children]
kube-master
kube-node

File all.yml


## Directory where etcd data stored
etcd_data_dir: /var/lib/etcd

## Directory where the binaries will be installed
bin_dir: /usr/local/bin

## The access_ip variable is used to define how other nodes should access
## the node. This is used in flannel to allow other flannel nodes to see
## this node for example. The access_ip is really useful AWS and Google
## environments where the nodes are accessed remotely by the "public" ip,
## but don't know about that address themselves.
access_ip: 1.1.1.1

## External LB example config
apiserver_loadbalancer_domain_name: "elb.some.domain"
loadbalancer_apiserver:
  address: 1.2.3.4
  port: 1234

## Internal loadbalancers for apiservers
loadbalancer_apiserver_localhost: true

## Local loadbalancer should use this port
## And must be set port 6443
nginx_kube_apiserver_port: 6443

## If nginx_kube_apiserver_healthcheck_port variable defined, enables proxy liveness check.
nginx_kube_apiserver_healthcheck_port: 8081

## OTHER OPTIONAL VARIABLES

## For some things, kubelet needs to load kernel modules. For example, dynamic kernel services are needed
## for mounting persistent volumes into containers. These may not be loaded by preinstall kubernetes
## processes. For example, ceph and rbd backed volumes. Set to true to allow kubelet to load kernel
## modules.
kubelet_load_modules: false

## Upstream dns servers
upstream_dns_servers:
  - 8.8.8.8
  - 8.8.4.4

## There are some changes specific to the cloud providers
## for instance we need to encapsulate packets with some network plugins
## If set the possible values are either 'gce', 'aws', 'azure', 'openstack', 'vsphere', 'oci', or 'external'
## When openstack is used make sure to source in the openstack credentials
## like you would do when using openstack-client before starting the playbook.
## Note: The 'external' cloud provider is not supported.
## TODO(riverzhang): https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager
cloud_provider:

## Set these proxy values in order to update package manager and docker daemon to use proxies
http_proxy: ""
https_proxy: ""

## Refer to roles/kubespray-defaults/defaults/main.yml before modifying no_proxy
no_proxy: ""

## Some problems may occur when downloading files over https proxy due to ansible bug
## https://github.com/ansible/ansible/issues/32750. Set this variable to False to disable
## SSL validation of get_url module. Note that kubespray will still be performing checksum validation.
download_validate_certs: False

## If you need exclude all cluster nodes from proxy and other resources, add other resources here.
additional_no_proxy: ""

## Certificate Management
## This setting determines whether certs are generated via scripts.
## Chose 'none' if you provide your own certificates.
## Option is "script", "none"
## note: vault is removed
cert_management: script

## Set to true to allow pre-checks to fail and continue deployment
ignore_assert_errors: false

## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable.
kube_read_only_port: 10255

## Set true to download and cache container
download_container: true

## Deploy container engine
## Set false if you want to deploy container engine manually.
deploy_container_engine: true

## Set Pypi repo and cert accordingly
pyrepo_index: https://pypi.example.com/simple
pyrepo_cert: /etc/ssl/certs/ca-certificates.crt

ansible_user: tmp1
ansible_password: password
ansible_become_pass: password

Has anyone, found a fix for this issue?

I tried deploying Kubernetes in combination with WireGuard. It just didn't work. After some deeper digging, I found out that ip (the public IP) instead of access_ip (the private WireGuard IP) was used as the listening address for etcd.

This commit in my fork fixed it for me: https://github.com/bagbag/kubespray/commit/209eb8a5118bd61a178cd08b7d802100dfd4e32e

@markpenner34
I've just deployed a 3-node cluster on CentOS 7 with kernel 5.1.3-1, Ansible 2.8.0, using the latest Kubespray repo (from a week ago) with SELinux on, and firewalld on with these rules:

execute on master nodes:

firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload

execute on all nodes:

firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload

If installing Calico open these ports on all nodes:

firewall-cmd --permanent --add-port=179/tcp
firewall-cmd --permanent --add-port=5473/tcp
firewall-cmd --permanent --add-port=4789/udp
firewall-cmd --reload

and it all went perfectly fine.
what is the error you are getting? (please don't bomb with a really long log)

Hi @ArieLevs, I don't have firewalld installed on the servers. Running Ubuntu 16.04, the latest version of Kubespray, and Ansible 2.7.10.

It's failing in the etcd role's configure tasks, specifically the health checks.

error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused

error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"

Any help would be appreciated.

@markpenner34 I've noticed that this etcd issue regarding port 4001 appears to occur on Ubuntu (per the etcd documentation, port 4001 is legacy and should not be used).

What happens if you ssh to node1 and execute

etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health

Try ports 4001 and 2379 (the certificate file paths may be different on Ubuntu, as this command was executed on Centos, change to relevant paths if needed)

btw, the response of \x15\x03\x01\x00\x02\x02 means a non https request
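
Those bytes can be decoded mechanically as a raw TLS record header (type, protocol version, payload length, payload), which is why a plaintext HTTP client reports them as a "malformed HTTP response": the server answered with a TLS alert. A sketch:

```python
# Decode the "malformed HTTP response" bytes from the error above
# as TLS record framing: type (1 byte), version (2 bytes), length (2 bytes).
record = bytes.fromhex("150301000202")

content_type = record[0]                     # 0x15 = Alert record
version = (record[1], record[2])             # (3, 1) = TLS 1.0 record version
length = int.from_bytes(record[3:5], "big")  # 2-byte alert payload
alert_level = record[5]                      # 0x02 = fatal

print(content_type == 0x15, version == (3, 1), length == 2, alert_level == 2)
# prints: True True True True
```

In other words, the client spoke plain HTTP to an endpoint that expected TLS.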

@ArieLevs

node1 is not an etcd node. This is my cluster.yml, is this correct?

https://pastebin.com/403G71pC

When I run sudo lsof -i:2379 on the etcd nodes, I can see that there are no ports listening.

However, when I run it inside the Docker container running etcd, I can see the ports are listening correctly.
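
A plain TCP connect reproduces this check without lsof; the "connection refused" in the playbook output corresponds exactly to this probe failing. A small sketch (hypothetical helper):

```python
import socket

def port_open(host, port, timeout=1.0):
    # True when something accepts TCP connections on host:port;
    # "dial tcp ...: connection refused" from etcdctl maps to this returning False
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("127.0.0.1", 2379))
```

If this returns False on the host but True inside the container, etcd is listening only on the container's network namespace, not on the host IP the playbook probes.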

@markpenner34
The config files look different for me, i use the official from https://github.com/kubernetes-sigs/kubespray#usage

So my inventory.ini only contains the following (everything else is commented out):

[k8s-cluster:children]
kube-master
kube-node

And the node information is declared in the hosts.yml file.
I'm sorry I cannot assist much more, as I've never deployed k8s (using Kubespray) on Ubuntu.

same issue on:

  • Ubuntu 18.04
  • Kubespray 1.13.5
  • ansible 2.5.1
    we are also using kube-proxy, which uses ipvs to set up the network connections between nodes...

Hi, we are trying to create an etcd cluster but we are facing the following error:

Error: client: etcd cluster is unavailable or misconfigured
error #0: dial tcp 192.168.2.139:2379: getsockopt: connection refused

please help us to get out of this

thanks in advance

We are following the link below to set up Kubernetes on bare metal:

https://medium.com/faun/configuring-ha-kubernetes-cluster-on-bare-metal-servers-with-kubeadm-1-2-1e79f0f7857b

My test as follows:

  1. Ubuntu host using Vagrant with kubespray master branch.
  2. The captioned issue resulted:
    fatal: [k8s-1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://172.17.8.101:2379,https://172.17.8.102:2379,https://172.17.8.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.027821", "end": "2019-07-23 00:28:29.583827", "msg": "non-zero return code", "rc": 1, "start": "2019-07-23 00:28:23.556006", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout\n; error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout\n; error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout\n\nerror #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout\nerror #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout\nerror #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout", "; error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout", "; error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout", "", "error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout", "error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout", "error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

  3. VMs have 2 network interfaces: eth0 for the public network and eth1 for the private network. The issue is fixed if access_ip is assigned the public network IP and access_ip is used instead of ip as the etcd_address.

ok: [k8s-1] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.715343", "end": "2019-07-23 01:28:09.521148", "rc": 0, "start": "2019-07-23 01:28:04.805805", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
ok: [k8s-2] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:01.726660", "end": "2019-07-23 01:28:09.588888", "rc": 0, "start": "2019-07-23 01:28:07.862228", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
ok: [k8s-3] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.637851", "end": "2019-07-23 01:28:09.587249", "rc": 0, "start": "2019-07-23 01:28:04.949398", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

Is this a proper fix?

I'm also experiencing this problem while using kubespray to try to deploy a 2-node Kubernetes cluster on OpenStack instances running Ubuntu 18.04.

How to reproduce:

  • create 2 OpenStack instances running Ubuntu 18.04

  • follow the instructions in Kubespray's Quick Start section, setting ip to the node's private IP and access_ip to the node's floating IP, and also setting the node's ansible_user.

  • run ansible-playbook as stated in the Quick Start guide.

Here are the contents of ./inventory/mycluster/hosts.yml:

all:
  hosts:
    node1:
      ansible_user: myuser
      ansible_host: 185.178.87.56
      ip: 192.168.0.8
      access_ip: 185.178.87.56
    node2:
      ansible_user: myuser
      ansible_host: 185.178.87.47
      ip: 192.168.0.9
      access_ip: 185.178.87.47
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node1:
        node2:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Result:

TASK [etcd : Configure | Check if etcd cluster is healthy] **************************************************************************************************************************************************************
Thursday 01 August 2019  15:47:02 +0100 (0:00:00.023)       0:02:44.977 ******* 
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://185.178.87.56:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.018130", "end": "2019-08-01 14:47:30.642472", "msg": "non-zero return code", "rc": 1, "start": "2019-08-01 14:47:28.624342", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout\n\nerror #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout", "", "error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *****************************************************************************************************************************
    to retry, use: --limit @/home/rmam/development/CORDS/other/creodias_kubespray/kubespray/cluster.retry

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0   
node1                      : ok=462  changed=12   unreachable=0    failed=1   
node2                      : ok=312  changed=9    unreachable=0    failed=0 

For my test with the vagrant libvirt provider, the problem turned out to be that IP address 172.17.8.1 of the private (virtual) network is occasionally used as the source IP in the TLS handshake, instead of the host IPs 172.17.8.10x of the etcd cluster nodes.

<network ipv6='yes'>
  <name>kubespray0</name>
  <uuid>a502bbbb-7118-4e4a-8443-7ae1195dc93d</uuid>
  <forward mode='nat'/>
  <bridge name='virbr2' stp='on' delay='0'/>
  <mac address='52:54:00:43:3f:ac'/>
  <ip address='172.17.8.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='172.17.8.1' end='172.17.8.254'/>
    </dhcp>
  </ip>
</network>

The workaround in this case is to add the relevant IP to the following setting:

etcd_cert_alt_ips: [172.17.8.1]
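
The TLS failure behind this workaround can be stated mechanically: the client compares the contacted IP against the certificate's IP SANs and rejects the handshake on a mismatch. A toy sketch (the SAN list here is illustrative, not read from a real cert):

```python
def ip_allowed_by_cert(ip, san_ips):
    # A TLS client rejects the handshake when the contacted/verified IP
    # is absent from the certificate's IP Subject Alternative Names
    return ip in san_ips

sans = ["172.17.8.101", "172.17.8.102", "172.17.8.103"]  # per-node SANs before the workaround
print(ip_allowed_by_cert("172.17.8.1", sans))                    # False: bridge IP not covered
print(ip_allowed_by_cert("172.17.8.1", sans + ["172.17.8.1"]))   # True: after etcd_cert_alt_ips
```

Adding the bridge IP via `etcd_cert_alt_ips` simply widens the SAN list so the occasional 172.17.8.1 source passes verification.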

Hi, we are facing the same issue while deploying a 3 master 2 worker Kubernetes cluster on Azure.

Kubernetes Version: 1.15.3
Node Type : Azure VM
OS : CoreOs 1967.6.0
Kubespray : release-2.11
ETCD Version: 3.3.10

Surprisingly, the cluster setup worked fine a few days back with this code.

Any help would be greatly appreciated.

Following is the etcd status on a Node

$ etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member.pem --key-file=/etc/ssl/etcd/ssl/member-key.pem --debug cluster-health
Cluster-Endpoints: https://127.0.0.1:2379
cURL Command: curl -X GET https://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: connect: connection refused

error #0: dial tcp 127.0.0.1:2379: connect: connection refused

etcd gets configured with the wrong IP addresses and keeps crashing:

ETCD_INITIAL_CLUSTER=master01=https://x.y.z.47:2380,master02=https://x.y.z.46:2380,master03=https://x.y.z.45:2380
...

Sep 19 02:37:43 kubemaster01 etcd[12077]: 2019-09-19 02:37:43.056677 C | etcdmain: listen tcp x.y.z.47:2380: bind: cannot assign requested address
Sep 19 02:37:43 kubemaster01 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 02:37:43 kubemaster01 systemd[1]: etcd.service: Failed with result 'exit-code'.

while the IP addresses of the masters are different:

[all]
kubemaster01    ansible_host=x.y.z.44  node_name=kubemaster01 etcd_member_name=master01
kubemaster02    ansible_host=x.y.z.45  node_name=kubemaster02 etcd_member_name=master02
kubemaster03    ansible_host=x.y.z.43  node_name=kubemaster03 etcd_member_name=master03
kubenode01    ansible_host=x.y.z.47      node_name=kubenode01
kubenode02    ansible_host=x.y.z.46      node_name=kubenode02
bastion101 ansible_host=bastion101

[bastion]
bastion101

[master]
kubemaster01
kubemaster02
kubemaster03

[etcd]
kubemaster01
kubemaster02
kubemaster03

[node]
kubenode01
kubenode02


[k8s-cluster:children]
master
node

[kube-master:children]
master

[kube-node:children]
node


[calico-rr]

[vault]
kubemaster01
kubemaster02
kubemaster03
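
The bind failure above can be reproduced in isolation: etcd exits with `cannot assign requested address` when its configured listen IP is not assigned to any local interface, which is exactly what a raw socket bind reports (hypothetical helper):

```python
import socket

def can_bind(ip, port=0):
    # etcd's 'listen tcp x.y.z.47:2380: bind: cannot assign requested address'
    # is this exact failure: the IP is not configured on any local interface
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return True
    except OSError:
        return False
    finally:
        s.close()

print(can_bind("127.0.0.1"))  # loopback is always assignable
```

Here ETCD_INITIAL_CLUSTER pointed master01 at x.y.z.47 (a worker's address), so the bind on kubemaster01 (x.y.z.44) had to fail.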

We found the root cause: the issue was caused by stale Ansible cache files.

The IP of the VM changed when we recreated the VM using Terraform scripts.

Ansible was failing to overwrite its cached JSON files with the new IP information (due to a permission issue),
so the IP information was read from the old cache.

moved this comment to https://github.com/kubernetes-sigs/kubespray/issues/5118#issuecomment-533837327 as i think it is actually that bug and not this one

Please check whether Docker is installed on the Vagrant host. If so, uninstall it and reboot, then try again.

etcd fails to start

After restarting etcd and watching the logs, I found that etcd was listening on port 2379, but the port could not be reached.

Digging deeper, kube-proxy had taken over port 2379, so I suspected someone had created a Service whose NodePort landed on 2379.

iptables-save >a
cat a

Search for 2379 and you can see which NodePort took it.
From kube-proxy's behavior, if a NodePort's endpoints are not up, it writes a `-j REJECT` rule to refuse the traffic.

So even though kube-proxy was not actually bound to 2379, connections to etcd's 2379 were rejected anyway.

Fix:
The idea is to get etcd started, then kill that NodePort Service.

  1. Stop kubelet
  2. Delete all containers
  3. Rename the kube-proxy image
  4. Temporarily point the registry address at something unusable

Steps 3 and 4 are there so kube-proxy cannot start and therefore cannot rewrite iptables.

  5. iptables -F
     This flushes all of kube-proxy's iptables rules
  6. systemctl start kubelet
     Then delete that Service
     and bring kube-proxy back

Thank you, bro =)
Maybe Ansible should run this (add the firewall rules)?
I've spent a lot of time finding the solution. I think Kubespray should do everything I need to install K8s.

Same error. In my case it was because NTP was not synced; I just synced the time on all nodes with "ntpdate time.windows.com". After that, it works.

You can try resetting (clearing) etcd's --data-dir.

@GRomR1
Thank you. This helped a lot. I ran into issues with one of the RHEL clusters (different than my first cluster, where Kubespray just worked fine without any hitches). In the second cluster I hit the no-route/connection-refused issues, which drove me crazy until I saw your post. I manually opened up the ports as you described and that worked. Thanks a lot!

We found the root cause: the issue was caused by stale Ansible cache files.

The IP of the VM changed when we recreated the VM using Terraform scripts.

Ansible was failing to overwrite its cached JSON files with the new IP information (due to a permission issue),
so the IP information was read from the old cache.

Thanks for this! I was rebuilding my cluster with Terraform as well. It turns out that JSON fact caching is enabled in ansible.cfg:

fact_caching = jsonfile
fact_caching_connection = /tmp

https://docs.ansible.com/ansible/latest/plugins/cache/jsonfile.html

The default timeout is 24 hours:
fact_caching_timeout = 86400

If you are re-deploying hosts with the same hostname but a different IP address within the 24-hour timeout, you should expect this error (unless you clear the cache in /tmp).
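
Given that setup, you can drop the stale facts before re-running by deleting the per-host cache files (hostnames here match the inventories in this thread; adjust to yours):

```shell
# jsonfile caching stores one JSON file per inventory hostname under
# fact_caching_connection (/tmp here); removing them forces fresh facts
rm -f /tmp/node1 /tmp/node2 /tmp/node3
```

Alternatively, pass `--flush-cache` to ansible-playbook, as in the original report's command line.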

We found the root cause: the issue was caused by stale Ansible cache files.

The IP of the VM changed when we recreated the VM using Terraform scripts.

Ansible was failing to overwrite its cached JSON files with the new IP information (due to a permission issue),
so the IP information was read from the old cache.

I'm having the same issue on Azure VM with KubeSpray 2.12.3

Are you using weave as net plugin?

I've already cleaned up the caches and the issue is still here. :(

Even if I try from inside the VM:
/usr/local/bin/etcdctl --no-sync --endpoints=https://127.0.0.1:2379 cluster-health | grep -q 'cluster is healthy'
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout

Any ideas?

I have the same problem! Everything works until etcd is tested.

I used the basic Kubespray script from release 2.12.5 (I also tested the master branch and 2.12.4, same issue) on 3 x Ubuntu 18.04 instances on an OpenStack cloud. I use Terraform to create the VMs and a basic security group enabling the following ports:

# Security Group
resource "opentelekomcloud_compute_secgroup_v2" "secgroup" {
  name        = "secgroup"
  description = "Security group for the Terraform example instances"

  rule {
    from_port   = 22
    to_port     = 22
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = 80
    to_port     = 80
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = 8081
    to_port     = 8081
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = 8080
    to_port     = 8080
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = 2397
    to_port     = 2397
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = -1
    to_port     = -1
    ip_protocol = "icmp"
    cidr        = "0.0.0.0/0"
  }
}
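
Note that etcd serves clients on TCP 2379 and peers on TCP 2380, while the group above opens 2397 and neither of those. A sketch of the missing rule in the same resource syntax (the CIDR here is an example; restrict it to your actual cluster subnet rather than 0.0.0.0/0):

```hcl
  rule {
    from_port   = 2379
    to_port     = 2380
    ip_protocol = "tcp"
    cidr        = "192.168.1.0/24"  # example value: the cluster subnet
  }
```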

[Deployment Machine in the same Subnet]
Output of the standard Kubespray deploy playbook via Ansible:

TASK [etcd : Configure | Check if etcd cluster is healthy] *******************************************************************
Saturday 28 March 2020  13:32:03 +0000 (0:00:00.381)       0:07:20.485 ********
fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.012101", "end": "2020-03-28 13:32:09.021030", "msg": "non-zero return code", "rc": 1, "start": "2020-03-28 13:32:05.008929", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.102:2379 exceeded header timeout\n; error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout\n; error #2: dial tcp 192.168.1.101:2379: connect: connection refused\n\nerror #0: client: endpoint https://192.168.1.102:2379 exceeded header timeout\nerror #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout\nerror #2: dial tcp 192.168.1.101:2379: connect: connection refused", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.102:2379 exceeded header timeout", "; error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout", "; error #2: dial tcp 192.168.1.101:2379: connect: connection refused", "", "error #0: client: endpoint https://192.168.1.102:2379 exceeded header timeout", "error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout", "error #2: dial tcp 192.168.1.101:2379: connect: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [node2]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.011963", "end": "2020-03-28 13:32:09.075109", "msg": "non-zero return code", "rc": 1, "start": "2020-03-28 13:32:05.063146", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout\n; error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout\n; error #2: dial tcp 192.168.1.102:2379: connect: connection refused\n\nerror #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout\nerror #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout\nerror #2: dial tcp 192.168.1.102:2379: connect: connection refused", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "; error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout", "; error #2: dial tcp 192.168.1.102:2379: connect: connection refused", "", "error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout", "error #2: dial tcp 192.168.1.102:2379: connect: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [node3]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.012279", "end": "2020-03-28 13:32:09.106577", "msg": "non-zero return code", "rc": 1, "start": "2020-03-28 13:32:05.094298", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.1.103:2379: connect: connection refused\n; error #1: client: endpoint https://192.168.1.101:2379 exceeded header timeout\n; error #2: client: endpoint https://192.168.1.102:2379 exceeded header timeout\n\nerror #0: dial tcp 192.168.1.103:2379: connect: connection refused\nerror #1: client: endpoint https://192.168.1.101:2379 exceeded header timeout\nerror #2: client: endpoint https://192.168.1.102:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.1.103:2379: connect: connection refused", "; error #1: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "; error #2: client: endpoint https://192.168.1.102:2379 exceeded header timeout", "", "error #0: dial tcp 192.168.1.103:2379: connect: connection refused", "error #1: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "error #2: client: endpoint https://192.168.1.102:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
...ignoring

TASK [etcd : Configure | Check if etcd-events cluster is healthy] ************************************************************
Saturday 28 March 2020  13:32:09 +0000 (0:00:05.332)       0:07:25.817 ********

TASK [etcd : include_tasks] **************************************************************************************************
Saturday 28 March 2020  13:32:09 +0000 (0:00:00.144)       0:07:25.962 ********
included: /home/ubuntu/kubespray-2.12.5/roles/etcd/tasks/refresh_config.yml for node1, node2, node3

TASK [etcd : Refresh config | Create etcd config file] ***********************************************************************
Saturday 28 March 2020  13:32:09 +0000 (0:00:00.223)       0:07:26.186 ********
changed: [node1]
changed: [node2]
changed: [node3]

TASK [etcd : Refresh config | Create etcd-events config file] ****************************************************************
Saturday 28 March 2020  13:32:11 +0000 (0:00:01.923)       0:07:28.109 ********

TASK [etcd : Configure | Copy etcd.service systemd file] *********************************************************************
Saturday 28 March 2020  13:32:11 +0000 (0:00:00.143)       0:07:28.253 ********
changed: [node1]
changed: [node2]
changed: [node3]

TASK [etcd : Configure | Copy etcd-events.service systemd file] **************************************************************
Saturday 28 March 2020  13:32:12 +0000 (0:00:00.843)       0:07:29.096 ********

TASK [etcd : Configure | reload systemd] *************************************************************************************
Saturday 28 March 2020  13:32:12 +0000 (0:00:00.150)       0:07:29.246 ********
ok: [node1]
ok: [node2]
ok: [node3]

TASK [etcd : Configure | Ensure etcd is running] *****************************************************************************
Saturday 28 March 2020  13:32:13 +0000 (0:00:00.754)       0:07:30.001 ********
changed: [node1]
changed: [node2]
changed: [node3]

TASK [etcd : Configure | Ensure etcd-events is running] **********************************************************************
Saturday 28 March 2020  13:32:14 +0000 (0:00:00.925)       0:07:30.926 ********

TASK [etcd : Configure | Check if etcd cluster is healthy] *******************************************************************
Saturday 28 March 2020  13:32:14 +0000 (0:00:00.171)       0:07:31.097 ********
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 192.168.1.101]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://192.168.1.101:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.015048", "end": "2020-03-28 13:32:41.288268", "msg": "non-zero return code", "rc": 1, "start": "2020-03-28 13:32:39.273220", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout\n\nerror #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "", "error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ***********************************************************************************************************
        to retry, use: --limit @/home/ubuntu/kubespray-2.12.5/cluster.retry

PLAY RECAP *******************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0
node1                      : ok=477  changed=64   unreachable=0    failed=1
node2                      : ok=432  changed=61   unreachable=0    failed=0
node3                      : ok=363  changed=55   unreachable=0    failed=0

Saturday 28 March 2020  13:32:41 +0000 (0:00:26.878)       0:07:57.976 ********
===============================================================================
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------ 26.88s
container-engine/docker : ensure docker packages are installed ------------------------------------------------------- 24.76s
download : download_container | Download image if required ------------------------------------------------------------ 8.70s
kubernetes/preinstall : Install packages requirements ----------------------------------------------------------------- 8.66s
etcd : Gen_certs | Write etcd master certs ---------------------------------------------------------------------------- 6.99s
download : download_container | Download image if required ------------------------------------------------------------ 6.56s
download : download_container | Download image if required ------------------------------------------------------------ 6.08s
download : download | Download files / images ------------------------------------------------------------------------- 5.63s
download : download_container | Download image if required ------------------------------------------------------------ 5.60s
container-engine/docker : ensure docker-ce repository is enabled ------------------------------------------------------ 5.42s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------- 5.33s
etcd : Gen_certs | Gather etcd master certs --------------------------------------------------------------------------- 5.09s
download : download_container | Download image if required ------------------------------------------------------------ 4.83s
download : download_container | Download image if required ------------------------------------------------------------ 4.74s
download : download_container | Download image if required ------------------------------------------------------------ 4.51s
download : download_file | Download item ------------------------------------------------------------------------------ 4.38s
download : download_container | Download image if required ------------------------------------------------------------ 4.12s
download : download_container | Download image if required ------------------------------------------------------------ 4.10s
download : download_file | Download item ------------------------------------------------------------------------------ 3.98s
download : download | Sync files / images from ansible host to nodes -------------------------------------------------- 3.89s

[node1 master]
Locally on the master node I get the following output for this command:

sudo etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health
Cluster-Endpoints: https://127.0.0.1:2379
cURL Command: curl -X GET https://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout

error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout

[node1 master]
netstat -l produces the following output on the master node:

ubuntu@node1:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 localhost:domain        0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN
tcp        5      0 node1.cluster.loca:2379 0.0.0.0:*               LISTEN
tcp        1      0 localhost:2379          0.0.0.0:*               LISTEN
tcp        0      0 node1.cluster.loca:2380 0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:sunrpc          0.0.0.0:*               LISTEN
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN
tcp6       0      0 [::]:sunrpc             [::]:*                  LISTEN
udp        0      0 localhost:domain        0.0.0.0:*
udp        0      0 node1.cluster.lo:bootpc 0.0.0.0:*
udp        0      0 0.0.0.0:sunrpc          0.0.0.0:*
udp        0      0 0.0.0.0:648             0.0.0.0:*
udp6       0      0 [::]:sunrpc             [::]:*
udp6       0      0 [::]:648                [::]:*
raw6       0      0 [::]:ipv6-icmp          [::]:*                  7
Active UNIX domain sockets (only servers)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  2      [ ACC ]     SEQPACKET  LISTENING     13595    /run/udev/control
unix  2      [ ACC ]     STREAM     LISTENING     65462    /run/user/1000/systemd/private
unix  2      [ ACC ]     STREAM     LISTENING     65466    /run/user/1000/gnupg/S.gpg-agent.ssh
unix  2      [ ACC ]     STREAM     LISTENING     65467    /run/user/1000/gnupg/S.gpg-agent.extra
unix  2      [ ACC ]     STREAM     LISTENING     65468    /run/user/1000/snapd-session-agent.socket
unix  2      [ ACC ]     STREAM     LISTENING     65469    /run/user/1000/gnupg/S.gpg-agent
unix  2      [ ACC ]     STREAM     LISTENING     65470    /run/user/1000/gnupg/S.gpg-agent.browser
unix  2      [ ACC ]     STREAM     LISTENING     65471    /run/user/1000/gnupg/S.dirmngr
unix  2      [ ACC ]     STREAM     LISTENING     18326    /run/acpid.socket
unix  2      [ ACC ]     STREAM     LISTENING     13583    /run/systemd/private
unix  2      [ ACC ]     STREAM     LISTENING     13591    /run/rpcbind.sock
unix  2      [ ACC ]     STREAM     LISTENING     13593    /run/lvm/lvmpolld.socket
unix  2      [ ACC ]     STREAM     LISTENING     18338    /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     13597    /run/systemd/journal/stdout
unix  2      [ ACC ]     STREAM     LISTENING     18346    /run/uuidd/request
unix  2      [ ACC ]     STREAM     LISTENING     18348    /run/snapd.socket
unix  2      [ ACC ]     STREAM     LISTENING     18350    /run/snapd-snap.socket
unix  2      [ ACC ]     STREAM     LISTENING     14190    /run/lvm/lvmetad.socket
unix  2      [ ACC ]     STREAM     LISTENING     14340    /run/systemd/fsck.progress
unix  2      [ ACC ]     STREAM     LISTENING     47817    /var/run/docker.sock
unix  2      [ ACC ]     STREAM     LISTENING     43013    /run/containerd/containerd.sock
unix  2      [ ACC ]     STREAM     LISTENING     47961    /var/run/docker.sock
unix  2      [ ACC ]     STREAM     LISTENING     47983    /var/run/docker/metrics.sock
unix  2      [ ACC ]     STREAM     LISTENING     48022    /run/docker/libnetwork/b668a8da96559ea4e366f5be123f78ca5a44559fa35150e9691e4c88c1ac22be.sock
unix  2      [ ACC ]     STREAM     LISTENING     60550    @/containerd-shim/moby/37e72223941a700e6b07933021dc1cb40ffa96e065ab83fa4a5597af9bff8039/shim.sock@
unix  2      [ ACC ]     STREAM     LISTENING     18352    @ISCSIADM_ABSTRACT_NAMESPACE
unix  2      [ ACC ]     STREAM     LISTENING     18353    /var/lib/lxd/unix.socket

[node1 master]
And the etcd service data:

ubuntu@node1:~$ systemctl cat etcd.service
# /etc/systemd/system/etcd.service
[Unit]
Description=etcd docker wrapper
Wants=docker.socket
After=docker.service

[Service]
User=root
PermissionsStartOnly=true
EnvironmentFile=-/etc/etcd.env
ExecStart=/usr/local/bin/etcd
ExecStartPre=-/usr/bin/docker rm -f etcd1
ExecStop=/usr/bin/docker stop etcd1
Restart=always
RestartSec=15s
TimeoutStartSec=30s

[Install]
WantedBy=multi-user.target
ubuntu@node1:~$ systemctl list-dependencies --reverse etcd.service
etcd.service
● └─multi-user.target
●   └─graphical.target

Any ideas? I am really starting to question my sanity...

I uninstalled Docker and reran the playbook, but the output is the same... @ewtang

I also checked time synchronization with timedatectl, in case clock skew was breaking certificate validation; it reports a synchronized status:

                      Local time: Sun 2020-03-29 10:29:50 UTC
                  Universal time: Sun 2020-03-29 10:29:50 UTC
                        RTC time: Sun 2020-03-29 10:29:51
                       Time zone: Etc/UTC (UTC, +0000)
       System clock synchronized: yes
systemd-timesyncd.service active: yes
                 RTC in local TZ: no
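
Clock skew would surface as TLS failures before etcd ever answers, so another quick check is the certificate validity window. The real files live under /etc/ssl/etcd/ssl/ on a Kubespray node; the sketch below generates a throwaway cert so the commands are runnable anywhere:

```shell
#!/bin/sh
# Generate a short-lived demo cert (stand-in for /etc/ssl/etcd/ssl/ca.pem)
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj '/CN=etcd-demo' -keyout "$tmp/key.pem" -out "$tmp/cert.pem" 2>/dev/null

# Print the validity window and check it against the local clock
openssl x509 -in "$tmp/cert.pem" -noout -dates
openssl x509 -in "$tmp/cert.pem" -noout -checkend 0 >/dev/null \
  && echo "cert valid for the current clock"
```

Running the same two x509 commands against ca.pem and member-node1.pem on each node shows whether a cert's notBefore/notAfter window disagrees with that node's clock, in which case TLS verification fails.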

OK, the security group was the problem! Thanks @ewtang

I tried creating an IPv6-only cluster on Fedora CoreOS and hit the same error. All etcd containers have the same endpoint (https://127.0.0.1:2379), and I think that is the problem; when I checked the Docker logs, I saw a similar warning.
I also wonder whether it is possible to create the etcd cluster without Kubespray and run Kubespray only for the master and worker nodes. Any ideas?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
