Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG
Environment:
(printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 3.10.0-862.2.3.el7.x86_64 x86_64
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
(ansible --version): ansible 2.5.2
Kubespray version (commit) (git rev-parse --short HEAD):
REL v2.5.0
commit: 02cd5418
Network plugin used:
default: calico
Copy of your inventory file:
[all]
node1 ansible_host=192.168.140.191 ip=192.168.140.191
node2 ansible_host=192.168.140.192 ip=192.168.140.192
node3 ansible_host=192.168.140.193 ip=192.168.140.193
[kube-master]
node1
node2
[kube-node]
node1
node2
node3
[etcd]
node1
node2
node3
[k8s-cluster:children]
kube-node
kube-master
[calico-rr]
[vault]
node1
node2
node3
Command used to invoke ansible:
ansible-playbook --flush-cache -u myuser -b -i inventory/mycluster/hosts.ini cluster.yml
Output of ansible run:
Errors along the lines of:
fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://192.168.140.191:2379,https://192.168.140.192:2379,https://192.168.140.193:2379 member list | grep -q 192.168.140.191", "delta": "0:00:00.020942", "end": "2018-05-13 18:28:37.103184", "msg": "non-zero return code", "rc": 1, "start": "2018-05-13 18:28:37.082242", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.140.191:2379: getsockopt: connection refused\n; error #1: dial tcp 192.168.140.192:2379: getsockopt: no route to host\n; error #2: dial tcp 192.168.140.193:2379: getsockopt: no route to host", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.140.191:2379: getsockopt: connection refused", "; error #1: dial tcp 192.168.140.192:2379: getsockopt: no route to host", "; error #2: dial tcp 192.168.140.193:2379: getsockopt: no route to host"], "stdout": "", "stdout_lines": []}
and:
fatal: [node2]: FAILED! => {"attempts": 10, "changed": false, "content": "", "msg": "Status code was -1 and not [200]: Request failed: <urlopen error ('_ssl.c:563: The handshake operation timed out',)>", "redirected": false, "status": -1, "url": "https://192.168.140.192:2379/health"}
Anything else we need to know:
Firewall disabled; SSH access and root privileges work for Ansible; sudo swapoff -a was run.
It's fixed in this PR #2577
I am facing the same issue:
TASK [etcd : Configure | Check if etcd cluster is healthy] ********************************
Thursday 17 May 2018 13:53:52 +0000 (0:00:00.508) 0:05:09.972 ***
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node2]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.027081", "end": "2018-05-17 13:54:35.677364", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:29.650283", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node3]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.021069", "end": "2018-05-17 13:54:51.668894", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:45.647825", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.37:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://10.9.0.36:2379,https://10.9.0.37:2379,https://10.9.0.38:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.035036", "end": "2018-05-17 13:54:52.431413", "msg": "non-zero return code", "rc": 1, "start": "2018-05-17 13:54:46.396377", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout\n; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\n; error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout\n\nerror #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout\nerror #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout\nerror #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "; error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "; error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout", "", "error #0: client: endpoint https://10.9.0.37:2379 exceeded header timeout", "error #1: client: endpoint https://10.9.0.36:2379 exceeded header timeout", "error #2: client: endpoint https://10.9.0.38:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT *********************************************
to retry, use: --limit @/etc/ansible/roles/kubespray/cluster.retry
PLAY RECAP *************************************************
localhost : ok=2 changed=0 unreachable=0 failed=0
node1 : ok=177 changed=11 unreachable=0 failed=1
node2 : ok=173 changed=11 unreachable=0 failed=1
node3 : ok=173 changed=11 unreachable=0 failed=1
node4 : ok=152 changed=9 unreachable=0 failed=0
node5 : ok=149 changed=9 unreachable=0 failed=0
node6 : ok=149 changed=9 unreachable=0 failed=0
node7 : ok=149 changed=9 unreachable=0 failed=0
node8 : ok=149 changed=9 unreachable=0 failed=0
node9 : ok=149 changed=9 unreachable=0 failed=0
etcd : Configure | Check if etcd cluster is healthy -------------------------------------------------------------------------------------------------- 60.40s
gather facts from all instances ---------------------------------------------------------------------------------------------------------------------- 18.74s
kubernetes/preinstall : Install packages requirements ------------------------------------------------------------------------------------------------ 17.78s
kubernetes/preinstall : install growpart ------------------------------------------------------------------------------------------------------------- 14.13s
download : Download items ----------------------------------------------------------------------------------------------------------------------------- 8.41s
download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 7.29s
etcd : Configure | Check if etcd cluster is healthy --------------------------------------------------------------------------------------------------- 6.78s
download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.79s
download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.70s
download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.63s
download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.61s
download : Sync container ----------------------------------------------------------------------------------------------------------------------------- 5.60s
download : Download items ----------------------------------------------------------------------------------------------------------------------------- 5.46s
docker : Ensure old versions of Docker are not installed. | RedHat ------------------------------------------------------------------------------------ 5.24s
etcd : Configure | Check if etcd-events cluster is healthy -------------------------------------------------------------------------------------------- 4.71s
kubernetes/preinstall : Hosts | populate inventory into hosts file ------------------------------------------------------------------------------------ 4.21s
download : container_download | Download containers if pull is required or told to always pull (all nodes) -------------------------------------------- 3.82s
docker : ensure docker packages are installed --------------------------------------------------------------------------------------------------------- 3.64s
kubernetes/preinstall : Update package management cache (YUM) - Redhat -------------------------------------------------------------------------------- 3.29s
kubernetes/preinstall : Create kubernetes directories ------------------------------------------------------------------------------------------------- 2.82s
RHEL 7.2 ansible node
ansible==2.4.2.0
RHEL 7.5 Master and agent node
Same issue with Ubuntu 16.04.4
Ansible 2.5.1
Same issue using Vagrant with default configuration on current master
Vagrant 2.1.1
ansible 2.5.3
Had exactly the same issue here,
using CentOS 7 (3.10.0-693.21.1.el7.x86_64)
ansible 2.5.4
Well, your cluster is actually up and running; it's just the health check that is failing. Here is my workaround.
You can manually check your cluster by providing the cert and key values.
SSH to one of the nodes and:
etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health
You should get successful check for all cluster members
Just remove/comment these tasks from the playbooks:
Configure | Check if etcd cluster is healthy
Configure | Check if etcd-events cluster is healthy
The same health check later fails in the playbook at Calico | wait for etcd.
You can also check that by running:
curl --cert /etc/ssl/etcd/ssl/member-node1.pem --key /etc/ssl/etcd/ssl/member-node1-key.pem https://127.0.0.1:2379/health
So also remove/comment the playbook task:
Calico | wait for etcd
Hope this gets fixed soon; I wasted a lot of time figuring this out.
The checks:
Configure | Check if etcd cluster is healthy
Configure | Check if etcd-events cluster is healthy
should not fail if the cluster is healthy and the certificates are present. Removing the checks is not a solution at all.
After investigating this, the only way I could replicate the issue in my case was with incorrect no_proxy env settings and an http_proxy var in /etc/environment.
I removed http_proxy from /etc/environment and fixed the no_proxy environment.
example:
no_proxy: "localhost,127.0.0.1,.local.domain,10.3.0.1,10.3.0.2,10.3.0.4,10.3.0.5,10.3.0.6" # no_proxy for subnets is ignored
You must include all your host IPs in no_proxy when using a proxy.
This was my case; I don't know if it is yours.
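The requirement above can be sanity-checked with a small shell loop. A minimal sketch — the no_proxy value and IP list below are illustrative, not taken from this issue's inventory:

```shell
# Hedged sketch: verify each cluster host IP appears verbatim in no_proxy
# (subnets/CIDRs in no_proxy are ignored by many tools, so list every IP).
no_proxy="localhost,127.0.0.1,.local.domain,10.3.0.1,10.3.0.2"
for ip in 10.3.0.1 10.3.0.2 10.3.0.4; do
  case ",${no_proxy}," in
    *,"$ip",*) echo "$ip ok" ;;
    *)         echo "$ip MISSING from no_proxy" ;;
  esac
done
```

Any MISSING line points at a host that would try to reach etcd through the proxy.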
There is also a strange empty when: here that I removed in my tests:
https://github.com/kubernetes-incubator/kubespray/blob/master/roles/etcd/tasks/main.yml#L9
- include_tasks: "gen_certs_{{ cert_management }}.yml"
  when:
  tags:
    - etcd-secrets
Just to update: the above error is probably due to a firewalld issue.
On a dev environment, just stop and disable the firewalld service.
In production, open all relevant ports (2379, 2380, etc.).
running on Centos 7 Linux 4.17.3-1.el7.elrepo.x86_64 x86_64
I am having the same issue. Disabled firewall and it does not help. Running on CentOS 7
Actually with firewalld disabled it seems that it is starting to work. Does anyone know full list of ports I need to open?
Run on master nodes:
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload
Run on all nodes:
firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp
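The repeated invocations above can be generated from a single port list. A sketch (the all-nodes ports are used as the example); it only echoes the commands, so you can review the output before piping it to a root shell with firewalld running:

```shell
# Sketch: emit the all-nodes firewall-cmd commands from one port list.
# Pipe the output to 'sh' as root to actually apply them.
ports="30000-32767 10250 10255 6783"
for p in $ports; do
  echo "firewall-cmd --permanent --add-port=${p}/tcp"
done
echo "firewall-cmd --reload"
```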
By the way, SELinux is working fine; I did not have to make any adjustments or disable it.
Same issue here
With Centos 7
fatal: [node3]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.5.70:2379,https://192.168.5.71:2379,https://192.168.5.72:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.239063", "end": "2018-07-16 15:27:23.188711", "msg": "non-zero return code", "rc": 1, "start": "2018-07-16 15:27:20.949648", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate has expired or is not yet valid\n; error #1: x509: certificate has expired or is not yet valid\n; error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout\n\nerror #0: x509: certificate has expired or is not yet valid\nerror #1: x509: certificate has expired or is not yet valid\nerror #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate has expired or is not yet valid", "; error #1: x509: certificate has expired or is not yet valid", "; error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout", "", "error #0: x509: certificate has expired or is not yet valid", "error #1: x509: certificate has expired or is not yet valid", "error #2: client: endpoint https://192.168.5.72:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
my configuration
> `[all]
> node1 ansible_host=192.168.5.70 ip=192.168.5.70
> node2 ansible_host=192.168.5.71 ip=192.168.5.71
> node3 ansible_host=192.168.5.72 ip=192.168.5.72
>
> [kube-master]
> node1
>
> [kube-node]
> node2
> node3
>
> [etcd]
> node1
> node2
> node3
>
> [k8s-cluster:children]
> kube-node
> kube-master
>
> [calico-rr]
>
> [vault]
> node1
> node2
> node3
I think I might have the same issue and can't figure out why.
I get both the connection error and complaints about the CA certs being self-signed.
There is a task:
- name: Gen_certs | update ca-certificates (Debian/Ubuntu/Container Linux by CoreOS)
  command: update-ca-certificates
  when: etcd_ca_cert.changed
Not sure it's working as expected. It completes successfully, but there is no update-ca-certificates script on my installation (CoreOS7).
So I'm also stuck on waiting for etcd health status to check out ok.
Will try the workaround of disabling the check task for now. I noticed update-ca-certificates is part of the overlay filesystem of the etcd container. Should that task really be run on the node?
This issue was stopping deployment in 2.6.0, but with 2.7.0 my Ubuntu 18.04 cluster gets deployed. However, there is still an etcd health check failing (it is ignored). As per @ArieLevs, I can confirm that running the etcdctl check with the certs on the command line works. I think the root cause of this error is NOT a firewall issue (although that has the same symptoms); it is a self-signed cert error. If you run etcdctl in debug mode without the certs, it complains: error #0: remote error: tls: bad certificate
The offending check is in file kubespray/roles/etcd/tasks/configure.yml as follows:
- name: Configure | Check if etcd cluster is healthy
  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} cluster-health | grep -q 'cluster is healthy'"
  register: etcd_cluster_is_healthy
  until: etcd_cluster_is_healthy.rc == 0
  retries: 4
  delay: "{{ retry_stagger | random + 3 }}"
  ignore_errors: false
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
I believe the environment variables here are not respected by etcdctl. A better way to do this (and that works) is in the Calico configuration where the certs are passed in via the command line as follows:
- name: Calico | wait for etcd
  uri:
    url: "{{ etcd_access_addresses.split(',') | first }}/health"
    validate_certs: no
    client_cert: "{{ etcd_cert_dir }}/node-{{ inventory_hostname }}.pem"
    client_key: "{{ etcd_cert_dir }}/node-{{ inventory_hostname }}-key.pem"
  register: result
  until: result.status == 200 or result.status == 401
  retries: 10
  delay: 5
  run_once: true
Alternatively, if you want to feed the cli arguments to the shell task:
- name: Configure | Check if etcd cluster is healthy
  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} --cert-file {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key-file {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem cluster-health | grep -q 'cluster is healthy'"
  register: etcd_cluster_is_healthy
  ignore_errors: true
  changed_when: false
  check_mode: no
  when: is_etcd_master and etcd_cluster_setup
  tags:
    - facts
  environment:
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
I'm having the same issue with default vars using vagrant.
I've tried to verify etcd cluster health with admin/member certificates but still get the request-exceeded error.
Is there any progress with this?
Same issue and behavior with Ubuntu 16.04.
Ansible version 2.6.6
$ etcdctl --debug cluster-health
Cluster-Endpoints: http://127.0.0.1:4001, http://127.0.0.1:2379
cURL Command: curl -X GET http://127.0.0.1:4001/v2/members
cURL Command: curl -X GET http://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
; error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout
error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout
I have disabled ufw but no luck. As mentioned above, when I edit my inventory.cfg or hosts.ini file to use only one etcd node, it does not work either.
My understanding from this weird behavior and from debugging it is the following:
- etcd runs with the --net host option, so the container uses the host network.
- etcdctl was invoked with the --no-sync option. Example: etcdctl --no-sync --endpoint http://ip:2379 set /hello world.
- To kill etcd: kill -9 "$(ps aux | grep etcd | grep -v grep | sed 's/^[^ ][^ ]*[ ][ ]*\([0-9][0-9]*\).*$/\1/g')"
etcd2 --name infra1 --initial-advertise-peer-urls http://10.0.0.101:2380 \
--listen-peer-urls http://IP:2380 \
--listen-client-urls http://IP:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://IP:2379 \
--discovery https://discovery.etcd.io/<token>
etcdctl --debug cluster-health
iptables -I INPUT -p tcp -m tcp --dport 2379 -j ACCEPT && iptables -I INPUT -p tcp -m tcp --dport 2380 -j ACCEPT
Finally, I had to do this: delete the old etcd docker image and gcr.io/google_containers/cluster-proportional-autoscaler-amd64 to prevent k8s from pulling back the old etcd image, and then run the docker image manually, including the ssl and certificates path, changing the behavior of the etcd docker container so it runs without the --net host option, gets an IP from the docker0 interface, and exposes the needed ports.
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd:v2.3.8 \
-name etcd0 \
-advertise-client-urls http://IP:2379,http://IP:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
-initial-advertise-peer-urls http://IP:2380 \
-listen-peer-urls http://0.0.0.0:2380 \
-initial-cluster-token etcd-cluster-1 \
-initial-cluster etcd0=http://IP1:2380,etcd1=http://IP2:2380,etcd2=http://IP3:2380 \
-initial-cluster-state new
I am still looking for any workaround/solution for this.
I have seen the very same error :-( It seems that the IP to be used by the etcd is not always appropriate for it. My deployment was on OpenStack+CoreOS using 1 master 2 nodes (pretty plain and basic setup). I have found that while having those exposed to public IP (all 3 nodes have had a floating_ip_address associated) and at the same time having internal IPs (from the internal subnet/lan) then the etcd is configured to use external/floating IP. Unfortunately, such IP is not present at the hosts and does not even make any sense to have such IP behind the router for etcd cluster (of 1 node).
The above was failing every time. When I switched the setup to 1 bastion and 1 master and 2 nodes (neither master nor node having the floating IP associated), then after a little fiddling with the inventory/sample/no-floating.yml and moving it into correct inventory/$CLUSTER/ directory and running both terraform and ansible from the root of the kubespray git repo ... magic happened and the cluster was up and running without any further issue.
To conclude, I would find it nice to have an automated test for OpenStack deployment with a working setup. Even the howto guide should be slightly updated to reflect the actual steps to be done.
To-Do: fix the deployment to work with DNS, w/o bastion and name-based certs (FreeIPA cert-monger would be nice?)
Eventually, I can create a pull request?
@PexMor hitting the same issue, OpenStack + Ubuntu, please take a look at #2606, those changes were approved but not merged
I also have the etcd health task failing. But what is weird: if I run the task manually (after the playbook is done), it works perfectly (setting the env vars and calling cluster-health).
As @ArieLevs said, etcd seems healthy.
In my case I ran this successfully only after checking out the release-2.8 branch instead of using master.
I used the defaults and modified only hosts.ini.
Note that the configuration was exactly the same when I tried release-2.8 and master.
Errors that disappeared:
fatal: [k8s-1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://172.17.8.101:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:00.009745", "end": "2019-02-02 16:15:22.366223", "msg": "non-zero return code", "rc": 1, "start": "2019-02-02 16:15:22.356478", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused\n\nerror #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused", "", "error #0: dial tcp 172.17.8.101:2379: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
This error above was ignored and the build continued.
AND at the end:
fatal: [k8s-1]: FAILED! => {"msg": "The conditional check 'kube_token_auth' failed. The error was: error while evaluating conditional (kube_token_auth): 'kube_token_auth' is undefined\n\nThe error appears to have been in '/Users/music/Documents/git/kubespray/roles/kubernetes/tokens/tasks/check-tokens.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: \"Check_tokens | check if the tokens have already been generated on first master\"\n ^ here\n"}
Ansible 2.6.0
Vagrant 2.0.4
VirtualBox 5.2.26
Ubuntu 18.04
Ran Kubespray from Mac OS El Capitan
> I have seen the very same error :-( It seems that the IP to be used by the etcd is not always appropriate for it. My deployment was on OpenStack+CoreOS using 1 master 2 nodes (pretty plain and basic setup). I have found that while having those exposed to public IP (all 3 nodes have had a floating_ip_address associated) and at the same time having internal IPs (from the internal subnet/lan) then the etcd is configured to use external/floating IP. Unfortunately, such IP is not present at the hosts and does not even make any sense to have such IP behind the router for etcd cluster (of 1 node).
This happened for me, but opening the security group to allow traffic to port 2379 from "everywhere" (laziness) on the master made it possible for itself to connect via the floating IP and the playbook could complete.
Seems to me that the solution is to either not use the floating IP or make sure that the security group allows access to it.
I ran into the same problem when trying to run kubespray against 3 bare metal Centos 7.6 servers.
It turns out that I had not set up the bare metal servers properly, because the system time was not correct on the three different machines. So what was happening was that kubespray generated certificates which had a start time which was greater than the system time on 2 out of 3 of my machines.
I solved this by installing chronyd and starting it on each machine, to set the correct time on each machine. I could have also installed ntpd.
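The clock-skew failure mode described above can be checked directly against a certificate. A minimal sketch, assuming openssl and GNU date are available; a throwaway self-signed cert stands in here for the real /etc/ssl/etcd/ssl/*.pem files:

```shell
# Generate a throwaway self-signed cert (stand-in for the etcd member cert).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-key.pem \
  -out /tmp/demo-cert.pem -days 1 -subj "/CN=demo" 2>/dev/null
# Compare the cert's notBefore against the local clock: a notBefore in the
# future reproduces the "certificate ... not yet valid" etcdctl error.
not_before=$(openssl x509 -in /tmp/demo-cert.pem -noout -startdate | cut -d= -f2)
if [ "$(date -ud "$not_before" +%s)" -gt "$(date -u +%s)" ]; then
  echo "certificate not yet valid: fix system time (chronyd/ntpd)"
else
  echo "certificate start time ok"
fi
```

Run the same notBefore comparison on each etcd node against its member cert to spot the skew before re-running the playbook.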
I hit the same issue: etcd runs only on the master/controller node; on the other nodes it is not running. There is no issue with the firewall (it is not even running): RHEL 7.5 in AWS, no firewalld/iptables.
fatal: [machine-01]: FAILED! => {
"attempts": 4,
"changed": false,
"cmd": "/usr/local/bin/etcdctl --endpoints=https://10.14.5.141:2379,https://10.14.6.49:2379,https://10.14.7.118:2379 cluster-health | grep -q 'cluster is healthy'",
"delta": "0:00:02.018172",
"end": "2019-02-15 17:22:52.241082",
"invocation": {
"module_args": {
"_raw_params": "/usr/local/bin/etcdctl --endpoints=https://10.14.5.141:2379,https://10.14.6.49:2379,https://10.14.7.118:2379 cluster-health | grep -q 'cluster is healthy'",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": null,
"removes": null,
"stdin": null,
"warn": true
}
},
"msg": "non-zero return code",
"rc": 1,
"start": "2019-02-15 17:22:50.222910",
"stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout\n; error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused\n; error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused\n\nerror #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout\nerror #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused\nerror #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused",
"stderr_lines": [
"Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout",
"; error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused",
"; error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused",
"",
"error #0: client: endpoint https://10.14.5.141:2379 exceeded header timeout",
"error #1: dial tcp 10.14.7.118:2379: getsockopt: connection refused",
"error #2: dial tcp 10.14.6.49:2379: getsockopt: connection refused"
],
"stdout": "",
"stdout_lines": []
}
Hey guys,
A possible workaround for this issue is to flush iptables (iptables -F); this works for me.
setup:
CentOS Linux release 7.6.1810 (Core)
kubespray commit: a8dd69cf (git rev-parse --short HEAD)
cni: canal
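Rather than a blanket iptables -F, it may help to first see which rules actually reject the etcd ports. A sketch using a fabricated rules file; on a live host you would feed it iptables-save output (as root) instead:

```shell
# Fabricated example ruleset standing in for 'iptables-save' output.
cat <<'EOF' > /tmp/sample-rules.v4
-A INPUT -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp --dport 2379 -j REJECT
-A INPUT -j DROP
EOF
# Show any rule touching the etcd client/peer ports.
grep -E -e '--dport (2379|2380)' /tmp/sample-rules.v4
```

This narrows the fix to deleting or reordering the offending rule instead of flushing everything.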
I am having the same issue with the latest release (v2.9.0) on Ubuntu 16.04 with the firewall disabled on my machine. Did anyone resolve this issue?
> I am having the same issue with the latest release (v2.9.0) on Ubuntu 16.04 with Firewall disabled on my machine. Did anyone resolve this issue?
Have you tried flushing your iptables?
Hi Vterry, yes I have flushed the iptables and I am still seeing these errors in the following 2 places:
TASK [etcd : Configure | Check if member is in etcd cluster] *****
Wednesday 10 April 2019 14:32:04 -0400 (0:00:00.118) 0:02:17.215 *
fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,https://192.168.19.249:2379 member list | grep -q 192.168.19.247", "delta": "0:00:00.029426", "end": "2019-04-10 14:32:04.720756", "msg": "non-zero return code", "rc": 1, "start": "2019-04-10 14:32:04.691330", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
After above error it continues the playbook but it fails at this place
TASK [etcd : Join Member | Add member to etcd cluster] *******
Wednesday 10 April 2019 14:32:07 -0400 (0:00:00.200) 0:02:20.150 *
FAILED - RETRYING: Join Member | Add member to etcd cluster (4 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (3 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (2 retries left).
FAILED - RETRYING: Join Member | Add member to etcd cluster (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 4, "changed": true, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.19.247:2379,https://192.168.19.248:2379,https://192.168.19.249:2379 member add etcd1 https://192.168.19.247:2380", "delta": "0:00:02.045849", "end": "2019-04-10 14:32:38.279139", "msg": "non-zero return code", "rc": 1, "start": "2019-04-10 14:32:36.233290", "stderr": "client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.19.250:4001: getsockopt: connection refused\n; error #1: client: etcd member https://192.168.19.248:2379 has no leader\n; error #2: dial tcp 192.168.19.250:2379: getsockopt: connection refused\n; error #3: client: etcd member https://192.168.19.249:2379 has no leader", "stderr_lines": ["client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.19.250:4001: getsockopt: connection refused", "; error #1: client: etcd member https://192.168.19.248:2379 has no leader", "; error #2: dial tcp 192.168.19.250:2379: getsockopt: connection refused", "; error #3: client: etcd member https://192.168.19.249:2379 has no leader"], "stdout": "", "stdout_lines": []}
And the playbook just stopped after this error.
I am not sure how to debug this issue further :(
Can you share your hosts.ini and your all.yml?
Hi vterry,
I am pasting my hosts.ini, inventory.ini, and all.yml, as it does not allow me to attach the files. If you can share your email, I can attach the files as well.
Note: I am using inventory.ini because if I use the hosts.ini I get a parse error:
-2.9.0/kubespray-2.9.0/inventory/mycluster/hosts.ini:4: Expected key=value host variable assignment, got: 192.168.19.247
File hosts.ini
all:
  hosts:
    node1:
      access_ip: 192.168.19.247
      ip: 192.168.19.247
      ansible_host: 192.168.19.247
    node2:
      access_ip: 192.168.19.248
      ip: 192.168.19.248
      ansible_host: 192.168.19.248
    node3:
      access_ip: 192.168.19.249
      ip: 192.168.19.249
      ansible_host: 192.168.19.249
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node3:
        node1:
        node2:
    etcd:
      hosts:
        node3:
        node1:
        node2:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
File inventory.ini
[all]
node1 ansible_host=192.168.19.247 ip=192.168.19.247 etcd_member_name=etcd1
node2 ansible_host=192.168.19.248 ip=192.168.19.248 etcd_member_name=etcd2
node3 ansible_host=192.168.19.249 ip=192.168.19.249 etcd_member_name=etcd3
[kube-master]
node1
node2
[etcd]
node1
node2
node3
[kube-node]
node2
node3
[k8s-cluster:children]
kube-master
kube-node
File all.yml
etcd_data_dir: /var/lib/etcd
bin_dir: /usr/local/bin
nginx_kube_apiserver_port: 6443
nginx_kube_apiserver_healthcheck_port: 8081
kube_read_only_port: 10255
ansible_user: tmp1
ansible_password: password
ansible_become_pass: password
Has anyone found a fix for this issue?
I tried deploying Kubernetes in combination with WireGuard. It just didn't work. After some deeper digging, I found out that ip (the public IP) instead of access_ip (the private WireGuard IP) is used as the listening address for etcd.
This commit in my fork fixed it for me: https://github.com/bagbag/kubespray/commit/209eb8a5118bd61a178cd08b7d802100dfd4e32e
@markpenner34
I've just deployed a 3-node cluster on CentOS 7 with kernel 5.1.3-1 and ansible 2.8.0, using the latest kubespray repo (from a week ago), with SELinux on and firewalld on with these rules:
execute on master nodes:
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379/tcp
firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10251/tcp
firewall-cmd --permanent --add-port=10252/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --reload
execute on all nodes:
firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10255/tcp
firewall-cmd --permanent --add-port=6783/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload
If installing Calico open these ports on all nodes:
firewall-cmd --permanent --add-port=179/tcp
firewall-cmd --permanent --add-port=5473/tcp
firewall-cmd --permanent --add-port=4789/udp
firewall-cmd --reload
and it all went perfectly fine.
what is the error you are getting? (please don't bomb with a really long log)
hi @ArieLevs I don't have firewalld installed on the servers. Running Ubuntu 16.04, the latest version of kubespray, and ansible 2.7.10.
It is failing in the etcd/configure role, specifically on the health checks:
error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
Any help would be appreciated.
@markpenner34 I've noticed that this etcd issue involving port 4001 appears to occur on Ubuntu (port 4001 is legacy and, per the etcd documentation, should not be used).
What happens if you ssh to node1 and execute
etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health
Try ports 4001 and 2379 (the certificate file paths may be different on Ubuntu, as this command was executed on CentOS; change to the relevant paths if needed).
By the way, a response of \x15\x03\x01\x00\x02\x02 means a non-HTTPS request.
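A sketch of trying both ports in one loop (the cert paths are the CentOS defaults from the command above; node1 in the file names is an assumption, adjust to your member name):

```shell
# Probe the etcd v2 client endpoint on both the current and the legacy port.
CERTS=/etc/ssl/etcd/ssl
for port in 2379 4001; do
  echo "== checking https://127.0.0.1:${port} =="
  etcdctl --endpoints "https://127.0.0.1:${port}" \
    --ca-file="${CERTS}/ca.pem" \
    --cert-file="${CERTS}/member-node1.pem" \
    --key-file="${CERTS}/member-node1-key.pem" \
    --debug cluster-health
done
```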
@ArieLevs
node1 is not an etcd node. This is my cluster.yml; is this correct?
When I run sudo lsof -i:2379 on the etcd nodes I can see that no ports are listening.
However, when I run that inside the docker container running etcd, I can see the ports are listening correctly.
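To compare what is listening on the host versus inside the container, something like this may help (the container name etcd1 is an assumption; check docker ps for the real one, and netstat may be absent from minimal images):

```shell
# On the host: is anything bound to the etcd client port?
sudo lsof -i :2379
sudo ss -tlnp | grep 2379
# Find the etcd container, then check listeners inside it
# ("etcd1" is a guess; verify with `docker ps`).
docker ps --filter name=etcd
docker exec etcd1 netstat -tlnp 2>/dev/null | grep 2379
```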
@markpenner34
The config files look different for me; I use the official ones from https://github.com/kubernetes-sigs/kubespray#usage
So my inventory.ini only contains (everything else is commented out):
[k8s-cluster:children]
kube-master
kube-node
And the node information is declared in the hosts.yml file.
I'm sorry I cannot assist much more, as I've never deployed k8s (using kubespray) on Ubuntu.
same issue on:
Hi, we are trying to create an etcd cluster but we are facing the following error:
Error: client: etcd cluster is unavailable or misconfigured
error #0: dial tcp 192.168.2.139:2379: getsockopt: connection refused
Please help us get past this; thanks in advance.
We are following the below link to implement Kubernetes on bare metal.
My test is as follows.
The captioned issue resulted in:
fatal: [k8s-1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://172.17.8.101:2379,https://172.17.8.102:2379,https://172.17.8.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:06.027821", "end": "2019-07-23 00:28:29.583827", "msg": "non-zero return code", "rc": 1, "start": "2019-07-23 00:28:23.556006", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout\n; error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout\n; error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout\n\nerror #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout\nerror #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout\nerror #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout", "; error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout", "; error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout", "", "error #0: client: endpoint https://172.17.8.103:2379 exceeded header timeout", "error #1: client: endpoint https://172.17.8.101:2379 exceeded header timeout", "error #2: client: endpoint https://172.17.8.102:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
The VMs have two network interfaces: eth0 for the public network and eth1 for the private network. The issue is fixed if access_ip is assigned the public network IP and access_ip is used instead of ip as the etcd_address.
ok: [k8s-1] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.715343", "end": "2019-07-23 01:28:09.521148", "rc": 0, "start": "2019-07-23 01:28:04.805805", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
ok: [k8s-2] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:01.726660", "end": "2019-07-23 01:28:09.588888", "rc": 0, "start": "2019-07-23 01:28:07.862228", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
ok: [k8s-3] => {"attempts": 2, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.121.208:2379,https://192.168.121.108:2379,https://192.168.121.186:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.637851", "end": "2019-07-23 01:28:09.587249", "rc": 0, "start": "2019-07-23 01:28:04.949398", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Is this a proper fix?
I'm also experiencing this problem while using kubespray to try to deploy a 2-node Kubernetes cluster on OpenStack instances running Ubuntu 18.04.
How to reproduce:
create 2 OpenStack instances running Ubuntu 18.04
follow the instructions in Kubespray's Quick Start section, setting ip to the node's private IP and access_ip to the node's floating IP, and also setting the node's ansible_user.
run ansible-playbook as stated in the Quick Start guide.
Here are the contents of ./inventory/mycluster/hosts.yml:
all:
  hosts:
    node1:
      ansible_user: myuser
      ansible_host: 185.178.87.56
      ip: 192.168.0.8
      access_ip: 185.178.87.56
    node2:
      ansible_user: myuser
      ansible_host: 185.178.87.47
      ip: 192.168.0.9
      access_ip: 185.178.87.47
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node1:
        node2:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
Result:
TASK [etcd : Configure | Check if etcd cluster is healthy] **************************************************************************************************************************************************************
Thursday 01 August 2019 15:47:02 +0100 (0:00:00.023) 0:02:44.977 *******
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://185.178.87.56:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.018130", "end": "2019-08-01 14:47:30.642472", "msg": "non-zero return code", "rc": 1, "start": "2019-08-01 14:47:28.624342", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout\n\nerror #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout", "", "error #0: client: endpoint https://185.178.87.56:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT *****************************************************************************************************************************
to retry, use: --limit @/home/rmam/development/CORDS/other/creodias_kubespray/kubespray/cluster.retry
PLAY RECAP *****************************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0
node1 : ok=462 changed=12 unreachable=0 failed=1
node2 : ok=312 changed=9 unreachable=0 failed=0
For my test with the vagrant libvirt provider, the problem turned out to be that the IP address 172.17.8.1 of the private (virtual) network is occasionally used as the source IP in the TLS handshake instead of the host IPs 172.17.8.10x of the etcd cluster nodes.
<network ipv6='yes'>
  <name>kubespray0</name>
  <uuid>a502bbbb-7118-4e4a-8443-7ae1195dc93d</uuid>
  <forward mode='nat'/>
  <bridge name='virbr2' stp='on' delay='0'/>
  <mac address='52:54:00:43:3f:ac'/>
  <ip address='172.17.8.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='172.17.8.1' end='172.17.8.254'/>
    </dhcp>
  </ip>
</network>
The workaround in that case is to add the relevant IP to the following setting:
etcd_cert_alt_ips: [172.17.8.1]
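To check whether a given source IP is already covered by the etcd member certificate, the SANs can be inspected with openssl (the cert path matches the one used elsewhere in this thread; the member file name is node-specific and an assumption here):

```shell
# Print the Subject Alternative Names of the etcd member certificate;
# the client's source IP must appear among the "IP Address:" entries.
openssl x509 -in /etc/ssl/etcd/ssl/member-node1.pem -noout -text \
  | grep -A1 'Subject Alternative Name'
```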
Hi, we are facing the same issue while deploying a 3 master 2 worker Kubernetes cluster on Azure.
Kubernetes Version: 1.15.3
Node Type : Azure VM
OS : CoreOs 1967.6.0
Kubespray : release-2.11
ETCD Version: 3.3.10
Surprisingly, the cluster setup worked fine a few days back with this code.
Any help would be greatly appreciated.
Following is the etcd status on a Node
$ etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member.pem --key-file=/etc/ssl/etcd/ssl/member-key.pem --debug cluster-health
Cluster-Endpoints: https://127.0.0.1:2379
cURL Command: curl -X GET https://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: connect: connection refused
error #0: dial tcp 127.0.0.1:2379: connect: connection refused
etcd gets configured with the wrong IP addresses and keeps crashing:
ETCD_INITIAL_CLUSTER=master01=https://x.y.z.47:2380,master02=https://x.y.z.46:2380,master03=https://x.y.z.45:2380
...
Sep 19 02:37:43 kubemaster01 etcd[12077]: 2019-09-19 02:37:43.056677 C | etcdmain: listen tcp x.y.z.47:2380: bind: cannot assign requested address
Sep 19 02:37:43 kubemaster01 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 02:37:43 kubemaster01 systemd[1]: etcd.service: Failed with result 'exit-code'.
while the IP addresses of the masters are different:
[all]
kubemaster01 ansible_host=x.y.z.44 node_name=kubemaster01 etcd_member_name=master01
kubemaster02 ansible_host=x.y.z.45 node_name=kubemaster02 etcd_member_name=master02
kubemaster03 ansible_host=x.y.z.43 node_name=kubemaster03 etcd_member_name=master03
kubenode01 ansible_host=x.y.z.47 node_name=kubenode01
kubenode02 ansible_host=x.y.z.46 node_name=kubenode02
bastion101 ansible_host=bastion101
[bastion]
bastion101
[master]
kubemaster01
kubemaster02
kubemaster03
[etcd]
kubemaster01
kubemaster02
kubemaster03
[node]
kubenode01
kubenode02
[k8s-cluster:children]
master
node
[kube-master:children]
master
[kube-node:children]
node
[calico-rr]
[vault]
kubemaster01
kubemaster02
kubemaster03
We found the root cause: the issue was caused by stale Ansible cache files.
The IPs of the VMs changed when we recreated them using Terraform scripts.
Ansible was failing to overwrite its cached JSON files with the new IP information (due to a permission issue),
so the IP information was read from the old cache.
Moved this comment to https://github.com/kubernetes-sigs/kubespray/issues/5118#issuecomment-533837327 as I think it is actually that bug and not this one.
Please check if docker is installed on the vagrant host. If so, please uninstall it and reboot, then try again.
etcd fails to start.
After restarting etcd and watching the logs, I found that etcd was listening on port 2379, yet it could not be reached.
Digging deeper, I found that kube-proxy had claimed port 2379, so I inferred that someone had created a Service whose NodePort took port 2379.
iptables-save > a
cat a
Searching the output for 2379 shows which NodePort is occupying it.
From how kube-proxy works, if a NodePort Service has no running endpoints, kube-proxy writes a -j REJECT rule to refuse the traffic.
So even though kube-proxy itself is not bound to 2379, connections to the 2379 that etcd listens on are still rejected.
Solution:
Idea: get etcd started and delete that NodePort Service.
Steps 3 and 4 are meant to keep kube-proxy from starting, so it cannot modify iptables.
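If a NodePort Service has claimed etcd's client port, a sketch of finding and removing it (the jsonpath format and the namespace/name placeholders below are assumptions):

```shell
# Look for REJECT/NodePort rules touching 2379 in the saved ruleset.
iptables-save | grep 2379
# List every service's nodePorts to spot the one that claimed 2379.
kubectl get svc --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.ports[*].nodePort}{"\n"}{end}' \
  | grep 2379
# Then delete the offending service (placeholders):
# kubectl -n <ns> delete svc <name>
```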
(quoting @ArieLevs's firewalld port list from above)
Thank you, bro =)
Maybe Ansible should run this (add the firewall rules)?
I've spent a lot of time finding the solution. I think kubespray should do everything I need to install K8s.
Same error. In my case, it was because NTP was not synced; I just synced the time on all nodes with "ntpdate time.windows.com". After that, it works.
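Assuming working Ansible connectivity, the time sync can be applied to all nodes in one ad-hoc command (the inventory path and the hwclock step are assumptions; ntpdate and the time server are from the comment above):

```shell
# Sync the clock on every inventory host and persist it to the hardware clock.
ansible all -i inventory/mycluster/hosts.ini -b -m shell \
  -a "ntpdate time.windows.com && hwclock --systohc"
```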
You can try resetting --data_dir.
(quoting @ArieLevs's firewalld port list from above)
@GRomR1
Thank you. This helped a lot. I ran into issues with one of the RHEL clusters (different from my first cluster, where kubespray just worked fine without any hitches). In the second cluster, I ran into the no-route/connection-refused issues ... drove me crazy until I saw your post ... I manually opened up the ports as you described and that worked. Thanks a lot!
(quoting the Ansible fact-cache root-cause comment from above)
Thanks for this! I was rebuilding my cluster with Terraform as well. It seems that JSON fact_caching is enabled in ansible.cfg:
fact_caching = jsonfile
fact_caching_connection = /tmp
https://docs.ansible.com/ansible/latest/plugins/cache/jsonfile.html
Default timeout is 24 hours.
fact_caching_timeout = 86400
If you are re-deploying hosts with the same hostnames but different IP addresses within the 24-hour timeout, you should expect this error (unless you clear the cache in /tmp).
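With those settings (jsonfile cache under /tmp), stale facts can be cleared before re-running the playbook; the cache files are named after the inventory hostnames (node1 etc. are placeholders here), or the cache can be bypassed for one run with --flush-cache:

```shell
# Remove the cached fact files for the cluster nodes...
rm -f /tmp/node1 /tmp/node2 /tmp/node3
# ...or bypass the fact cache for this run only.
ansible-playbook --flush-cache -i inventory/mycluster/hosts.ini cluster.yml
```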
(quoting the Ansible fact-cache root-cause comment from above)
I'm having the same issue on an Azure VM with Kubespray 2.12.3.
Are you using weave as the network plugin?
I've already cleaned up the caches and the issue is still here. :(
Even if I try from inside the VM:
/usr/local/bin/etcdctl --no-sync --endpoints=https://127.0.0.1:2379 cluster-health | grep -q 'cluster is healthy'
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout
error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout
Any ideas?
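When etcdctl times out even against localhost, the etcd logs on that node usually show why. A sketch, assuming etcd runs as a systemd unit named etcd (it may instead run as a docker container, hence the second pair of commands; the container name etcd1 is an assumption):

```shell
# Is the etcd service up, and what are its last log lines?
systemctl status etcd
journalctl -u etcd --no-pager -n 50
# If etcd runs in a container instead, check there
# ("etcd1" is a guess; verify with `docker ps -a`).
docker ps -a | grep etcd
docker logs --tail 50 etcd1
```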
I have the same problem! Everything works until etcd is tested.
I use the basic Kubespray script from release 2.12.5 (I also tested the master branch and 2.12.4; same issue) on 3 x Ubuntu 18.04 instances on an OpenStack cloud. I use Terraform to create the VMs and a basic security group enabling the following ports:
# Security Group
resource "opentelekomcloud_compute_secgroup_v2" "secgroup" {
  name        = "secgroup"
  description = "Security group for the Terraform example instances"

  rule {
    from_port   = 22
    to_port     = 22
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = 80
    to_port     = 80
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = 8081
    to_port     = 8081
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = 8080
    to_port     = 8080
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = 2397
    to_port     = 2397
    ip_protocol = "tcp"
    cidr        = "0.0.0.0/0"
  }

  rule {
    from_port   = -1
    to_port     = -1
    ip_protocol = "icmp"
    cidr        = "0.0.0.0/0"
  }
}
[Deployment machine is in the same subnet]
Output of the standard Kubespray deploy script via ansible:
TASK [etcd : Configure | Check if etcd cluster is healthy] *******************************************************************
Saturday 28 March 2020 13:32:03 +0000 (0:00:00.381) 0:07:20.485 ********
fatal: [node1]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.012101", "end": "2020-03-28 13:32:09.021030", "msg": "non-zero return code", "rc": 1, "start": "2020-03-28 13:32:05.008929", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.102:2379 exceeded header timeout\n; error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout\n; error #2: dial tcp 192.168.1.101:2379: connect: connection refused\n\nerror #0: client: endpoint https://192.168.1.102:2379 exceeded header timeout\nerror #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout\nerror #2: dial tcp 192.168.1.101:2379: connect: connection refused", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.102:2379 exceeded header timeout", "; error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout", "; error #2: dial tcp 192.168.1.101:2379: connect: connection refused", "", "error #0: client: endpoint https://192.168.1.102:2379 exceeded header timeout", "error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout", "error #2: dial tcp 192.168.1.101:2379: connect: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [node2]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.011963", "end": "2020-03-28 13:32:09.075109", "msg": "non-zero return code", "rc": 1, "start": "2020-03-28 13:32:05.063146", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout\n; error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout\n; error #2: dial tcp 192.168.1.102:2379: connect: connection refused\n\nerror #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout\nerror #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout\nerror #2: dial tcp 192.168.1.102:2379: connect: connection refused", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "; error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout", "; error #2: dial tcp 192.168.1.102:2379: connect: connection refused", "", "error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "error #1: client: endpoint https://192.168.1.103:2379 exceeded header timeout", "error #2: dial tcp 192.168.1.102:2379: connect: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [node3]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://192.168.1.101:2379,https://192.168.1.102:2379,https://192.168.1.103:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:04.012279", "end": "2020-03-28 13:32:09.106577", "msg": "non-zero return code", "rc": 1, "start": "2020-03-28 13:32:05.094298", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.1.103:2379: connect: connection refused\n; error #1: client: endpoint https://192.168.1.101:2379 exceeded header timeout\n; error #2: client: endpoint https://192.168.1.102:2379 exceeded header timeout\n\nerror #0: dial tcp 192.168.1.103:2379: connect: connection refused\nerror #1: client: endpoint https://192.168.1.101:2379 exceeded header timeout\nerror #2: client: endpoint https://192.168.1.102:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 192.168.1.103:2379: connect: connection refused", "; error #1: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "; error #2: client: endpoint https://192.168.1.102:2379 exceeded header timeout", "", "error #0: dial tcp 192.168.1.103:2379: connect: connection refused", "error #1: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "error #2: client: endpoint https://192.168.1.102:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
...ignoring
TASK [etcd : Configure | Check if etcd-events cluster is healthy] ************************************************************
Saturday 28 March 2020 13:32:09 +0000 (0:00:05.332) 0:07:25.817 ********
TASK [etcd : include_tasks] **************************************************************************************************
Saturday 28 March 2020 13:32:09 +0000 (0:00:00.144) 0:07:25.962 ********
included: /home/ubuntu/kubespray-2.12.5/roles/etcd/tasks/refresh_config.yml for node1, node2, node3
TASK [etcd : Refresh config | Create etcd config file] ***********************************************************************
Saturday 28 March 2020 13:32:09 +0000 (0:00:00.223) 0:07:26.186 ********
changed: [node1]
changed: [node2]
changed: [node3]
TASK [etcd : Refresh config | Create etcd-events config file] ****************************************************************
Saturday 28 March 2020 13:32:11 +0000 (0:00:01.923) 0:07:28.109 ********
TASK [etcd : Configure | Copy etcd.service systemd file] *********************************************************************
Saturday 28 March 2020 13:32:11 +0000 (0:00:00.143) 0:07:28.253 ********
changed: [node1]
changed: [node2]
changed: [node3]
TASK [etcd : Configure | Copy etcd-events.service systemd file] **************************************************************
Saturday 28 March 2020 13:32:12 +0000 (0:00:00.843) 0:07:29.096 ********
TASK [etcd : Configure | reload systemd] *************************************************************************************
Saturday 28 March 2020 13:32:12 +0000 (0:00:00.150) 0:07:29.246 ********
ok: [node1]
ok: [node2]
ok: [node3]
TASK [etcd : Configure | Ensure etcd is running] *****************************************************************************
Saturday 28 March 2020 13:32:13 +0000 (0:00:00.754) 0:07:30.001 ********
changed: [node1]
changed: [node2]
changed: [node3]
TASK [etcd : Configure | Ensure etcd-events is running] **********************************************************************
Saturday 28 March 2020 13:32:14 +0000 (0:00:00.925) 0:07:30.926 ********
TASK [etcd : Configure | Check if etcd cluster is healthy] *******************************************************************
Saturday 28 March 2020 13:32:14 +0000 (0:00:00.171) 0:07:31.097 ********
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (4 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (3 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (2 retries left).
FAILED - RETRYING: Configure | Check if etcd cluster is healthy (1 retries left).
fatal: [node1 -> 192.168.1.101]: FAILED! => {"attempts": 4, "changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync --endpoints=https://192.168.1.101:2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:02.015048", "end": "2020-03-28 13:32:41.288268", "msg": "non-zero return code", "rc": 1, "start": "2020-03-28 13:32:39.273220", "stderr": "Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout\n\nerror #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "stderr_lines": ["Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout", "", "error #0: client: endpoint https://192.168.1.101:2379 exceeded header timeout"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT ***********************************************************************************************************
to retry, use: --limit @/home/ubuntu/kubespray-2.12.5/cluster.retry
PLAY RECAP *******************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0
node1 : ok=477 changed=64 unreachable=0 failed=1
node2 : ok=432 changed=61 unreachable=0 failed=0
node3 : ok=363 changed=55 unreachable=0 failed=0
Saturday 28 March 2020 13:32:41 +0000 (0:00:26.878) 0:07:57.976 ********
===============================================================================
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------ 26.88s
container-engine/docker : ensure docker packages are installed ------------------------------------------------------- 24.76s
download : download_container | Download image if required ------------------------------------------------------------ 8.70s
kubernetes/preinstall : Install packages requirements ----------------------------------------------------------------- 8.66s
etcd : Gen_certs | Write etcd master certs ---------------------------------------------------------------------------- 6.99s
download : download_container | Download image if required ------------------------------------------------------------ 6.56s
download : download_container | Download image if required ------------------------------------------------------------ 6.08s
download : download | Download files / images ------------------------------------------------------------------------- 5.63s
download : download_container | Download image if required ------------------------------------------------------------ 5.60s
container-engine/docker : ensure docker-ce repository is enabled ------------------------------------------------------ 5.42s
etcd : Configure | Check if etcd cluster is healthy ------------------------------------------------------------------- 5.33s
etcd : Gen_certs | Gather etcd master certs --------------------------------------------------------------------------- 5.09s
download : download_container | Download image if required ------------------------------------------------------------ 4.83s
download : download_container | Download image if required ------------------------------------------------------------ 4.74s
download : download_container | Download image if required ------------------------------------------------------------ 4.51s
download : download_file | Download item ------------------------------------------------------------------------------ 4.38s
download : download_container | Download image if required ------------------------------------------------------------ 4.12s
download : download_container | Download image if required ------------------------------------------------------------ 4.10s
download : download_file | Download item ------------------------------------------------------------------------------ 3.98s
download : download | Sync files / images from ansible host to nodes -------------------------------------------------- 3.89s
[node1 master]
Locally on the master node, I get the following output from this command:
sudo etcdctl --endpoints https://127.0.0.1:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/member-node1.pem --key-file=/etc/ssl/etcd/ssl/member-node1-key.pem --debug cluster-health
Cluster-Endpoints: https://127.0.0.1:2379
cURL Command: curl -X GET https://127.0.0.1:2379/v2/members
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout
error #0: client: endpoint https://127.0.0.1:2379 exceeded header timeout
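The "exceeded header timeout" above means the TCP connection opens but etcd never answers, which is different from "connection refused" (nothing listening). As a sketch, you can probe the members endpoint directly with curl using the same member cert paths shown in the command above (the `probe_etcd` helper name is mine, not from Kubespray):

```shell
# Probe an etcd endpoint directly with curl, bypassing etcdctl.
# - "connection refused"        -> nothing is listening on the port
# - timeout / no response       -> port reachable but etcd hung, or a
#                                  firewall silently dropping packets
# Cert paths match the etcdctl invocation above on node1.
probe_etcd() {
  local endpoint="$1"
  curl -sS -m 3 \
    --cacert /etc/ssl/etcd/ssl/ca.pem \
    --cert   /etc/ssl/etcd/ssl/member-node1.pem \
    --key    /etc/ssl/etcd/ssl/member-node1-key.pem \
    "${endpoint}/v2/members"
}

# Example (run on node1):
# probe_etcd https://127.0.0.1:2379
```

Comparing the result for 127.0.0.1 against the node's external IP helps separate a wedged etcd process from a network/firewall problem.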
[node1 master]
My netstat -l produces the following output on the master node:
ubuntu@node1:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:domain 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:ssh 0.0.0.0:* LISTEN
tcp 5 0 node1.cluster.loca:2379 0.0.0.0:* LISTEN
tcp 1 0 localhost:2379 0.0.0.0:* LISTEN
tcp 0 0 node1.cluster.loca:2380 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:sunrpc 0.0.0.0:* LISTEN
tcp6 0 0 [::]:ssh [::]:* LISTEN
tcp6 0 0 [::]:sunrpc [::]:* LISTEN
udp 0 0 localhost:domain 0.0.0.0:*
udp 0 0 node1.cluster.lo:bootpc 0.0.0.0:*
udp 0 0 0.0.0.0:sunrpc 0.0.0.0:*
udp 0 0 0.0.0.0:648 0.0.0.0:*
udp6 0 0 [::]:sunrpc [::]:*
udp6 0 0 [::]:648 [::]:*
raw6 0 0 [::]:ipv6-icmp [::]:* 7
Active UNIX domain sockets (only servers)
Proto RefCnt Flags Type State I-Node Path
unix 2 [ ACC ] SEQPACKET LISTENING 13595 /run/udev/control
unix 2 [ ACC ] STREAM LISTENING 65462 /run/user/1000/systemd/private
unix 2 [ ACC ] STREAM LISTENING 65466 /run/user/1000/gnupg/S.gpg-agent.ssh
unix 2 [ ACC ] STREAM LISTENING 65467 /run/user/1000/gnupg/S.gpg-agent.extra
unix 2 [ ACC ] STREAM LISTENING 65468 /run/user/1000/snapd-session-agent.socket
unix 2 [ ACC ] STREAM LISTENING 65469 /run/user/1000/gnupg/S.gpg-agent
unix 2 [ ACC ] STREAM LISTENING 65470 /run/user/1000/gnupg/S.gpg-agent.browser
unix 2 [ ACC ] STREAM LISTENING 65471 /run/user/1000/gnupg/S.dirmngr
unix 2 [ ACC ] STREAM LISTENING 18326 /run/acpid.socket
unix 2 [ ACC ] STREAM LISTENING 13583 /run/systemd/private
unix 2 [ ACC ] STREAM LISTENING 13591 /run/rpcbind.sock
unix 2 [ ACC ] STREAM LISTENING 13593 /run/lvm/lvmpolld.socket
unix 2 [ ACC ] STREAM LISTENING 18338 /var/run/dbus/system_bus_socket
unix 2 [ ACC ] STREAM LISTENING 13597 /run/systemd/journal/stdout
unix 2 [ ACC ] STREAM LISTENING 18346 /run/uuidd/request
unix 2 [ ACC ] STREAM LISTENING 18348 /run/snapd.socket
unix 2 [ ACC ] STREAM LISTENING 18350 /run/snapd-snap.socket
unix 2 [ ACC ] STREAM LISTENING 14190 /run/lvm/lvmetad.socket
unix 2 [ ACC ] STREAM LISTENING 14340 /run/systemd/fsck.progress
unix 2 [ ACC ] STREAM LISTENING 47817 /var/run/docker.sock
unix 2 [ ACC ] STREAM LISTENING 43013 /run/containerd/containerd.sock
unix 2 [ ACC ] STREAM LISTENING 47961 /var/run/docker.sock
unix 2 [ ACC ] STREAM LISTENING 47983 /var/run/docker/metrics.sock
unix 2 [ ACC ] STREAM LISTENING 48022 /run/docker/libnetwork/b668a8da96559ea4e366f5be123f78ca5a44559fa35150e9691e4c88c1ac22be.sock
unix 2 [ ACC ] STREAM LISTENING 60550 @/containerd-shim/moby/37e72223941a700e6b07933021dc1cb40ffa96e065ab83fa4a5597af9bff8039/shim.sock@
unix 2 [ ACC ] STREAM LISTENING 18352 @ISCSIADM_ABSTRACT_NAMESPACE
unix 2 [ ACC ] STREAM LISTENING 18353 /var/lib/lxd/unix.socket
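One detail worth noticing in the listing above: the 2379 listeners show a non-zero Recv-Q (5 on node1.cluster.loca:2379), i.e. connections are queued on the listen socket but apparently never accepted, which would fit an etcd process that is up but hung rather than unreachable. A small stdin filter (my own helper name, not a standard tool) makes it easy to watch just the etcd ports:

```shell
# Filter a `netstat -ltn` listing down to the etcd ports:
# 2379 (client traffic) and 2380 (peer traffic).
# Field 4 of netstat's TCP lines is the local address:port.
filter_etcd_ports() {
  awk '$4 ~ /:(2379|2380)$/'
}

# Example usage on an etcd node:
# netstat -ltn | filter_etcd_ports
```

A persistently growing Recv-Q on these lines while etcdctl times out is a strong hint the problem is the daemon itself, not DNS or routing.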
[node1 master]
And the etcd service data:
ubuntu@node1:~$ systemctl cat etcd.service
# /etc/systemd/system/etcd.service
[Unit]
Description=etcd docker wrapper
Wants=docker.socket
After=docker.service
[Service]
User=root
PermissionsStartOnly=true
EnvironmentFile=-/etc/etcd.env
ExecStart=/usr/local/bin/etcd
ExecStartPre=-/usr/bin/docker rm -f etcd1
ExecStop=/usr/bin/docker stop etcd1
Restart=always
RestartSec=15s
TimeoutStartSec=30s
[Install]
WantedBy=multi-user.target
ubuntu@node1:~$ systemctl list-dependencies --reverse etcd.service
etcd.service
● └─multi-user.target
● └─graphical.target
Any ideas? I am really starting to question my sanity...
Please check this:
https://github.com/kubernetes-sigs/kubespray/issues/2767#issuecomment-533844383
I uninstalled Docker and re-ran the playbook, but I get the same output... @ewtang
I also checked time sync with timedatectl, in case clock skew was making the certs fail validation; it reports a synced status:
Local time: Sun 2020-03-29 10:29:50 UTC
Universal time: Sun 2020-03-29 10:29:50 UTC
RTC time: Sun 2020-03-29 10:29:51
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: yes
RTC in local TZ: no
OK, the security group was the problem! Thanks @ewtang
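Since the root cause here turned out to be a cloud security-group rule, a quick reachability check of the etcd ports between nodes would have shortened the debugging. A minimal sketch using bash's built-in /dev/tcp (the `check_port` helper and the 192.168.1.102/103 addresses are illustrative; only 192.168.1.101 appears in the log above):

```shell
# Check whether a TCP port on a host is reachable within 2 seconds.
# Prints "open" when the connection succeeds, "closed" otherwise.
# "closed" for 2379/2380 between etcd nodes points at a firewall or
# security-group rule rather than an etcd misconfiguration.
check_port() {
  local host="$1" port="$2"
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

# Example: sweep the etcd client and peer ports across the cluster.
# for h in 192.168.1.101 192.168.1.102 192.168.1.103; do
#   for p in 2379 2380; do
#     printf '%s:%s %s\n' "$h" "$p" "$(check_port "$h" "$p")"
#   done
# done
```

etcd needs 2379 and 2380 open between all etcd members, and 2379 open from the master nodes.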
I tried creating an IPv6-only cluster on Fedora CoreOS and hit the same error. All etcd containers advertise the same IP (https://127.0.0.1:2379), and I think that is the problem. When I checked the Docker logs, I saw a similar warning.
I also wonder whether it is possible to create the etcd cluster without Kubespray and run Kubespray only for the master and worker nodes. Any ideas?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Most helpful comment
Run on master nodes:
Run on all nodes:
BTW, SELinux is working fine; I did not have to make any adjustments or disable it.