Kubespray: etcd remove-node failing and ignoring the errors

Created on 26 Aug 2020 · 3 Comments · Source: kubernetes-sigs/kubespray

Environment:

Cloud provider or hardware configuration:
AWS EC2 instances

OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 3.10.0-1062.12.1.el7.x86_64 x86_64
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Version of Ansible (ansible --version):
ansible 2.9.6
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/managedxd/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/managedxd/.local/lib/python3.6/site-packages/ansible
executable location = /usr/local/bin/ansible
python version = 3.6.8 (default, Apr 2 2020, 13:34:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Version of Python (python --version):
Python 2.7.5

Kubespray version (commit) (git rev-parse --short HEAD):
5e22574402d86a7087fb146fc6f7b32c8fa80088 ../../3rdparty/kubespray (v2.1.0-3949-g5e225744)

Network plugin used:
Calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
https://gist.github.com/lacebal/554ea05590d459ced735af86fef2583c

Command used to invoke ansible:
ansible-playbook -v -i kubernetes-inventory -e '{"cloud_provider":"","delete_nodes_confirmation":"yes","node":"ip-10-195-56-75.eu-west-1.compute.internal","reset_nodes":false}' -e '{"ansible_python_interpreter":"/usr/bin/python2.7","ansible_user":"centos","kube_apiserver_port":"8081"}' --become --become-user=root kubespray/remove-node.yml

Output of ansible run:
https://gist.github.com/lacebal/239609ad06bc7dd1dfb28f2dab70ca0a

Anything else do we need to know:
The etcd member is never removed from the cluster and, worse, the playbook finishes OK even though the member has not been removed, which leaves the cluster totally unusable if the EC2 instance is removed.

The role remove-etcd-node/tasks/main.yml is behaving suspiciously:
`---

- name: Lookup node IP in kubernetes
  shell: >-
    {{ bin_dir }}/kubectl get nodes {{ node }}
    -o jsonpath='{range.status.addresses[?(@.type=="InternalIP")]}{.address}{"\n"}{end}'
  register: remove_node_ip
  when:
    - inventory_hostname in groups['etcd']
    - ip is not defined
    - access_ip is not defined
  delegate_to: "{{ groups['etcd']|first }}"
  failed_when: false

- name: Set node IP
  set_fact:
    node_ip: "{{ ip | default(access_ip | default(remove_node_ip.stdout)) | trim }}"
  when:
    - inventory_hostname in groups['etcd']

- name: Lookup etcd member id
  shell: "{{ bin_dir }}/etcdctl --no-sync member list | grep {{ node_ip }} | cut -d: -f1"
  register: etcd_member_id
  ignore_errors: true
  changed_when: false
  check_mode: no
  tags:
    - facts
  environment:
    ETCDCTL_API: 2
    ETCDCTL_ENDPOINTS: "{{ etcd_access_addresses }}"
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ groups['etcd']|first }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ groups['etcd']|first }}-key.pem"
    ETCDCTL_CA_FILE: "{{ etcd_cert_dir }}/ca.pem"
  delegate_to: "{{ groups['etcd']|first }}"
  when:
    - inventory_hostname in groups['etcd']

- name: Remove etcd member from cluster
  shell: "{{ bin_dir }}/etcdctl --no-sync member remove {{ etcd_member_id.stdout }}"
  register: etcd_member_in_cluster
  ignore_errors: false
  retries: 6
  delay: 5
  until: etcd_member_in_cluster.rc == 0
  changed_when: false
  check_mode: no
  tags:
    - facts
  environment:
    ETCDCTL_API: 2
    ETCDCTL_ENDPOINTS: "{{ etcd_access_addresses }}"
    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ groups['etcd']|first }}.pem"
    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ groups['etcd']|first }}-key.pem"
    ETCDCTL_CA_FILE: "{{ etcd_cert_dir }}/ca.pem"
  delegate_to: "{{ groups['etcd']|first }}"
  when:
    - inventory_hostname in groups['etcd']
    - etcd_member_id.stdout | length > 0
  `

kubectl is not installed on etcd nodes, so the first task fails unless the etcd node is on the same machine as a Kubernetes master. Even when executed on a k8s master it is going to fail, because kubectl get nodes does not report etcd nodes.

TASK [remove-node/remove-etcd-node : Lookup node IP in kubernetes] *****************************
changed: [ip-10-195-56-75.eu-west-1.compute.internal -> 10.195.56.82] => {"changed": true, "cmd": "/usr/local/bin/kubectl get nodes ip-10-195-56-75.eu-west-1.compute.internal -o jsonpath='{range.status.addresses[?(@.type==\"InternalIP\")]}{.address}{\"\n\"}{end}'", "delta": "0:00:00.004094", "end": "2020-08-26 09:13:22.294026", "failed_when_result": false, "msg": "non-zero return code", "rc": 127, "start": "2020-08-26 09:13:22.289932", "stderr": "/bin/sh: /usr/local/bin/kubectl: No such file or directory", "stderr_lines": ["/bin/sh: /usr/local/bin/kubectl: No such file or directory"], "stdout": "", "stdout_lines": []}
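The rc 127 in this log is the shell's "command not found" status, and failed_when: false hides it. As a rough sketch of checking for the binary before relying on its output (`require_cmd` is a hypothetical helper, not part of Kubespray; the path is the one shown in the log):

```shell
# Hypothetical helper: report a missing binary explicitly instead of
# letting the later command fail with rc 127 unnoticed.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "required command not found: $1" >&2
    return 127
  fi
}

bin_dir=/usr/local/bin   # path seen in the log above
if require_cmd "$bin_dir/kubectl" 2>/dev/null; then
  echo "kubectl available"
else
  echo "kubectl missing; the node IP cannot be looked up on this host"
fi
```

On a host without kubectl the second message is printed, which is the situation the delegated etcd node was in.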

Because the lookup fails, the Set node IP task sets an empty node IP:
TASK [remove-node/remove-etcd-node : Set node IP] ************************************
ok: [ip-10-195-56-75.eu-west-1.compute.internal] => {"ansible_facts": {"node_ip": ""}, "changed": false}

Looking up the member ID with an empty IP makes grep print a usage error, and that error is ignored (the task still reports rc 0). Even with ignore_errors: false, Ansible does not stop and continues with the next task:
TASK [remove-node/remove-etcd-node : Lookup etcd member id] ********************************
ok: [ip-10-195-56-75.eu-west-1.compute.internal -> 10.195.56.82] => {"changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync member list | grep | cut -d: -f1", "delta": "0:00:00.034797", "end": "2020-08-26 09:13:22.812420", "rc": 0, "start": "2020-08-26 09:13:22.777623", "stderr": "Usage: grep [OPTION]... PATTERN [FILE]...\nTry 'grep --help' for more information.", "stderr_lines": ["Usage: grep [OPTION]... PATTERN [FILE]...", "Try 'grep --help' for more information."], "stdout": "", "stdout_lines": []}
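A likely explanation for why ignore_errors makes no difference is visible in the log: the command returns rc 0. A shell pipeline reports the status of its last command, so the failing grep is masked by cut. A minimal sketch of that behaviour, assuming bash is available for the pipefail variant and using a made-up member line instead of a real etcdctl call:

```shell
# Made-up sample of `etcdctl --no-sync member list` output (not a real member ID).
sample='aaaa1111: name=etcd1 peerURLs=https://10.0.0.1:2380'
node_ip=""   # the empty value produced by "Set node IP"

# Same shape as the task's command; with an empty node_ip, grep gets no pattern.
pipeline="printf '%s\n' '$sample' | grep $node_ip 2>/dev/null | cut -d: -f1"

# Default behaviour: the pipeline's status is that of the last command (cut).
plain_rc=0
member_id=$(sh -c "$pipeline") || plain_rc=$?
echo "without pipefail: rc=$plain_rc member_id='$member_id'"

# With pipefail (a bash option), grep's failure is propagated.
pipefail_rc=0
bash -c "set -o pipefail; $pipeline" || pipefail_rc=$?
echo "with pipefail: rc=$pipefail_rc"
```

The first case ends with rc 0 and an empty member ID, matching the task output above; the second ends with a non-zero rc, which Ansible would treat as a failure unless told to ignore it.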

The Remove etcd member from cluster task is skipped because the member ID is not set. The last condition should be removed: not having the member ID should make the task fail rather than be skipped, because it means the node has not been removed from the cluster and it is not safe to remove the instance.
TASK [remove-node/remove-etcd-node : Remove etcd member from cluster] ****************************
skipping: [ip-10-195-56-75.eu-west-1.compute.internal] => {"changed": false, "skip_reason": "Conditional result was False"}
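The behaviour asked for here, failing instead of skipping when there is no member ID, could look roughly like this in shell. `lookup_member_id` is a hypothetical helper, and the sample line is made up; it only assumes the ID comes before the first colon, as the task's `cut -d: -f1` implies.

```shell
# Hypothetical helper: read `member list` output on stdin and print the ID of
# the member whose URLs contain the given IP. It fails loudly instead of
# returning an empty ID.
lookup_member_id() {
  ip=$1
  if [ -z "$ip" ]; then
    echo "node IP is empty; refusing to look up an etcd member" >&2
    return 1
  fi
  # -F: treat the IP as a fixed string so the dots are not regex wildcards.
  id=$(grep -F -- "//$ip:" | cut -d: -f1)
  if [ -z "$id" ]; then
    echo "no etcd member found for $ip; nothing was removed" >&2
    return 1
  fi
  printf '%s\n' "$id"
}

# Made-up sample line standing in for real etcdctl output.
sample='aaaa1111: name=etcd1 peerURLs=https://10.0.0.1:2380'

printf '%s\n' "$sample" | lookup_member_id "10.0.0.1"   # prints aaaa1111
printf '%s\n' "$sample" | lookup_member_id "" \
  || echo "lookup failed, so the removal step should abort"
```

With a guard like this, an empty IP or an unknown member stops the run before anyone assumes the instance is safe to delete.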

cloud_provider is set to empty rather than AWS because, with AWS, kube_override_hostname is just empty and some of the remove roles fail without being able to contact kubectl.

kind/bug

lacebal

All 3 comments

Adding ip={internal etcd node IP} to the inventory file fixed the problem, but the points stated are still valid: the task makes no sense, and failing to remove the node from the cluster should be a blocking failure.

lacebal on 26 Aug 2020

👍1

> Adding ip= to the inventory file fixed the problem, but the points stated are still valid: the task makes no sense, and failing to remove the node from the cluster should be a blocking failure.

Agreed, a failing task could be added.
Do you want to submit a PR, or should we?

floryut on 26 Aug 2020

Thanks for getting back to me. If you don't mind, I would prefer that you provide the PR, as I'm not confident enough with Ansible or Kubespray yet. If that's not possible, I can provide it myself.

I solved the failing point locally by removing the last condition (etcd_member_id.stdout | length > 0) from the when clause of the "Remove etcd member from cluster" task. I also tried to make the previous task, "Lookup etcd member id", fail, but without success, as I'm quite new to Ansible and ignore_errors: false didn't work for me.

The first task seems to make no sense with kube version 1.18.4 (maybe in an earlier version this task reported the correct IP).

lacebal on 26 Aug 2020

👍1
