Environment:
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Version of Ansible (ansible --version):
ansible 2.9.6
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/managedxd/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/managedxd/.local/lib/python3.6/site-packages/ansible
executable location = /usr/local/bin/ansible
python version = 3.6.8 (default, Apr 2 2020, 13:34:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Version of Python (python --version):
Python 2.7.5
Kubespray version (commit) (git rev-parse --short HEAD):
5e22574402d86a7087fb146fc6f7b32c8fa80088 ../../3rdparty/kubespray (v2.1.0-3949-g5e225744)
Network plugin used:
Calico
Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
https://gist.github.com/lacebal/554ea05590d459ced735af86fef2583c
Command used to invoke ansible:
ansible-playbook -v -i kubernetes-inventory -e '{"cloud_provider":"","delete_nodes_confirmation":"yes","node":"ip-10-195-56-75.eu-west-1.compute.internal","reset_nodes":false}' -e '{"ansible_python_interpreter":"/usr/bin/python2.7","ansible_user":"centos","kube_apiserver_port":"8081"}' --become --become-user=root kubespray/remove-node.yml
Output of ansible run:
https://gist.github.com/lacebal/239609ad06bc7dd1dfb28f2dab70ca0a
Anything else do we need to know:
The etcd node is never removed from the cluster and, worse, the process finishes OK even though the node has not been removed, which renders the cluster totally unusable if the EC2 instance is then removed.
The role remove-etcd-node/tasks/main.yml is behaving suspiciously:
    ---
    - name: Set node IP
      set_fact:
        node_ip: "{{ ip | default(access_ip | default(remove_node_ip.stdout)) | trim }}"
      when:

    - name: Lookup etcd member id
      shell: "{{ bin_dir }}/etcdctl --no-sync member list | grep {{ node_ip }} | cut -d: -f1"
      register: etcd_member_id
      ignore_errors: true
      changed_when: false
      check_mode: no
      tags:

    - name: Remove etcd member from cluster
      shell: "{{ bin_dir }}/etcdctl --no-sync member remove {{ etcd_member_id.stdout }}"
      register: etcd_member_in_cluster
      ignore_errors: false
      retries: 6
      delay: 5
      until: etcd_member_in_cluster.rc == 0
      changed_when: false
      check_mode: no
      tags:
kubectl is not installed on etcd nodes, so if the etcd node is not on the same machine as the Kubernetes master, this is going to fail. Even if executed on a k8s master it is going to fail, as kubectl get nodes does not report etcd nodes.
TASK [remove-node/remove-etcd-node : Lookup node IP in kubernetes] *****************************
changed: [ip-10-195-56-75.eu-west-1.compute.internal -> 10.195.56.82] => {"changed": true, "cmd": "/usr/local/bin/kubectl get nodes ip-10-195-56-75.eu-west-1.compute.internal -o jsonpath='{range.status.addresses[?(@.type==\"InternalIP\")]}{.address}{\"\n\"}{end}'", "delta": "0:00:00.004094", "end": "2020-08-26 09:13:22.294026", "failed_when_result": false, "msg": "non-zero return code", "rc": 127, "start": "2020-08-26 09:13:22.289932", "stderr": "/bin/sh: /usr/local/bin/kubectl: No such file or directory", "stderr_lines": ["/bin/sh: /usr/local/bin/kubectl: No such file or directory"], "stdout": "", "stdout_lines": []}
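The rc of 127 in the output above is the shell's conventional exit status for a command that cannot be found, and the `failed_when_result: false` field shows why the play carries on regardless. A minimal sketch that reproduces the exit status, assuming a made-up path (`/nonexistent/bin/kubectl`) in place of the real one:

```shell
# Hypothetical path that does not exist on this machine.
kubectl_bin="/nonexistent/bin/kubectl"

# Run the command string through sh -c, roughly as a shell task would.
sh -c "$kubectl_bin get nodes" 2>/dev/null
rc=$?
echo "rc=$rc"

# A pre-check that makes the missing binary explicit instead of relying on rc.
if [ -x "$kubectl_bin" ]; then
  kubectl_present=yes
else
  kubectl_present=no
fi
echo "kubectl present: $kubectl_present"
```

A check like the last one could be used to fail early on hosts where kubectl is absent, rather than letting an empty result flow into later tasks.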
Because the previous command fails, Set node IP sets an empty node IP:
TASK [remove-node/remove-etcd-node : Set node IP] ************************************
ok: [ip-10-195-56-75.eu-west-1.compute.internal] => {"ansible_facts": {"node_ip": ""}, "changed": false}
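The empty `node_ip` follows from the fallback chain in the `set_fact` expression: Jinja2's `default()` only substitutes when a value is undefined, so an empty `remove_node_ip.stdout` is passed through unchanged. The sketch below is only an analogy in shell, where `${var-fallback}` likewise substitutes only when the variable is unset; the shell variable names mirror the Ansible ones and the address is made up.

```shell
# Analogy only: shell variables stand in for the Ansible variables.
unset ip access_ip            # neither is defined in the inventory
remove_node_ip_stdout=""      # kubectl was not found, so stdout is empty

# Falls through ip, then access_ip, and ends on the empty lookup result.
node_ip=${ip-${access_ip-$remove_node_ip_stdout}}
echo "node_ip='$node_ip'"

# With ip defined, the lookup result is never consulted.
ip="10.0.0.11"                # hypothetical address
node_ip_fixed=${ip-${access_ip-$remove_node_ip_stdout}}
echo "node_ip='$node_ip_fixed'"
```

This matches the observation that defining `ip` for the host avoids the problem, since the failed kubectl lookup is then never used.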
Looking up the member id with an empty IP returns an error that is ignored. Even with ignore_errors: false, Ansible ignores the error and continues with the next task:
TASK [remove-node/remove-etcd-node : Lookup etcd member id] ********************************
ok: [ip-10-195-56-75.eu-west-1.compute.internal -> 10.195.56.82] => {"changed": false, "cmd": "/usr/local/bin/etcdctl --no-sync member list | grep | cut -d: -f1", "delta": "0:00:00.034797", "end": "2020-08-26 09:13:22.812420", "rc": 0, "start": "2020-08-26 09:13:22.777623", "stderr": "Usage: grep [OPTION]... PATTERN [FILE]...\nTry 'grep --help' for more information.", "stderr_lines": ["Usage: grep [OPTION]... PATTERN [FILE]...", "Try 'grep --help' for more information."], "stdout": "", "stdout_lines": []}
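One likely reason the task reports `rc: 0` regardless of `ignore_errors` is that a shell pipeline returns the exit status of its last command, here `cut`, so the `grep` usage error never reaches Ansible. A minimal sketch of that behaviour, assuming a stand-in function `list_members` with made-up member data in place of `etcdctl --no-sync member list`:

```shell
# Hypothetical stand-in for `etcdctl --no-sync member list`; the data is made up.
list_members() {
  echo '8e9e05c52164694d: name=etcd1 peerURLs=https://10.0.0.11:2380 clientURLs=https://10.0.0.11:2379 isLeader=true'
}

node_ip=""   # empty, as produced by the "Set node IP" task

# The unquoted empty variable leaves grep without a pattern, so grep fails,
# but the pipeline's status is that of cut, which succeeds on empty input.
member_id=$(list_members | grep $node_ip 2>/dev/null | cut -d: -f1)
rc=$?

echo "rc=$rc member_id='$member_id'"
```

Since the return code cannot be relied on here, checking that the captured output is non-empty is a more dependable way to detect the failure.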
Remove etcd member from cluster is skipped because the member id is not set. The last condition should be removed: a missing member id should cause a failure rather than be ignored, since it means the node has not been removed from the cluster and it is not safe to remove the instance.
TASK [remove-node/remove-etcd-node : Remove etcd member from cluster] ****************************
skipping: [ip-10-195-56-75.eu-west-1.compute.internal] => {"changed": false, "skip_reason": "Conditional result was False"}
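The behaviour asked for here, failing instead of skipping when no member id was found, can be sketched as a small guard. This is an illustration only, not the role's code: `require_member_id` is a hypothetical helper and the member id is made up.

```shell
# Hypothetical helper: succeed only when a non-empty member id was found.
require_member_id() {
  if [ -z "$1" ]; then
    echo "etcd member id not found; node was NOT removed from the cluster" >&2
    return 1
  fi
  echo "would remove etcd member $1"
}

require_member_id "" 2>/dev/null
rc_empty=$?

require_member_id "8e9e05c52164694d" >/dev/null   # made-up id
rc_found=$?

echo "empty id -> rc=$rc_empty, found id -> rc=$rc_found"
```

With a guard of this kind the run stops before anyone assumes the node is safe to terminate.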
The cloud_provider is set to empty instead of AWS because kube_override_hostname is just empty and some remove roles simply fail, being unable to contact kubectl.
Adding ip={internal etcd node IP} to the inventory file fixed the problem, but the points stated are still valid, as the task makes no sense and failing to remove the node from the cluster should stop the run.
Agree with you, a failing task could be added.
Do you want to submit a PR, or should we?
Thanks for reporting back to me. If you don't mind, I would prefer that you provide the PR, as I'm not confident enough with Ansible or Kubespray yet. If that is not possible, I can provide it myself.
I solved the failing point locally by removing the last condition (etcd_member_id.stdout | length > 0) from the when clause of the "Remove etcd member from cluster" task. I also tried to fail on the previous task, "Lookup etcd member id", but without success, as I'm quite new to Ansible and ignore_errors: false didn't work for me.
The first task seems to make no sense with kube version 1.18.4 (maybe in a previous version this task reported the correct IP).