Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
Environment:
printf "$(uname -srm)\n$(cat /etc/os-release)\n"):Linux 4.15.0-1039-azure x86_64
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
`ansible --version`:

```
ansible 2.7.1
  config file = /home/mau/.ansible.cfg
  configured module search path = ['/home/mau/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib64/python3.5/site-packages/ansible
  executable location = /usr/lib/python-exec/python3.5/ansible
  python version = 3.5.4 (default, Feb 2 2018, 10:07:56) [GCC 6.4.0]
```
Kubespray version (commit) (`git rev-parse --short HEAD`): 3901480b
Network plugin used: cloud
Copy of your inventory file:

```ini
[all]
master-0 ansible_ssh_host=10.100.2.4 ip=10.100.2.4
master-1 ansible_ssh_host=10.100.2.5 ip=10.100.2.5
master-2 ansible_ssh_host=10.100.2.6 ip=10.100.2.6
minion-0 ansible_ssh_host=10.100.3.4 ip=10.100.3.4
minion-1 ansible_ssh_host=10.100.3.5 ip=10.100.3.5
minion-2 ansible_ssh_host=10.100.3.6 ip=10.100.3.6
minion-3 ansible_ssh_host=10.100.3.7 ip=10.100.3.7
minion-4 ansible_ssh_host=10.100.3.8 ip=10.100.3.8
minion-5 ansible_ssh_host=10.100.3.9 ip=10.100.3.9
minion-6 ansible_ssh_host=10.100.3.10 ip=10.100.3.10
minion-7 ansible_ssh_host=10.100.3.11 ip=10.100.3.11
minion-8 ansible_ssh_host=10.100.3.12 ip=10.100.3.12

[kube-master]
master-0
master-1
master-2

[etcd]
master-0
master-1
master-2

[kube-node]
minion-0
minion-1
minion-2
minion-3
minion-4
minion-5
minion-6
minion-7
minion-8

[k8s-cluster:children]
kube-node
kube-master
```
Command used to invoke ansible:

```shell
ansible-playbook -i hosts.ini -e "@./kubespray/inventory/group_vars/etcd.yml" -e "@./kubespray/inventory/group_vars/*.yaml" --ssh-extra-args=-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o "ForwardAgent yes" -o ProxyCommand="ssh -o StrictHostKeyChecking=no -W %h:%p -q [email protected]" -u user -b ./kubespray/upgrade-cluster.yml
```
Output of ansible run:

```
TASK [container-engine/docker : Copy docker orphan clean up script to the node] ************************************************************************
Tuesday 02 April 2019 15:44:40 +0200 (0:00:01.229) 0:56:50.637 *********

TASK [container-engine/docker : Write docker orphan clean up systemd drop-in] **************************************************************************
Tuesday 02 April 2019 15:44:40 +0200 (0:00:00.240) 0:56:50.877 *********

RUNNING HANDLER [container-engine/docker : restart docker] *********************************************************************************************
Tuesday 02 April 2019 15:44:40 +0200 (0:00:00.160) 0:56:51.037 *********
changed: [minion-7]
changed: [minion-8]

RUNNING HANDLER [container-engine/docker : Docker | reload systemd] ************************************************************************************
Tuesday 02 April 2019 15:44:41 +0200 (0:00:00.696) 0:56:51.734 *********
changed: [minion-7]
changed: [minion-8]

RUNNING HANDLER [container-engine/docker : Docker | reload docker.socket] ******************************************************************************
Tuesday 02 April 2019 15:44:41 +0200 (0:00:00.782) 0:56:52.517 *********

RUNNING HANDLER [container-engine/docker : Docker | reload docker] *************************************************************************************
Tuesday 02 April 2019 15:44:42 +0200 (0:00:00.229) 0:56:52.746 *********
changed: [minion-7]
changed: [minion-8]

RUNNING HANDLER [container-engine/docker : Docker | pause while Docker restarts] ***********************************************************************
Tuesday 02 April 2019 15:45:04 +0200 (0:00:22.810) 0:57:15.556 *********
Pausing for 10 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
[container-engine/docker : Docker | pause while Docker restarts]
Waiting for docker restart:
ok: [minion-7]

RUNNING HANDLER [container-engine/docker : Docker | wait for docker] ***********************************************************************************
Tuesday 02 April 2019 15:45:15 +0200 (0:00:10.219) 0:57:25.775 *********
changed: [minion-7]
changed: [minion-8]

TASK [container-engine/docker : ensure docker service is started and enabled] **************************************************************************
Tuesday 02 April 2019 15:45:17 +0200 (0:00:02.001) 0:57:27.777 *********
ok: [minion-7] => (item=docker)
ok: [minion-8] => (item=docker)
```
Symptom:
The playbook restarts the Docker service without draining the node.
Expected behavior:
The node should be drained before the Docker service is restarted.
Hi @mvalenzisiAK,
you can probably use my workaround to avoid this:
https://github.com/kubernetes-sigs/kubespray/issues/4397#issuecomment-479931615
/kind bug
A possible solution is to add `live-restore` as a kubespray parameter, with `true` as the default.
Only with Flannel is this parameter set to `false` (according to the kubespray automation).
https://docs.docker.com/config/containers/live-restore/
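For reference, live restore is enabled through the Docker daemon configuration. A minimal sketch of `/etc/docker/daemon.json` (assuming no other daemon options are set; otherwise merge the key into your existing file):

```json
{
  "live-restore": true
}
```

With this set, containers keep running while the daemon itself restarts. You can verify it on a running daemon with `docker info --format '{{ .LiveRestoreEnabled }}'`, which prints `true` or `false`.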
@PierluigiLenociAkelius is it a requirement for flannel?
> Only with Flannel this parameter is set to false
@killmeplz I don't know exactly. I presume so, based on this:
https://github.com/kubernetes-sigs/kubespray/blob/3d2ea28c965a5cabe12774b130c9070e6bda2aeb/roles/network_plugin/flannel/handlers/main.yml#L34-L39
Moreover, in those lines someone wrote that flannel needs the parameter set to true, but it is set to false.
That task is guarded by `when: is_atomic`, so it wouldn't apply to all distros.
Is there a reason why draining the current node is not one of the very first actions in the playbook?
> Hi @mvalenzisiAK
> Probably you can use my workaround to avoid this
> #4397 (comment)

@killmeplz Thanks for the tip, it works!
_Graceful upgrade_ using the upgrade-cluster.yml playbook restarts docker prior to draining the node.
Upgrade path: 1.12.5 -> 1.13.5
Is there a reason for this?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
@Miouge1 @mattymo any news about this bug?
When I did a recent upgrade using 2.11, it looked like draining happened automatically; from `roles/upgrade/pre-upgrade/tasks/main.yml`:

```yaml
- name: Drain node
  command: >-
    {{ bin_dir }}/kubectl drain
    --force
    --ignore-daemonsets
    --grace-period {{ drain_grace_period }}
    --timeout {{ drain_timeout }}
    --delete-local-data {{ inventory_hostname }}
    {% if drain_pod_selector %}--pod-selector '{{ drain_pod_selector }}'{% endif %}
  delegate_to: "{{ groups['kube-master'][0] }}"
  when:
    - drain_nodes
    - needs_cordoning
```
Is this issue still relevant or is it fixed now?
This issue is still present in 2.12, because in upgrade-cluster.yml the `container-engine` role (where docker is restarted) is called in a play (let's call it play A) which runs before the plays where the node is drained (via the `upgrade/pre-upgrade` role in the `Handle upgrades to master...` and `Finally handle worker upgrades...` plays).
Is there a reason the `container-engine` role from play A couldn't be moved into the master and worker upgrade plays? Then play A could be un-serialized (made parallel), which would speed up the upgrade.
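To illustrate, here is a hypothetical sketch of what that reordering could look like (play and role names are taken from the discussion above; the real playbook's structure differs, and this is not a tested change):

```yaml
# Hypothetical reordering: run container-engine inside the serialized
# upgrade play, after the node has been drained, instead of in the
# separate earlier play (play A).
- name: Finally handle worker upgrades, based on given batch size
  hosts: kube-node
  serial: "{{ serial | default('20%') }}"   # upgrade in batches
  roles:
    - { role: upgrade/pre-upgrade }   # cordon and drain the node first
    - { role: container-engine }      # docker restart now hits a drained node
    - { role: kubernetes/node }
    - { role: upgrade/post-upgrade }  # uncordon the node
```

With `container-engine` moved here, the docker restart only ever happens on a node whose workloads have already been evicted.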