Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Environment:
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): Debian Stretch
Version of Ansible (ansible --version): 2.4.1.0
Kubespray version (commit) (git rev-parse --short HEAD): 2.5.0
Network plugin used: calico
Command used to invoke ansible: ansible-playbook upgrade-cluster.yml
Output of ansible run: Succeeded
Anything else we need to know:
I upgraded from Kubespray version 2.3.0 to 2.5.0 by using the upgrade-cluster.yml playbook. Everything went fine, but all calico pods moved into CrashLoopBackOff. The logs were showing the following:
Calico node 'node1' is already using the IPv4 address <IP>.
After some investigation I found out that the upgrade had changed the node names from node1 to node1.domain.com, i.e. from pure hostname to FQDN. But the calico nodes were still registered under the old names.
A kubectl get nodes also showed both versions of each node, pure hostname and FQDN; the hostname variants still had the pre-upgrade kubelet version and were in state "NotReady". Naively, I ran kubectl delete node on all the old nodes. That, of course, didn't fix the problem.
After some more investigation I tried running calicoctl to delete the old nodes from etcd. However, the old nodes didn't show up in calicoctl get nodes, only the new ones.
What finally solved the problem for me: I changed the node names back from FQDN to the short hostname in /etc/kubernetes/kubelet.env and /etc/cni/net.d/10-calico.conflist and restarted the kubelet on all nodes. The calico pods started up again.
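For reference, the manual fix can be sketched as a small shell helper. The helper and its name are illustrative, not part of Kubespray; the file paths in the comment are the Kubespray defaults mentioned above.

```shell
#!/bin/sh
# Illustrative helper (an assumption, not a Kubespray tool): rewrite an
# FQDN back to the short hostname in the given config files, in place.
fix_nodename() {
    fqdn=$1
    short=$2
    shift 2
    for f in "$@"; do
        # Replace every occurrence of the FQDN with the short name.
        # (Dots in the FQDN act as regex wildcards here; acceptable
        # for a sketch since they still match the literal dots.)
        sed -i "s/${fqdn}/${short}/g" "$f"
    done
}

# On each node you would then run something like:
#   fix_nodename "$(hostname -f)" "$(hostname -s)" \
#       /etc/kubernetes/kubelet.env /etc/cni/net.d/10-calico.conflist
#   systemctl restart kubelet
```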
I'm pretty sure this commit caused the problem for me: https://github.com/kubernetes-incubator/kubespray/commit/ad9049a49ed44104558ac8820f1ca3e51c29088b
Replacing ansible_hostname with inventory_hostname changed my node names to FQDNs, as they are defined in my inventory.
Thank you @danielm0hr for pointing me in the right direction and providing the workaround. I have "reverted" the changes from https://github.com/kubernetes-incubator/kubespray/commit/ad9049a49ed44104558ac8820f1ca3e51c29088b in my v2.5.0 working copy and it helped.
Any ideas how to proceed in the long term? @rzenker, how can the existing cluster be migrated to the new FQDN hostnames?
Please see my PR above. I think it wouldn't hurt to make the calico nodename overridable.
With that change you can just set
kube_override_hostname: "{{ ansible_hostname }}"
calico_baremetal_nodename: "{{ ansible_hostname }}"
to keep the old names for calico nodes and kubelets.
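Concretely, a minimal sketch of such an override in an inventory group_vars file (the file path is an assumption; any group_vars file covering the k8s nodes would work):

```yaml
# e.g. inventory/mycluster/group_vars/all.yml (path is an assumption)
# Keep the short hostnames for the kubelets and calico nodes instead of
# the FQDNs that inventory_hostname would yield.
kube_override_hostname: "{{ ansible_hostname }}"
calico_baremetal_nodename: "{{ ansible_hostname }}"
```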
I had the same problem even with a fresh installation of Kubespray 2.5.0 on bare metal (DO).
And this workaround fixed the problem.
As this seems fixed, closing the issue.