Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Environment:
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): Debian Stretch
Version of Ansible (ansible --version): 2.4.1.0
Kubespray version (commit) (git rev-parse --short HEAD): 2.5.0
Network plugin used: calico
Command used to invoke ansible: ansible-playbook upgrade-cluster.yml
Output of ansible run: Succeeded
Anything else we need to know:
I upgraded from Kubespray version 2.3.0 to 2.5.0 by using the upgrade-cluster.yml playbook. Everything went fine, but all calico pods moved into CrashLoopBackOff. The logs were showing the following:
Calico node 'node1' is already using the IPv4 address <IP>.
After some investigation I found out that the upgrade had changed the node names from node1 to node1.domain.com, i.e. from pure hostname to FQDN. But the calico nodes were still registered under the old names.
A kubectl get nodes also showed both versions of each node, pure hostname and FQDN; the hostname variants still had the pre-upgrade kubelet version and were in state "NotReady". Naively, I ran kubectl delete node on all the old nodes. That, of course, didn't fix the problem.
After some more investigation I tried running calicoctl to delete the old nodes from etcd. However, the old nodes didn't show up in calicoctl get nodes, only the new ones.
What finally solved the problem for me: I changed the node names back from FQDN to the short hostname in /etc/kubernetes/kubelet.env and /etc/cni/net.d/10-calico.conflist and restarted the kubelet on all nodes. The calico pods started up again.
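For reference, the manual fix can be sketched as a small shell helper. The helper and its name are illustrative, not part of Kubespray; the file paths in the comment are the Kubespray defaults mentioned above.

```shell
#!/bin/sh
# Illustrative helper (an assumption, not a Kubespray tool): rewrite an
# FQDN back to the short hostname in the given config files, in place.
fix_nodename() {
    fqdn=$1
    short=$2
    shift 2
    for f in "$@"; do
        # Replace every occurrence of the FQDN with the short name.
        # (Dots in the FQDN act as regex wildcards here; acceptable
        # for a sketch since they still match the literal dots.)
        sed -i "s/${fqdn}/${short}/g" "$f"
    done
}

# On each node you would then run something like:
#   fix_nodename "$(hostname -f)" "$(hostname -s)" \
#       /etc/kubernetes/kubelet.env /etc/cni/net.d/10-calico.conflist
#   systemctl restart kubelet
```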
I'm pretty sure this commit caused the problem for me: https://github.com/kubernetes-incubator/kubespray/commit/ad9049a49ed44104558ac8820f1ca3e51c29088b
Replacing ansible_hostname with inventory_hostname changed my node names to FQDNs, as they are defined in my inventory.
Thank you @danielm0hr for pointing me in the right direction and providing the workaround. I have "reverted" the changes from https://github.com/kubernetes-incubator/kubespray/commit/ad9049a49ed44104558ac8820f1ca3e51c29088b in my v2.5.0 working copy and it helped.
Any ideas how to proceed in the long term? @rzenker, how can the existing cluster be migrated to the new FQDN hostnames?
Please see my PR above. I think it wouldn't hurt to make the calico nodename overridable.
With that change you can just set
kube_override_hostname: "{{ ansible_hostname }}"
calico_baremetal_nodename: "{{ ansible_hostname }}"
to keep the old names for calico nodes and kubelets.
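Concretely, a minimal sketch of such an override in an inventory group_vars file (the file path is an assumption; any group_vars file covering the k8s nodes would work):

```yaml
# e.g. inventory/mycluster/group_vars/all.yml (path is an assumption)
# Keep the short hostnames for the kubelets and calico nodes instead of
# the FQDNs that inventory_hostname would yield.
kube_override_hostname: "{{ ansible_hostname }}"
calico_baremetal_nodename: "{{ ansible_hostname }}"
```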
I had the same problem even with a fresh installation of Kubespray 2.5.0 on bare metal (DO).
And this workaround fixed the problem.
As this seems fixed, closing the issue.