Kubespray: Scale does not work, cni NetworkNotReady

Created on 2 Mar 2019 · 20 comments · Source: kubernetes-sigs/kubespray

BUG REPORT

Install a Kubernetes cluster with 2 worker nodes, then add one more node using scale.yml.

After the Ansible playbook finishes, the new node is NotReady;
journalctl -xeu kubelet on that node reports NetworkNotReady.

Comparing nodes, the file
/etc/cni/net.d/10-calico.conflist
differs between the new node and the existing nodes...

Environment:

  • Cloud provider or hardware configuration:
    OpenStack
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Ubuntu 18.04

  • Version of Ansible (ansible --version):
    2.7.5

Kubespray version (commit) (git rev-parse --short HEAD):
master/ adf6a712

Network plugin used:
calico / v3.4.0

Copy of your inventory file:
[all]
k8s-master-1 ansible_ssh_host=192.168.16.211 access_ip=192.168.16.211
k8s-master-2 ansible_ssh_host=192.168.16.212 access_ip=192.168.16.212
k8s-master-3 ansible_ssh_host=192.168.16.213 access_ip=192.168.16.213
k8s-node-1 ansible_ssh_host=192.168.16.215 access_ip=192.168.16.215
k8s-node-2 ansible_ssh_host=192.168.16.216 access_ip=192.168.16.216
k8s-node-3 ansible_ssh_host=192.168.16.217 access_ip=192.168.16.217

[kube-master]
k8s-master-1
k8s-master-2
k8s-master-3

[kube-master:vars]
vip_address=192.168.16.210

[etcd]
k8s-master-1
k8s-master-2
k8s-master-3

[kube-node]
k8s-node-1
k8s-node-2
k8s-node-3

[k8s-cluster:children]
kube-master
kube-node

Command used to invoke ansible:
ansible-playbook -i inventory/mine/hosts.ini -u ubuntu -b scale.yml

Output of ansible run:
The playbook succeeded.

Anything else do we need to know:

I suspected calico_version was not defined correctly when the playbook runs, so I added a debug task next to the Calico config task:

- name: Calico | Write Calico cni config
  template:
    src: "cni-calico.conflist.j2"
    dest: "/etc/cni/net.d/{% if calico_version is version('v3.3.0', '>=') %}calico.conflist.template{% else %}10-calico.conflist{% endif %}"
    owner: kube

Debug version

- name: Debug Calico Version
  debug:
    msg: "Debug calico_version : {{calico_version}} , {{calico_version is version('v3.3.0', '>=') }}"

The output is:

ok: [k8s-node-3] => {
    "msg": "Debug calico_version : v3.4.0 , True"
}
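
For reference, the comparison the task's dest path relies on can be mimicked outside Ansible. A minimal sketch (`version_ge` is a hypothetical stand-in for Jinja2's `version` test, not the real implementation) showing why the task writes calico.conflist.template here instead of 10-calico.conflist:

```python
# Stand-in for the Jinja2 `version` test used in the task's dest path:
# compare dotted versions numerically, not lexically.
def version_ge(a: str, b: str) -> bool:
    parse = lambda v: tuple(int(x) for x in v.lstrip("v").split("."))
    return parse(a) >= parse(b)

# Mirrors the debug output above: v3.4.0 >= v3.3.0 is True, so the
# template goes to calico.conflist.template, while kubelet is still
# looking for a valid 10-calico.conflist.
dest = ("calico.conflist.template"
        if version_ge("v3.4.0", "v3.3.0")
        else "10-calico.conflist")
```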

Most helpful comment

I have found out that if I delete the calico pod from the new node (restart node/ restart calico pod/container on that node) the node becomes healthy and the proper file 10-calico.conflist is put in place.

All 20 comments

File on the new node
~~~
{
  "name": "k8s-pod-network",
  "type": "calico",
  "etcd_endpoints": "",
  "etcd_key_file": "",
  "etcd_cert_file": "",
  "etcd_ca_cert_file": "",
  "log_level": "warn",
  "ipam": {
    "type": "calico-ipam"
  },
  "policy": {
    "type": "k8s",
    "k8s_api_root": "https://10.233.0.1:443",
    "k8s_auth_token": "eyJXxxxxxxxxxxxx"
  },
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
  }
}
~~~

File on a working node
~~~
{
  "name": "cni0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "nodename": "k8s-node-3",
      "type": "calico",
      "etcd_endpoints": "https://192.168.16.211:2379,https://192.168.16.212:2379,https://192.168.16.213:2379",
      "etcd_cert_file": "/etc/calico/certs/cert.crt",
      "etcd_key_file": "/etc/calico/certs/key.pem",
      "etcd_ca_cert_file": "/etc/calico/certs/ca_cert.crt",
      "log_level": "info",
      "ipam": {
        "type": "calico-ipam",
        "assign_ipv4": "true",
        "ipv4_pools": ["10.233.64.0/18"]
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
~~~

I use calico with the following settings
~~~

peer_with_router: true
peers: []

~~~

And I manually set up routes in the OpenStack router.

kubectl describe of the new node
~~~
SufficientPID kubelet has sufficient PID available
Ready False Mon, 04 Mar 2019 08:32:05 +0000 Mon, 04 Mar 2019 08:29:04 +0000 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
~~~

Kubelet logs

W0304 08:52:20.254027   18127 cni.go:149] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: no 'plugins' key
W0304 08:52:20.254057   18127 cni.go:203] Unable to update cni config: No valid networks found in /etc/cni/net.d
E0304 08:52:20.254191   18127 kubelet.go:2192] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
W0304 08:52:25.255735   18127 cni.go:149] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: no 'plugins' key
W0304 08:52:25.255774   18127 cni.go:203] Unable to update cni config: No valid networks found in /etc/cni/net.d
E0304 08:52:25.255961   18127 kubelet.go:2192] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
I0304 08:52:27.453683   18127 setters.go:72] Using node IP: "192.168.16.218"
W0304 08:52:30.265524   18127 cni.go:149] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: no 'plugins' key
W0304 08:52:30.265548   18127 cni.go:203] Unable to update cni config: No valid networks found in /etc/cni/net.d
E0304 08:52:30.265719   18127 kubelet.go:2192] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
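
kubelet's "no 'plugins' key" complaint can be reproduced with a short check. A sketch (the `broken_conflists` helper is illustrative, not part of kubespray or kubelet) that flags any .conflist lacking the top-level plugins list kubelet requires, which is exactly the state the freshly scaled node is left in:

```python
import json
from pathlib import Path

def broken_conflists(net_d):
    """Return .conflist files in net_d that kubelet would reject for
    lacking a top-level 'plugins' list."""
    bad = []
    for path in sorted(Path(net_d).glob("*.conflist")):
        try:
            config = json.loads(path.read_text())
        except json.JSONDecodeError:
            bad.append(path.name)  # unparsable counts as broken too
            continue
        if not isinstance(config.get("plugins"), list):
            bad.append(path.name)
    return bad

# Usage on an affected node:
#   broken_conflists("/etc/cni/net.d")
# would list 10-calico.conflist until the file is rewritten.
```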

I have found out that if I delete the calico pod from the new node (restart node/ restart calico pod/container on that node) the node becomes healthy and the proper file 10-calico.conflist is put in place.

I'm having the same problem

Same error here when adding a new node with scale.yml:

W0327 18:51:54.542772   59070 cni.go:149] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: no 'plugins' key
W0327 18:51:54.542789   59070 cni.go:203] Unable to update cni config: No valid networks found in /etc/cni/net.d
E0327 18:51:54.542854   59070 kubelet.go:2192] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin

Can confirm that manually deleting the calico-node pod on affected nodes fixes it.

Same issue here, same fix for me.

Cloud Provider: Packet
OS: CentOS 7
Ansible: 2.7.10

Just hit this problem too.

We have the same issue not only when scaling, but also when upgrading Calico from v3.1.3 to v3.4.0 with upgrade-cluster.yml.
It only happens on the worker nodes; the masters are updated OK.

Apr 15 14:22:45 caas-xavi-dev-lb-01 kubelet[21136]: W0415 14:22:45.150469   21136 cni.go:149] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: no 'plugins' key
Apr 15 14:22:45 caas-xavi-dev-lb-01 kubelet[21136]: W0415 14:22:45.150495   21136 cni.go:203] Unable to update cni config: No valid networks found in /etc/cni/net.d
Apr 15 14:22:45 caas-xavi-dev-lb-01 kubelet[21136]: E0415 14:22:45.150609   21136 kubelet.go:2192] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Manually deleting the calico-node pod fixes the problem

Before deleting the pod:

ls -l /etc/cni/net.d
total 12
-rw-rw-r-- 1 root root 1315 Apr 15 14:04 10-calico.conflist
-rw------- 1 root root 2566 Apr 15 14:04 calico-kubeconfig
-rw-r--r-- 1 kube root  801 Apr 15 14:06 calico.conflist.template

After deleting the pod:

ls -l /etc/cni/net.d
total 12
-rw-r--r-- 1 root root  810 Apr 15 15:23 10-calico.conflist
-rw------- 1 root root 2566 Apr 15 15:23 calico-kubeconfig
-rw-r--r-- 1 kube root  801 Apr 15 14:06 calico.conflist.template

@mirwan @wangxf1987 do you think this can be related with #4102 and the new template name? https://github.com/kubernetes-sigs/kubespray/pull/4102/files#diff-f336e04badfa2e647399ccaeed760b13R5

Also having this issue with an improper config in /etc/cni/net.d/10-canal.conflist (with kube_network_plugin: canal). Restarting the canal-node-xxxxx pod solves the issue.

kubespray version: 2.10.0

We just hit the problem as well here, upgrading from 2.8.5 to 2.9.0 with canal.

Having the same problem here. Deleting the containers works fine for me too; posting the command I used to give y'all a shortcut.

sudo docker ps --all --filter 'name=.*calico.*' --no-trunc --format '{{.ID}}' | xargs sudo docker rm -f

Also hit this issue, but deleting the calico pod running on the new worker node did not fix it for me.

I'm hitting the same issue while upgrading from kubespray 2.8.x to 2.9.0 and from k8s 1.12.9 to 1.13.5.
Deleting the pods solves the problem.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

