Hi,
Calico Pod fails to start on one of the machines, and I can't figure out what the problem is.
The calico-node Pod fails to start on one of the nodes (node-1). It worked fine before, and then it suddenly stopped working.
Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 10.3.67.92,10.3.67.93
2019-10-02 14:51:02.472 [INFO][170] readiness.go 88: Number of node(s) with BGP peering established = 0
node-1 # calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+---------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |  INFO   |
+--------------+-------------------+-------+----------+---------+
| 10.3.67.92   | node-to-node mesh | start | 14:07:10 | Passive |
| 10.3.67.93   | node-to-node mesh | start | 14:07:10 | Passive |
+--------------+-------------------+-------+----------+---------+
IPv6 BGP status
No IPv6 peers found.
node-2 # calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.3.67.91   | node-to-node mesh | start | 10:42:13 | Active Socket: Connection      |
|              |                   |       |          | reset by peer                  |
| 10.3.67.93   | node-to-node mesh | up    | 10:42:14 | Established                    |
+--------------+-------------------+-------+----------+--------------------------------+
IPv6 BGP status
No IPv6 peers found.
node-3 # calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.3.67.91   | node-to-node mesh | start | 10:42:10 | Active Socket: Connection      |
|              |                   |       |          | reset by peer                  |
| 10.3.67.92   | node-to-node mesh | up    | 10:42:14 | Established                    |
+--------------+-------------------+-------+----------+--------------------------------+
IPv6 BGP status
No IPv6 peers found.
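The three status tables can be compared mechanically. Below is a minimal sketch, assuming the pipe-delimited layout shown above (this is what calicoctl printed here, not a documented stable format) and treating a row with an empty PEER ADDRESS cell as wrapped INFO text of the row above; `non_established_peers` is a hypothetical helper.

```python
def non_established_peers(status_text):
    """Return (peer address, state, info) for each IPv4 BGP peer not in state 'up'.

    Assumes the ASCII table printed by `calicoctl node status`. A row whose
    PEER ADDRESS cell is empty is folded into the INFO text of the row above.
    """
    peers = []
    for line in status_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # table borders ("+---") and plain text lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 5 or cells[0] == "PEER ADDRESS":
            continue  # header row or an unexpected layout
        address, _peer_type, state, _since, info = cells
        if not address and peers:
            # Wrapped INFO text, e.g. "reset by peer" on its own row.
            prev_addr, prev_state, prev_info = peers[-1]
            peers[-1] = (prev_addr, prev_state, f"{prev_info} {info}".strip())
        else:
            peers.append((address, state, info))
    return [peer for peer in peers if peer[1] != "up"]
```

For node-2's table it returns `[('10.3.67.91', 'start', 'Active Socket: Connection reset by peer')]`, and for node-3's table the same kind of entry for 10.3.67.91: every session that is down involves node-1, while node-2 and node-3 peer with each other.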
Calico was installed with the Kubespray Ansible playbook v2.11.0.
Calico version: v3.7.3
Thank you
I would suggest using nc (netcat) with the -v flag to connect from node-2 and node-3 to node-1 on TCP port 179, to see whether they can connect. You could also verify that node-1 can connect to itself.
If node-2 and node-3 can't connect, then I would check whether there are general connectivity problems between node-1 and the others.
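The same check can also be scripted with Python's standard `socket` module. A minimal sketch, where `can_connect` is a hypothetical helper and the default port 179 is the BGP port discussed here; like `nc -v`, it only reports whether the TCP handshake completes.

```python
import socket


def can_connect(host, port=179, timeout=3.0):
    """Return True if a TCP handshake with host:port completes, like `nc -v host port`."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and name-resolution errors all land here.
        return False
```

On node-2 and node-3, `can_connect("node-1")` corresponds to `nc -v node-1 179`; running it on node-1 itself covers the loopback check. A successful handshake only shows that the port is reachable, not which source address the other side sees.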
Thank you.
node-1# netstat -anpt | grep 179
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 9690/bird
node-2# netstat -anpt | grep 179
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 6746/bird
tcp 0 0 10.3.67.92:179 10.3.67.93:37147 ESTABLISHED 6746/bird
node-3# netstat -anpt | grep 179
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 9714/bird
tcp 0 0 10.3.67.93:37147 10.3.67.92:179 ESTABLISHED 9714/bird
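The netstat output already shows the asymmetry: node-2 and node-3 have an ESTABLISHED session with each other, while node-1 only has its listening socket. A sketch that extracts this from `netstat -anpt` output, assuming the column order shown above; it compares the port field exactly, whereas `grep 179` would also match ports such as 1790 or PIDs containing 179. `established_bgp_peers` is a hypothetical helper.

```python
def established_bgp_peers(netstat_text, bgp_port=179):
    """Return remote addresses of ESTABLISHED TCP sessions on the BGP port.

    Assumes `netstat -anpt` columns: Proto Recv-Q Send-Q Local Foreign State PID/Program.
    """
    peers = []
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp") or fields[5] != "ESTABLISHED":
            continue
        # rpartition splits on the last ':' so IPv6 addresses stay intact.
        local_port = fields[3].rpartition(":")[2]
        remote_ip, _, remote_port = fields[4].rpartition(":")
        if str(bgp_port) in (local_port, remote_port):
            peers.append(remote_ip)
    return peers
```

Applied to the three outputs above, it returns an empty list for node-1, `['10.3.67.93']` for node-2 and `['10.3.67.92']` for node-3.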
I ran the nc tests on port 179, but since that port is already in use (bird listens on it), I afterwards repeated the same tests on another random port (5543).
On each of node-1, node-2 and node-3:
# nc -l -v 179
nc: Address already in use
From all nodes to node-1:
# nc -vC node-1 179
Connection to node-1 179 port [tcp/bgp] succeeded!
Also, from node-1 to all nodes:
Connection to node-1 179 port [tcp/bgp] succeeded!
Connection to node-2 179 port [tcp/bgp] succeeded!
Connection to node-3 179 port [tcp/bgp] succeeded!
From node-2 to node-3 and vice versa:
node-2:~# nc -vC node-3 179
Connection to node-3 179 port [tcp/bgp] succeeded!
(binary output from bird, unreadable in the terminal; interrupted with ^C)
node-3:~# nc -vC node-2 179
Connection to node-2 179 port [tcp/bgp] succeeded!
(binary output from bird, unreadable in the terminal; interrupted with ^C)
Some environment-related clarifications:
10.3.67.222, which is in the same network as the virtual machines (10.3.67.0/24)
"Connection from --------- received!" The rest of the connections are instant.
root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-1 59626 received!
root@node-1:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!
---
root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 37306 received!
root@node-2:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!
---
root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 49860 received!
root@node-3:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!
---
root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from 10.3.67.222 40588 received!
root@node-1:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!
---
root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-2.cluster.local 48760 received!
root@node-2:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!
---
root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-3.cluster.local 34556 received!
root@node-3:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!
---
root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 42584 received!
root@node-1:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!
---
root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-2 48078 received!
root@node-2:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!
---
root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-3 34314 received!
root@node-3:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!
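These transcripts show that a successful connection is not the whole story: node-2's listener reports the connection from node-1 as coming from 10.3.67.222 instead of node-1's own address. A minimal sketch of that comparison, assuming the listener is started with `nc -n` so it prints numeric addresses instead of names like local-nuc2.local.lan; the node-to-IP mapping is inferred from the calicoctl and netstat output in this thread, and `rewritten_sources` is a hypothetical helper.

```python
import ipaddress

# Node addresses as they appear in the calicoctl and netstat output in this thread.
NODE_IPS = {"node-1": "10.3.67.91", "node-2": "10.3.67.92", "node-3": "10.3.67.93"}


def rewritten_sources(observations, node_ips=NODE_IPS):
    """Return (sender, expected IP, observed IP) for connections whose source changed.

    `observations` holds (sending node, numeric source address printed by the listener).
    ip_address() raises ValueError for host names, hence the assumption of `nc -n`.
    """
    rewritten = []
    for sender, observed in observations:
        expected = node_ips[sender]
        if ipaddress.ip_address(observed) != ipaddress.ip_address(expected):
            rewritten.append((sender, expected, observed))
    return rewritten
```

With the one numeric observation above, `rewritten_sources([("node-1", "10.3.67.222")])` flags the node-1 connection. The other listeners printed host names, so their source addresses would have to be collected again with `-n`.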
It seems weird that:
From the calico-node Pod logs on node-1:
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 35125)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 35113)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 57267)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 55655)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 41295)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 47905)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 44403)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 60587)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 47445)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 48867)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 59261)
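BIRD prints this warning when an incoming BGP connection does not come from any configured neighbor address, so it refuses the session. A small sketch that tallies the rejected source addresses from a log excerpt; `unknown_bgp_sources` is a hypothetical helper and the regular expression assumes the exact message wording shown above.

```python
import re
from collections import Counter

# BIRD's warning for an incoming BGP connection that matches no configured neighbor.
UNKNOWN_PEER = re.compile(r"Unexpected connect from unknown address (\S+) \(port \d+\)")


def unknown_bgp_sources(log_text):
    """Count rejected BGP connection attempts per source address."""
    return Counter(m.group(1) for m in UNKNOWN_PEER.finditer(log_text))
```

Fed the eleven lines above, it returns `Counter({'10.3.67.222': 11})`: every attempt arrives from 10.3.67.222, which is neither 10.3.67.92 nor 10.3.67.93, the peers node-1 expects.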
I agree, the way it's working seems weird. It looks like you have some networking to figure out on nuc2; sorry, I don't think I can be much help there. It seems like some NAT'ing is happening on nuc2 that doesn't happen on nuc1.
It's alright, I have to figure out the networking configuration on nuc2. Thanks for your help!
I'm going to close this issue for now. If you get the networking resolved and are still having issues, please open a new one or comment here and we can reopen it.
Found the culprit: a lazy, generic masquerade rule:
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0
Your hint about NATing was essential. Thank you!
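For anyone hunting for the same problem: `iptables -t nat -L` without `-v` does not show interface matches, so `iptables -t nat -S POSTROUTING` gives a more complete picture of each rule. The sketch below scans that output for MASQUERADE rules in POSTROUTING that carry no source, destination, interface, protocol or match-module restriction. It is a heuristic under those assumptions: rules in custom chains are ignored, and `catch_all_masquerade` is a hypothetical helper.

```python
import shlex

# Options that restrict which packets a rule applies to.
RESTRICTING = {"-s", "--source", "-d", "--destination", "-i", "--in-interface",
               "-o", "--out-interface", "-p", "--protocol"}


def catch_all_masquerade(rules_text):
    """Return POSTROUTING rules from `iptables -t nat -S` that MASQUERADE everything."""
    hits = []
    for line in rules_text.splitlines():
        tokens = shlex.split(line)
        # Only rules appended directly to POSTROUTING; custom chains are skipped
        # because earlier rules in them (e.g. -j RETURN) may already filter traffic.
        if tokens[:2] != ["-A", "POSTROUTING"] or "-j" not in tokens:
            continue
        j = tokens.index("-j")
        if tokens[j + 1 : j + 2] != ["MASQUERADE"]:
            continue
        restricted = False
        for i, tok in enumerate(tokens[2:j], start=2):
            if tok in RESTRICTING:
                restricted = True
            elif tok == "-m" and tokens[i + 1 : i + 2] != ["comment"]:
                restricted = True  # any match module except 'comment' narrows the rule
        if not restricted:
            hits.append(line.strip())
    return hits
```

Assuming the rule listed as `MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0` sits directly in POSTROUTING with no hidden interface match, it would presumably appear as `-A POSTROUTING -j MASQUERADE` and be reported, while a scoped rule such as `-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE` would not.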
Thanks a lot. The following fixed my problem on all worker nodes (I didn't run it on the master):
firewall-cmd --permanent --add-port=5543/tcp --zone=public
firewall-cmd --permanent --add-port=179/tcp --zone=public
firewall-cmd --reload
(I am using VirtualBox VMs to build a 3-node test k8s cluster.)