Calico: Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established

Created on 3 Oct 2019  ·  7 Comments  ·  Source: projectcalico/calico

Hi,
The calico-node Pod fails to start on one of the machines, and I can't figure out what the problem is.

Current Behavior


The calico-node Pod fails to start on one of the nodes (node-1). It worked fine before, then suddenly stopped working.

Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 10.3.67.92,10.3.67.93
2019-10-02 14:51:02.472 [INFO][170] readiness.go 88: Number of node(s) with BGP peering established = 0

node-1 # calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+---------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |  INFO   |
+--------------+-------------------+-------+----------+---------+
| 10.3.67.92   | node-to-node mesh | start | 14:07:10 | Passive |
| 10.3.67.93   | node-to-node mesh | start | 14:07:10 | Passive |
+--------------+-------------------+-------+----------+---------+

IPv6 BGP status
No IPv6 peers found.

node-2 # calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.3.67.91   | node-to-node mesh | start | 10:42:13 | Active Socket: Connection      |
|              |                   |       |          | reset by peer                  |
| 10.3.67.93   | node-to-node mesh | up    | 10:42:14 | Established                    |
+--------------+-------------------+-------+----------+--------------------------------+

IPv6 BGP status
No IPv6 peers found.

node-3 # calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.3.67.91   | node-to-node mesh | start | 10:42:10 | Active Socket: Connection      |
|              |                   |       |          | reset by peer                  |
| 10.3.67.92   | node-to-node mesh | up    | 10:42:14 | Established                    |
+--------------+-------------------+-------+----------+--------------------------------+

IPv6 BGP status
No IPv6 peers found.
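
To compare the three nodes at a glance, the peer tables above can be reduced to just the peers whose session is not up. The following is a minimal sketch rather than anything from the original report: the function name unestablished_peers is made up for illustration, and it assumes the table layout printed by calicoctl node status as shown above.

```shell
# Hypothetical helper: reads `calicoctl node status` output on stdin and
# prints "<peer address> <state>" for every peer whose STATE is not "up".
unestablished_peers() {
  awk -F'|' '
    {
      addr = $2; state = $4
      gsub(/ /, "", addr); gsub(/ /, "", state)
      # Only rows with an IPv4 address in the first column are peers;
      # borders, headers and wrapped INFO lines are skipped.
      if (addr ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && state != "up")
        print addr, state
    }'
}
```

Run as `calicoctl node status | unestablished_peers` on each node. With the node-1 table above it would list both 10.3.67.92 and 10.3.67.93 in state start, while node-2 and node-3 would each list only 10.3.67.91.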

Your Environment

  • We have 3 virtual machines installed with libvirt, running on 2 physical machines:

    • node-1 on physical machine nuc2

    • node-2 & node-3 on physical machine nuc1

  • all physical and virtual machines run Ubuntu 18.04.3 LTS

    • nuc1 has a fresh install of Ubuntu 18.04.3 LTS

    • nuc2 had Ubuntu 16.04 & was upgraded to Ubuntu 18.04.3 LTS

  • No firewall activated on port 179
  • Calico is installed with Kubespray ansible playbook v2.11.0

  • Calico version: v3.7.3

  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.15.3
  • ETCD version: v3.3.15
  • Operating System and version: Ubuntu 18.04.3 LTS

Thank you


All 7 comments

I would suggest using nc (netcat) with the -v flag to connect from node-2 and node-3 to node-1 on TCP port 179 to see if they can connect. You could also verify that node-1 can connect to itself.
If node-2 and node-3 can't connect, I would check whether there are general connectivity problems between node-1 and the others.

Thank you.

node-1# netstat -anpt | grep 179
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      9690/bird

node-2# netstat -anpt | grep 179
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      6746/bird
tcp        0      0 10.3.67.92:179          10.3.67.93:37147        ESTABLISHED 6746/bird

node-3# netstat -anpt | grep 179
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      9714/bird
tcp        0      0 10.3.67.93:37147        10.3.67.92:179          ESTABLISHED 9714/bird

  • First I ran some nc tests on port 179, but since that port is already in use by bird, I then repeated the tests on another, arbitrary port (5543).

On port 179

node-1|node-2|node-3:

# nc -l -v 179
nc: Address already in use

From all nodes to node-1:

# nc -vC node-1 179
Connection to node-1 179 port [tcp/bgp] succeeded!

Also from node-1 to all nodes

Connection to node-1 179 port [tcp/bgp] succeeded!
Connection to node-2 179 port [tcp/bgp] succeeded!
Connection to node-3 179 port [tcp/bgp] succeeded!

From node-2 to node-3 or vice versa:

node-2:~# nc -vC node-3 179
Connection to node-3 179 port [tcp/bgp] succeeded!
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒=▒▒
C] @xA▒EF^C

node-3:~# nc -vC node-2 179
Connection to node-2 179 port [tcp/bgp] succeeded!
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒=▒▒
C\ @xA▒EF^C


On port 5543

Some environment-related clarifications:

  • nuc2 is the host machine on which node-1 is running
  • nuc2's ip is 10.3.67.222 which is in the same network with virtual machines (10.3.67.0/24)
  • when node-1 is listening and I connect to it from node-2 or node-3, it takes about 6 seconds until it displays Connection from --------- received!. All other connections are instant.
  • if it helps, the DNS server which resolves node-x.local.lan is running on nuc2

root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-1 59626 received!

root@node-1:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!

---

root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 37306 received!

root@node-2:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!

---

root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 49860 received!

root@node-3:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!

root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from 10.3.67.222 40588 received!

root@node-1:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!

---

root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-2.cluster.local 48760 received!

root@node-2:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!

---

root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-3.cluster.local 34556 received!

root@node-3:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!

root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 42584 received!

root@node-1:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!

---

root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-2 48078 received!

root@node-2:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!

---

root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-3 34314 received!

root@node-3:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!

It seems weird that:

  • when connecting from node-1 to either of the nodes (2 or 3), it appears as if the host machine (nuc2) is sending the connection, not node-1
  • when connecting from node-2 or node-3 to node-1, it also appears as if the connection was received from nuc2

From the calico-node Pod logs on node-1:

bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 35125)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 35113)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 57267)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 55655)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 41295)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 47905)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 44403)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 60587)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 47445)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 48867)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 59261)
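
These log lines tie the two observations together: BIRD rejects incoming BGP connections whose source address is not one of its configured peers, and here every rejected connection comes from nuc2's address (10.3.67.222) instead of 10.3.67.92 or 10.3.67.93. A quick way to summarise which addresses are being rejected is sketched below; the function name unknown_bgp_sources is hypothetical and the parsing assumes the exact message format shown above.

```shell
# Hypothetical helper: reads calico-node (BIRD) log lines on stdin and prints
# "<source address> <count>" for each address BIRD reported as unknown.
unknown_bgp_sources() {
  awk '/Unexpected connect from unknown address/ {
         # The address is the word that follows "address" in the message.
         for (i = 1; i < NF; i++)
           if ($i == "address") count[$(i + 1)]++
       }
       END { for (a in count) print a, count[a] }' | LC_ALL=C sort
}
```

Assuming Calico runs in the kube-system namespace, it could be fed with `kubectl -n kube-system logs <calico-node-pod> | unknown_bgp_sources`, where the Pod name is a placeholder. If the only address listed is the hypervisor's, as in the log above, the peers' traffic is most likely being source-NATed before it reaches the node.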

I agree, the way it is working seems weird. It looks like you've got some networking to figure out on nuc2. Sorry, I don't think I can be much help there. It seems some NAT'ing is happening on nuc2 that isn't happening on nuc1.

It's alright, I have to figure out the networking configuration on nuc2. Thanks for your help!

I'm going to close this issue for now. If you get the networking resolved and are still having issues, please open a new one or comment here and we can reopen.

Found the culprit: a lazy generic masquerade rule:
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0
Your hint about NATing was essential. Thank you!
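
For anyone hunting for the same kind of rule, here is a rough sketch of how it might be spotted. The rule quoted above is in iptables listing format; the sketch instead reads the output of iptables -t nat -S POSTROUTING, where an unrestricted rule appears with no match options at all. The function name catch_all_masquerade is made up for illustration, and the assumption that the rule sits in the nat table's POSTROUTING chain on the hypervisor is mine, not stated in the thread.

```shell
# Hypothetical helper: reads `iptables -t nat -S POSTROUTING` output on stdin
# and prints rules that jump to MASQUERADE without any source, destination
# or interface match, i.e. rules that rewrite the source of all traffic.
catch_all_masquerade() {
  awk '$1 == "-A" && $3 == "-j" && $4 == "MASQUERADE" && NF == 4'
}
```

On the hypervisor it would be run as `iptables -t nat -S POSTROUTING | catch_all_masquerade`. Any rule it prints applies to traffic between the VMs as well, which would explain why the peers saw nuc2's address. Whether to delete such a rule or narrow it (for example by source network or outgoing interface) depends on what else relies on it.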

Thanks a lot. The following fixed my problem on all worker nodes (I didn't run it on the master):
firewall-cmd --permanent --add-port=5543/tcp --zone=public
firewall-cmd --permanent --add-port=179/tcp --zone=public
firewall-cmd --reload

(I am using VirtualBox VMs to build a 3-node test k8s cluster.)
