Calico: Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established

Created on 3 Oct 2019 · 7 Comments · Source: projectcalico/calico

Hi,
The calico-node Pod fails to start on one of the machines, and I can't figure out what the problem is.

Current Behavior

The calico-node Pod fails to start on one of the nodes (node-1). It worked fine before and then suddenly stopped working.

Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 10.3.67.92,10.3.67.93
2019-10-02 14:51:02.472 [INFO][170] readiness.go 88: Number of node(s) with BGP peering established = 0

node-1 # calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+---------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |  INFO   |
+--------------+-------------------+-------+----------+---------+
| 10.3.67.92   | node-to-node mesh | start | 14:07:10 | Passive |
| 10.3.67.93   | node-to-node mesh | start | 14:07:10 | Passive |
+--------------+-------------------+-------+----------+---------+

IPv6 BGP status
No IPv6 peers found.

node-2 # calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.3.67.91   | node-to-node mesh | start | 10:42:13 | Active Socket: Connection      |
|              |                   |       |          | reset by peer                  |
| 10.3.67.93   | node-to-node mesh | up    | 10:42:14 | Established                    |
+--------------+-------------------+-------+----------+--------------------------------+

IPv6 BGP status
No IPv6 peers found.

node-3 # calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+--------------+-------------------+-------+----------+--------------------------------+
| 10.3.67.91   | node-to-node mesh | start | 10:42:10 | Active Socket: Connection      |
|              |                   |       |          | reset by peer                  |
| 10.3.67.92   | node-to-node mesh | up    | 10:42:14 | Established                    |
+--------------+-------------------+-------+----------+--------------------------------+

IPv6 BGP status
No IPv6 peers found.
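
To compare nodes quickly, the status tables above can be reduced to the peers whose session is not up. This is only a minimal sketch, assuming the table layout shown above; the function name `bgp_peers_not_up` is hypothetical and not part of calicoctl.

```shell
# Hypothetical helper: read `calicoctl node status` output on stdin and
# print "<peer address> <state>" for every peer whose STATE is not "up".
bgp_peers_not_up() {
  awk -F'|' '
    NF >= 6 {
      peer = $2; state = $4
      gsub(/ /, "", peer); gsub(/ /, "", state)
      # Skip the header row and wrapped continuation rows (empty peer cell).
      if (peer != "" && peer != "PEERADDRESS" && state != "up") print peer, state
    }'
}
```

Piping the node-2 table above through it (`calicoctl node status | bgp_peers_not_up`) would print `10.3.67.91 start`, which matches the picture that only the sessions involving node-1 are down.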

Your Environment

We have 3 virtual machines installed with libvirt, running on 2 physical machines:
- node-1 on physical machine nuc2
- node-2 & node-3 on physical machine nuc1
All physical and virtual machines run Ubuntu 18.04.3 LTS:
- nuc1 has a fresh install of Ubuntu 18.04.3 LTS
- nuc2 had Ubuntu 16.04 & was upgraded to Ubuntu 18.04.3 LTS
No firewall is active on port 179
Calico is installed with Kubespray ansible playbook v2.11.0
Calico version: v3.7.3
Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.15.3
ETCD version: v3.3.15
Operating System and version: Ubuntu 18.04.3 LTS

Thank you

kinsupport

rami-abu

All 7 comments

I would suggest using nc (netcat) with the -v flag to connect from node-2 and node-3 to node-1 on TCP port 179 to see if they can connect. You could also verify that node-1 can connect to itself.
If node-2 and node-3 can't connect, then I would check whether there are general connectivity problems between node-1 and the others.

tmjd on 7 Oct 2019

Thank you.

node-1# netstat -anpt | grep 179
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      9690/bird

node-2# netstat -anpt | grep 179
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      6746/bird
tcp        0      0 10.3.67.92:179          10.3.67.93:37147        ESTABLISHED 6746/bird

node-3# netstat -anpt | grep 179
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      9714/bird
tcp        0      0 10.3.67.93:37147        10.3.67.92:179          ESTABLISHED 9714/bird

First, I ran some nc tests on port 179. Since that port is already in use by BIRD, I then repeated the same tests on another arbitrary port (5543).

On port 179

node-1|node-2|node-3:

# nc -l -v 179
nc: Address already in use

From all nodes to node-1:

# nc -vC node-1 179
Connection to node-1 179 port [tcp/bgp] succeeded!

Also from node-1 to all nodes:

Connection to node-1 179 port [tcp/bgp] succeeded!
Connection to node-2 179 port [tcp/bgp] succeeded!
Connection to node-3 179 port [tcp/bgp] succeeded!

From node-2 to node-3 or vice versa:

node-2:~# nc -vC node-3 179
Connection to node-3 179 port [tcp/bgp] succeeded!
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒=▒▒
C] @xA▒EF^C

node-3:~# nc -vC node-2 179
Connection to node-2 179 port [tcp/bgp] succeeded!
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒=▒▒
C\ @xA▒EF^C

On port 5543

Some environment-related clarifications:

- nuc2 is the host machine on which node-1 is running
- nuc2's IP is 10.3.67.222, which is in the same network as the virtual machines (10.3.67.0/24)
- when node-1 is listening and I connect to it from node-2 or node-3, it takes about 6 seconds until it displays "Connection from --------- received!". The rest of the connections are instant.
- if it helps, the DNS server which resolves node-x.local.lan is running on nuc2

root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-1 59626 received!

root@node-1:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!

---

root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 37306 received!

root@node-2:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!

---

root@node-1:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 49860 received!

root@node-3:~# nc -v -C node-1 5543
Connection to node-1 5543 port [tcp/*] succeeded!

root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from 10.3.67.222 40588 received!

root@node-1:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!

---

root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-2.cluster.local 48760 received!

root@node-2:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!

---

root@node-2:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-3.cluster.local 34556 received!

root@node-3:~# nc -v -C node-2 5543
Connection to node-2 5543 port [tcp/*] succeeded!

root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from local-nuc2.local.lan 42584 received!

root@node-1:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!

---

root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-2 48078 received!

root@node-2:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!

---

root@node-3:~# nc -v -l 5543
Listening on [0.0.0.0] (family 0, port 5543)
Connection from node-3 34314 received!

root@node-3:~# nc -v -C node-3 5543
Connection to node-3 5543 port [tcp/*] succeeded!

It seems weird that:

- when connecting from node-1 to either of the other nodes (2 or 3), it appears as if the host machine (nuc2) is making the connection, not node-1
- when connecting from node-2 or node-3 to node-1, it also appears as if the connection was received from nuc2

From calico-node Pod on node-1 logs:

bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 35125)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 35113)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 57267)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 55655)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 41295)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 47905)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 44403)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 60587)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 47445)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 48867)
bird: BGP: Unexpected connect from unknown address 10.3.67.222 (port 59261)
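
A log like the one above can be summarised by source address to confirm that every rejected connection comes from the same host. The following is a rough sketch, not an official Calico tool; the function name `unexpected_bgp_sources` is made up for illustration.

```shell
# Hypothetical helper: read BIRD log lines on stdin and print
# "<count> <address>" for each address BIRD rejected as an unknown peer.
unexpected_bgp_sources() {
  awk '
    /Unexpected connect from unknown address/ {
      # The address is the field that follows the word "address".
      for (i = 1; i < NF; i++) if ($i == "address") count[$(i + 1)]++
    }
    END { for (a in count) print count[a], a }' | sort -k2
}
```

It could be fed from something like `kubectl logs -n kube-system <calico-node-pod> -c calico-node | unexpected_bgp_sources`, where the namespace and container name are assumptions about this particular install. For the excerpt above it would report a single address, 10.3.67.222, which is nuc2 rather than any of the mesh peers.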

rami-abu on 8 Oct 2019

I agree, the way it is working seems weird. It seems like you've got some networking to figure out on nuc2. Sorry, I don't think I can be much help there. It looks like some NAT'ing is happening on nuc2 that isn't happening on nuc1.

tmjd on 9 Oct 2019

👍1

It's alright, I have to figure out the networking configuration on nuc2. Thanks for your help!

rami-abu on 10 Oct 2019

I'm going to close this issue for now. If you get the networking resolved and are still having issues, please open a new issue or comment here and we can reopen.

tmjd on 10 Oct 2019

Found the culprit: a lazy generic masquerade rule:
MASQUERADE all -- 0.0.0.0/0 0.0.0.0/0
Your hint about NATing was essential. Thank you!
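
For anyone hunting a similar rule, here is a hedged sketch of how an unrestricted MASQUERADE rule might be spotted. It assumes the rule sits in the nat table's POSTROUTING chain on the hypervisor, which the thread does not state explicitly, and the function name `find_generic_masquerade` is hypothetical. Note also that a plain `iptables -L` listing hides interface matches, so the rule-spec form printed by `iptables -S` is used as input instead.

```shell
# Hypothetical helper: read `iptables -t nat -S POSTROUTING` style lines on
# stdin and print MASQUERADE rules that carry no source (-s), destination
# (-d), output-interface (-o), or match-module (-m) restriction.
find_generic_masquerade() {
  awk '
    / -j MASQUERADE/ {
      restricted = 0
      for (i = 1; i <= NF; i++)
        if ($i == "-s" || $i == "-d" || $i == "-o" || $i == "-m") restricted = 1
      if (!restricted) print
    }'
}
```

Run as root on the suspect host, for example `iptables -t nat -S POSTROUTING | find_generic_masquerade`. Any line it prints rewrites the source address of all forwarded traffic, which would explain BIRD seeing peers arrive from 10.3.67.222. How to narrow such a rule (by source subnet or outgoing interface) depends on the host's network layout.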

rami-abu on 18 Oct 2019

🎉3

Thanks a lot. The following fixed my problem on all worker nodes (I didn't run it on the master):
firewall-cmd --permanent --add-port=5543/tcp --zone=public
firewall-cmd --permanent --add-port=179/tcp --zone=public
firewall-cmd --reload

(I am using VirtualBox VMs to build a 3-node test k8s cluster)

fileexit-ocp on 21 May 2020
