We are running Calico 3.0.1 and experience the same issue from time to time. At some point one of the nodes becomes "broken" - all new pods don't have networking.
What happens:
1) felix watch connection(s) to etcd got stuck (we had etcd node hard reboot/reinstall this time)
2) user creates new pod
3) route and link added by CNI
4) felix sees new unknown route on interface
5) felix deletes new route after grace period as unknown
Felix is healthy and does not delete pods routes.
Felix deletes new pods routes as unknown.
As workaround i just did kill <felix pid> inside broken calico-node. Beforehand i spent hour to double check bgp, bird, ipip tunnels. Finally i can say with 100% confidence that issue localized in felix <> etcd connection.
As solution i think there can be some heathbeat check (above tcp heartbeat) that makes sure that watch connection is still alive. May be this bug even for etcd client driver.
See above, but there is no 100% way to reproduce this bug, we hit one at least 3 times and workaround was to restart broken node.
We are using following etcd endpoints in calico configuration https://10.0.5.101:2379,https://10.0.5.102:2379,https://10.0.5.103:2379.
calico-node logs when new pod created and then route removed
2018-01-23 13:11:48.552 [INFO][102] ipsets.go 253: Resyncing ipsets with dataplane. family="inet"
2018-01-23 13:11:48.552 [INFO][102] route_table.go 442: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.189.183/32 ifaceName="cali5129b6bd77a" ipVersion=0x4
2018-01-23 13:11:48.552 [INFO][102] route_table.go 298: Interface in cleanup grace period, will retry after. ifaceName="cali5129b6bd77a" ipVersion=0x4
...
2018-01-23 13:11:59.503 [INFO][102] route_table.go 446: Syncing routes: removing old route. dest=192.168.189.183/32 ifaceName="cali5129b6bd77a" ipVersion=0x4
2018-01-23 13:11:59.503 [INFO][102] conntrack.go 78: Removing conntrack flows ip=192.168.189.183
netstat -nptu | grep 2379 on broken node
tcp 0 0 127.0.0.1:54408 127.0.0.1:2379 ESTABLISHED -
tcp 0 0 127.0.0.1:53158 127.0.0.1:2379 ESTABLISHED -
tcp 0 0 127.0.0.1:53496 127.0.0.1:2379 ESTABLISHED -
tcp 0 0 10.0.5.105:38372 10.0.5.103:2379 ESTABLISHED -
tcp 0 0 10.0.5.105:42572 10.0.5.101:2379 ESTABLISHED -
tcp 0 0 10.0.5.105:39964 10.0.5.101:2379 ESTABLISHED -
tcp 0 0 127.0.0.1:57264 127.0.0.1:2379 ESTABLISHED -
tcp 0 0 10.0.5.105:37464 10.0.5.101:2379 ESTABLISHED -
tcp 0 0 10.0.5.105:42420 10.0.5.103:2379 ESTABLISHED 102/calico-felix
tcp 0 0 10.0.5.105:40714 10.0.5.101:2379 ESTABLISHED 99/confd
tcp 0 0 127.0.0.1:53498 127.0.0.1:2379 ESTABLISHED -
tcp 0 0 10.0.5.105:51858 10.0.5.102:2379 ESTABLISHED -
tcp 0 0 10.0.5.105:42426 10.0.5.103:2379 ESTABLISHED 102/calico-felix
tcp 0 0 ::ffff:127.0.0.1:2379 ::ffff:127.0.0.1:57264 ESTABLISHED -
tcp 0 0 ::ffff:127.0.0.1:2379 ::ffff:127.0.0.1:54408 ESTABLISHED -
tcp 0 0 ::ffff:127.0.0.1:2379 ::ffff:127.0.0.1:53496 ESTABLISHED -
tcp 0 0 ::ffff:127.0.0.1:2379 ::ffff:127.0.0.1:53158 ESTABLISHED -
tcp 0 0 ::ffff:127.0.0.1:2379 ::ffff:127.0.0.1:53498 ESTABLISHED -
BGP peers and IPIP tunnels were healthy.
Also before node became "broken" i saw this from bird (10.0.5.103 is master/etcd that was reinstalled):
bird: Mesh_10_0_5_103: Error: Hold timer expired
bird: Mesh_10_0_5_103: State changed to stop
bird: Mesh_10_0_5_103: State changed to down
bird: Mesh_10_0_5_103: Starting
bird: Mesh_10_0_5_103: State changed to start
@r7vme thanks for raising. I think the root cause of this is in the etcdv3 client.
I've got a WIP PR which should hopefully fix this here: https://github.com/projectcalico/libcalico-go/pull/668
However, we need to wait until the etcd v3.3 client code is available, since that is the first release that includes gRPC keepalive support.
ack. thanks for the info.
@caseydavenport DYK if the server needs to be upgraded to get keep-alive support? If so, we might want to consider a workaround :-/
For example, reset a watchdog timer every time we get an event from the server and then pro-actively restart the connection if we don't see any events for 90s.
@fasaxc that's a good point, we very well might need an upgraded server as well. We'll need to investigate.
If so, we'll need to decide:
As an addition to the 90s timeout workaround, would it be worth considering implementing a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them - perhaps having a particular filtered-out instance of each resource type (so that we don't actually generate unnecessary churn in Felix). Probably an overkill - but throwing it out there.
@robbrockbank Do we know if they're all multiplexed onto one TCP session? If so, refreshing the clusterinfo or something every 10s would do it.
JFYI: we are noticing this issue very frequently in test environment that has lots of node (etcd) reboots/crashes.
I would love to see workaround that will not require etcd 3.3.
@fasaxc : My understanding is that they are multiplexed onto a single connection with etcdv3. So doing this keep alive on a single resource type should handle the connection dropped issue.
That said, there are a couple of reasons though why doing a keep alive for each resource type might be preferable:
If we do want to do a keep alive, perhaps refreshing a WEP resource would be better since that is probably the one that would benefit most from not requiring a full resync.
would it be worth considering implementing a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them
I think we should avoid this - IIUC it either means writing a new component or making Felix write data to etcd, which changes its "read-only" behavior.
Ok, so I've managed to reproduce the issue where etcd connections remain in ESTABLISHED state after etcd has died (by doing a GCE instance reset). When the etcd instance comes back, the TCP connections remain in ESTABLISHED state.
By the way, this looks to be the same as this Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/46964
The fix for that is the same as what I had originally guessed the fix here would need to be, and was merged in this PR: https://github.com/kubernetes/kubernetes/pull/58008
I've got the Calico equivalent of that here: https://github.com/projectcalico/libcalico-go/pull/777
Running a build with that patch, I see the connections become properly re-established after I reset my etcd instance, so I think that fixes it. I've also tried this with an etcd server version of v3.1.10 and it behaves the same, so it looks like this will work without requiring an etcd upgrade, which is great.
What I don't see is any indication of the failure in Calico's logging - seems that the client handles the keepalive timeouts and reconnection under the covers without bubbling it up to the calling code. This is unfortunate, but pretty minor in comparison so I think let's move ahead with getting the keepalive patch into a bugfix release.
The fix for this is out in Calico v3.0.2 - please try it out and let me know if you hit any issues!
Big thanks, we'll let you know.
Most helpful comment
@caseydavenport DYK if the server needs to be upgraded to get keep-alive support? If so, we might want to consider a workaround :-/
For example, reset a watchdog timer every time we get an event from the server and then pro-actively restart the connection if we don't see any events for 90s.