Calico: felix connection to etcdv3 api stucks

Created on 23 Jan 2018 · 12Comments · Source: projectcalico/calico

We are running Calico 3.0.1 and experience the same issue from time to time. At some point one of the nodes becomes "broken" - all new pods don't have networking.

What happens:
1) felix watch connection(s) to etcd got stuck (we had etcd node hard reboot/reinstall this time)
2) user creates new pod
3) route and link added by CNI
4) felix sees new unknown route on interface
5) felix deletes new route after grace period as unknown

Expected Behavior

Felix is healthy and does not delete pods routes.

Current Behavior

Felix deletes new pods routes as unknown.

Possible Solution

As workaround i just did kill <felix pid> inside broken calico-node. Beforehand i spent hour to double check bgp, bird, ipip tunnels. Finally i can say with 100% confidence that issue localized in felix <> etcd connection.

As solution i think there can be some heathbeat check (above tcp heartbeat) that makes sure that watch connection is still alive. May be this bug even for etcd client driver.

Steps to Reproduce (for bugs)

See above, but there is no 100% way to reproduce this bug, we hit one at least 3 times and workaround was to restart broken node.

Context

We are using following etcd endpoints in calico configuration https://10.0.5.101:2379,https://10.0.5.102:2379,https://10.0.5.103:2379.

calico-node logs when new pod created and then route removed

2018-01-23 13:11:48.552 [INFO][102] ipsets.go 253: Resyncing ipsets with dataplane. family="inet"
2018-01-23 13:11:48.552 [INFO][102] route_table.go 442: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.189.183/32 ifaceName="cali5129b6bd77a" ipVersion=0x4
2018-01-23 13:11:48.552 [INFO][102] route_table.go 298: Interface in cleanup grace period, will retry after. ifaceName="cali5129b6bd77a" ipVersion=0x4
...
2018-01-23 13:11:59.503 [INFO][102] route_table.go 446: Syncing routes: removing old route. dest=192.168.189.183/32 ifaceName="cali5129b6bd77a" ipVersion=0x4
2018-01-23 13:11:59.503 [INFO][102] conntrack.go 78: Removing conntrack flows ip=192.168.189.183

netstat -nptu | grep 2379 on broken node

tcp        0      0 127.0.0.1:54408         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:53158         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 127.0.0.1:53496         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:38372        10.0.5.103:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42572        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:39964        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 127.0.0.1:57264         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:37464        10.0.5.101:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42420        10.0.5.103:2379         ESTABLISHED 102/calico-felix
tcp        0      0 10.0.5.105:40714        10.0.5.101:2379         ESTABLISHED 99/confd
tcp        0      0 127.0.0.1:53498         127.0.0.1:2379          ESTABLISHED -
tcp        0      0 10.0.5.105:51858        10.0.5.102:2379         ESTABLISHED -
tcp        0      0 10.0.5.105:42426        10.0.5.103:2379         ESTABLISHED 102/calico-felix
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:57264  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:54408  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53496  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53158  ESTABLISHED -
tcp        0      0 ::ffff:127.0.0.1:2379   ::ffff:127.0.0.1:53498  ESTABLISHED -

BGP peers and IPIP tunnels were healthy.

Also before node became "broken" i saw this from bird (10.0.5.103 is master/etcd that was reinstalled):

bird: Mesh_10_0_5_103: Error: Hold timer expired
bird: Mesh_10_0_5_103: State changed to stop
bird: Mesh_10_0_5_103: State changed to down
bird: Mesh_10_0_5_103: Starting
bird: Mesh_10_0_5_103: State changed to start

Your Environment

Calico version 3.0.1
Kubernetes 1.7.11
Operating System and version: CoreOS 1576.4.0

Source

r7vme

❤3

Most helpful comment

@caseydavenport DYK if the server needs to be upgraded to get keep-alive support? If so, we might want to consider a workaround :-/

For example, reset a watchdog timer every time we get an event from the server and then pro-actively restart the connection if we don't see any events for 90s.

fasaxc on 25 Jan 2018

👍4

All 12 comments

@r7vme thanks for raising. I think the root cause of this is in the etcdv3 client.

I've got a WIP PR which should hopefully fix this here: https://github.com/projectcalico/libcalico-go/pull/668

However, we need to wait until the etcd v3.3 client code is available, since that is the first release that includes gRPC keepalive support.

caseydavenport on 24 Jan 2018

ack. thanks for the info.

r7vme on 24 Jan 2018

@caseydavenport DYK if the server needs to be upgraded to get keep-alive support? If so, we might want to consider a workaround :-/

For example, reset a watchdog timer every time we get an event from the server and then pro-actively restart the connection if we don't see any events for 90s.

fasaxc on 25 Jan 2018

👍4

@fasaxc that's a good point, we very well might need an upgraded server as well. We'll need to investigate.

If so, we'll need to decide:

Write some workaround ourselves
Notify users that v3.3+ of etcd is required.

caseydavenport on 25 Jan 2018

As an addition to the 90s timeout workaround, would it be worth considering implementing a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them - perhaps having a particular filtered-out instance of each resource type (so that we don't actually generate unnecessary churn in Felix). Probably an overkill - but throwing it out there.

robbrockbank on 29 Jan 2018

@robbrockbank Do we know if they're all multiplexed onto one TCP session? If so, refreshing the clusterinfo or something every 10s would do it.

fasaxc on 30 Jan 2018

JFYI: we are noticing this issue very frequently in test environment that has lots of node (etcd) reboots/crashes.

I would love to see workaround that will not require etcd 3.3.

r7vme on 31 Jan 2018

👍1

@fasaxc : My understanding is that they are multiplexed onto a single connection with etcdv3. So doing this keep alive on a single resource type should handle the connection dropped issue.

That said, there are a couple of reasons though why doing a keep alive for each resource type might be preferable:

We'll have recent revisions for each resource type that will allow us to start the watcher using the revision of the last received update, thus reducing the chance that we'll need to do a full resync on a particular resource type.
When using Kubernetes API as the datastore, K8s automatically closes a CRD watcher after 1min of inactivity, presumably to work around this problem. This would help keep all CRD watchers active.

If we do want to do a keep alive, perhaps refreshing a WEP resource would be better since that is probably the one that would benefit most from not requiring a full resync.

robbrockbank on 31 Jan 2018

would it be worth considering implementing a lightweight load generator that just keeps the watchers "active" so that we don't need to restart them

I think we should avoid this - IIUC it either means writing a new component or making Felix write data to etcd, which changes its "read-only" behavior.

caseydavenport on 1 Feb 2018

Ok, so I've managed to reproduce the issue where etcd connections remain in ESTABLISHED state after etcd has died (by doing a GCE instance reset). When the etcd instance comes back, the TCP connections remain in ESTABLISHED state.

By the way, this looks to be the same as this Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/46964

The fix for that is the same as what I had originally guessed the fix here would need to be, and was merged in this PR: https://github.com/kubernetes/kubernetes/pull/58008

I've got the Calico equivalent of that here: https://github.com/projectcalico/libcalico-go/pull/777

Running a build with that patch, I see the connections become properly re-established after I reset my etcd instance, so I think that fixes it. I've also tried this with an etcd server version of v3.1.10 and it behaves the same, so it looks like this will work without requiring an etcd upgrade, which is great.

What I don't see is any indication of the failure in Calico's logging - seems that the client handles the keepalive timeouts and reconnection under the covers without bubbling it up to the calling code. This is unfortunate, but pretty minor in comparison so I think let's move ahead with getting the keepalive patch into a bugfix release.

caseydavenport on 6 Feb 2018

👍1

The fix for this is out in Calico v3.0.2 - please try it out and let me know if you hit any issues!

caseydavenport on 12 Feb 2018

Big thanks, we'll let you know.

r7vme on 13 Feb 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Cannot access kube-api (with Service-IP) from a pod in default namespace

wjentner · 5Comments

Add non-cluster host to Kubernets Calico network

AlejoAsd · 5Comments

Calico 3.2.1 with K8s 1.11.1: Readiness probe failed

mrsherlock88 · 3Comments

Calico Network Policy only works when both application are on same K8s node

venomwaqar · 5Comments

Liveness probe failed: calico/node is not ready: Felix is not live: Get http://localhost:9099/liveness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

cann0nf0dder · 5Comments