Calico: Calico BGP should run without a datastore connection

Created on 19 Jul 2017 · 2 Comments · Source: projectcalico/calico

Expected Behavior


If Calico loses connection to the datastore, and the calico/node container restarts, it should start BGP using the latest configuration it received from the datastore.

Current Behavior


If Calico loses connection to the datastore, and the calico/node container restarts, it will fail to get configuration from the datastore and thus fail to start BGP. This definitely affects confd/BIRD, and likely affects GoBGP as well.

The problem is that BGP connections will eventually time out and the other BGP speakers will withdraw the routes, so routing will stop working for already-running pods, containers, and VMs.

Possible Solution


For BIRD at least, we probably need:

  • To persist the config files generated by confd, and maybe startup.env to the host's filesystem.
  • To modify startup so that BIRD starts even if startup.go fails to access the datastore.

or something like that.

Not sure what would be required to make this work in GoBGP.

Steps to Reproduce (for bugs)


  1. Prevent a calico/node instance from accessing etcd
  2. Restart that node instance
  3. Observe that BGP fails to start.
Labels: impact/high, kind/bug, likelihood/low

All 2 comments

The initial prototype looks something like this:

  1. etcd is not reachable:

     curl -m 5 -L http://172.17.8.101:2379/version
     curl: (28) Connection timed out after 5001 milliseconds

  2. Restart calico node with etcd down:

     Skipping datastore connection test
     ERROR: Unable to access datastore to query node configuration
     Terminating
     [ERROR] Calico node failed to start normally...
     [WARN] Attempting to start Calico node in failsafe mode
     [WARN] Started Calico node in failsafe mode.

  3. calico node can still exchange routes:

     Calico process is running.

     IPv4 BGP status
     +--------------+-------------------+-------+----------+-------------+
     | PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
     +--------------+-------------------+-------+----------+-------------+
     | 172.17.8.102 | node-to-node mesh | up    | 21:45:28 | Established |
     +--------------+-------------------+-------+----------+-------------+

     IPv6 BGP status
     No IPv6 peers found.

meep!

Getting BIRD to run even if startup fails might be a simple matter of changing this line: https://github.com/projectcalico/node/blob/v3.4.0-0.dev/filesystem/etc/rc.local#L13

I suspect we'll want startup to return a specific error code when it fails because the datastore is unreachable, so that the script can be precise about when to actually start BIRD or perform other actions.
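As a rough sketch of that idea (not the actual rc.local logic), the script could map the startup exit status to a mode. The exit code 42, the function name decide_mode, and the check for a non-empty bird.cfg in the persisted directory are all assumptions made for illustration; nothing in this thread defines such a code.

```shell
#!/bin/sh
# Hypothetical exit code meaning "datastore unreachable".
# This is an assumption for the sketch, not a real Calico value.
DATASTORE_UNREACHABLE=42

# decide_mode STARTUP_EXIT_CODE PERSISTED_CONFIG_DIR
# Prints one of: normal, failsafe, abort.
decide_mode() {
    code=$1
    dir=$2
    if [ "$code" -eq 0 ]; then
        # Startup reached the datastore; run as usual.
        echo normal
    elif [ "$code" -eq "$DATASTORE_UNREACHABLE" ] && [ -s "$dir/bird.cfg" ]; then
        # Datastore is down, but a previously rendered config exists.
        echo failsafe
    else
        # Any other failure, or nothing persisted to fall back on.
        echo abort
    fi
}
```

A caller would run startup, capture `$?`, and start BIRD only when decide_mode prints normal or failsafe. Keeping the decision in one function means unrelated startup failures still stop the container instead of silently running on old config.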

Then, it is simply a matter of mounting in the rendered templates from the host system. We can probably store them on the host in a /var/lib/calico/ subdirectory.
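A minimal sketch of the persist and restore steps, assuming the rendered files end in .cfg and that the snapshot lives in a host-mounted directory. The function names and the example paths in the usage note are hypothetical, not taken from calico/node.

```shell
#!/bin/sh
# persist_config SRC_DIR DST_DIR
# Copy rendered *.cfg files into a temporary directory next to DST_DIR and
# then swap it into place, so a failed copy never replaces a good snapshot.
persist_config() {
    src=$1
    dst=$2
    tmp="$dst.tmp.$$"
    rm -rf "$tmp"
    mkdir -p "$tmp" || return 1
    cp "$src"/*.cfg "$tmp"/ || { rm -rf "$tmp"; return 1; }
    # Not strictly atomic: there is a short window between rm and mv.
    rm -rf "$dst"
    mv "$tmp" "$dst"
}

# restore_config SNAPSHOT_DIR TARGET_DIR
# Copy the persisted files back to where BIRD expects its config.
restore_config() {
    snap=$1
    target=$2
    [ -d "$snap" ] || return 1
    mkdir -p "$target" && cp "$snap"/*.cfg "$target"/
}
```

For example, `persist_config /etc/calico/confd/config /var/lib/calico/confd` could run after each successful render, and `restore_config /var/lib/calico/confd /etc/calico/confd/config` before starting BIRD in failsafe mode. Both paths are assumed here.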

One thing to note: we still want to block rolling updates when this occurs, otherwise we might roll out a bad config. I think Felix will fail to connect to the datastore and thus not be ready, but we might want to update the -bird-ready option so that we can tell whether we're running off "up to date" config or old persisted config.
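One way to make that distinction visible, sketched under the assumption that failsafe mode writes a marker file into a state directory. The marker name stale and the three functions are hypothetical; this is not how the existing -bird-ready check works.

```shell
#!/bin/sh
# mark_stale STATE_DIR
# Record that BIRD is running from persisted config, with a UTC timestamp.
mark_stale() {
    mkdir -p "$1" && date -u +%Y-%m-%dT%H:%M:%SZ > "$1/stale"
}

# mark_fresh STATE_DIR
# Clear the marker once config has been rendered from the datastore again.
mark_fresh() {
    rm -f "$1/stale"
}

# config_ready STATE_DIR
# Succeed only when the node is running on up-to-date config.
config_ready() {
    if [ -e "$1/stale" ]; then
        echo "not ready: running on persisted config since $(cat "$1/stale")" >&2
        return 1
    fi
    return 0
}
```

A readiness probe that calls config_ready would keep reporting not ready while the node runs on old config, which holds a rolling update in place even though BGP sessions stay up.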
