Expected behavior: If Calico loses its connection to the datastore and the calico/node container restarts, it should start BGP using the latest configuration it received from the datastore.
Current behavior: If Calico loses its connection to the datastore and the calico/node container restarts, it fails to get configuration from the datastore and therefore fails to start BGP. This definitely affects confd/BIRD, and likely affects GoBGP as well.
The problem is that BGP connections will eventually time out and the routes will be withdrawn by other BGP speakers, so routing will stop working for already-running pods/containers/VMs.
For BIRD at least, we probably need to:
- Persist startup.env to the host's filesystem.
- Use the persisted copy when startup.go fails to access the datastore.
or something like that.
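The two steps above might look roughly like the following. This is only a sketch of the idea in Python, not the actual startup.go logic: the function names, the `fetch` callback, and the cache path are all hypothetical, and it assumes startup.env is a flat file of KEY=VALUE lines.

```python
import os
import tempfile


def save_env(path, env):
    """Atomically write KEY=VALUE lines so a crash never leaves a half-written file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        for key in sorted(env):
            f.write(f"{key}={env[key]}\n")
    os.replace(tmp, path)  # atomic rename within the same directory


def load_env(path):
    """Read back KEY=VALUE lines written by save_env."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and "=" in line:
                key, value = line.split("=", 1)
                env[key] = value
    return env


def startup_env(fetch, cache_path):
    """Return (env, from_cache).

    fetch is a hypothetical callable that queries the datastore and raises
    ConnectionError when the datastore is unreachable.
    """
    try:
        env = fetch()
    except ConnectionError:
        # Datastore unreachable: fall back to the last persisted copy, if any.
        if not os.path.exists(cache_path):
            raise
        return load_env(cache_path), True
    # Datastore reachable: refresh the persisted copy for the next restart.
    save_env(cache_path, env)
    return env, False
```

Returning the `from_cache` flag alongside the values lets the caller log that it is running on old config, which matters for the readiness question raised later in the thread.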
Not sure what would be required to make this work in GoBGP.
To reproduce: prevent a calico/node instance from accessing etcd, then restart the container.
The initial prototype looks something like this:
curl -m 5 -L http://172.17.8.101:2379/version
curl: (28) Connection timed out after 5001 milliseconds
Skipping datastore connection test
ERROR: Unable to access datastore to query node configuration
Terminating
[ERROR] Calico node failed to start normally...
[WARN] Attempting to start Calico node in failsafe mode
[WARN] Started Calico node in failsafe mode.
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+----------+-------------+
| 172.17.8.102 | node-to-node mesh | up | 21:45:28 | Established |
+--------------+-------------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
meep!
Getting BIRD to run even if startup fails might be a simple matter of changing this line: https://github.com/projectcalico/node/blob/v3.4.0-0.dev/filesystem/etc/rc.local#L13
I suspect we'll want to return a specific error code if it fails due to datastore reachability errors, so we can be precise about when we actually want to start BIRD or perform other actions in that script.
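To make the error-code idea concrete, here is a rough sketch of the decision the script would have to make. It is written in Python purely for illustration (rc.local itself is a shell script), and the exit code value 3 and all names are made up; nothing here is defined by calico/node today.

```python
import enum

# Hypothetical exit code meaning "could not reach the datastore".
EXIT_DATASTORE_UNREACHABLE = 3


class Action(enum.Enum):
    START_NORMAL = "start BIRD with fresh config"
    START_FAILSAFE = "start BIRD with persisted config"
    ABORT = "do not start BIRD"


def next_action(startup_exit_code, have_persisted_config):
    """Decide what the init script should do after the startup step exits."""
    if startup_exit_code == 0:
        return Action.START_NORMAL
    if startup_exit_code == EXIT_DATASTORE_UNREACHABLE and have_persisted_config:
        # Only the datastore was unreachable, and there is config to fall back on.
        return Action.START_FAILSAFE
    # Any other failure, or nothing persisted, is treated as fatal.
    return Action.ABORT
```

Keeping every other non-zero exit code fatal means a genuinely broken config still stops the node instead of silently starting BIRD.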
Then, it is simply a matter of mounting in the rendered templates from the host system. We can probably store them on the host in a /var/lib/calico/ subdirectory.
One thing to note: we still want to block rolling updates when this occurs; otherwise we might roll out a bad config. I think Felix will fail to connect to the datastore and thus not be ready, but we might want to update the -bird-ready option so that we can tell whether we're running off of "up to date" config or old persisted config.
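One possible shape for that readiness distinction, sketched in Python under assumptions: a marker file is written whenever BIRD is started from persisted config, and the readiness check reports not-ready while the marker exists. The marker file, its path, and these function names are hypothetical and are not how -bird-ready works today.

```python
import os


def mark_failsafe(marker_path):
    """Record that BIRD was started from persisted (possibly stale) config."""
    with open(marker_path, "w") as f:
        f.write("failsafe\n")


def clear_failsafe(marker_path):
    """Remove the marker once fresh config has been fetched from the datastore."""
    try:
        os.remove(marker_path)
    except FileNotFoundError:
        pass


def bird_ready(bird_running, marker_path):
    """Return (ready, reason). Stale config counts as not ready, which blocks rollouts."""
    if not bird_running:
        return False, "BIRD is not running"
    if os.path.exists(marker_path):
        return False, "BIRD is running on persisted config"
    return True, "BIRD is running on up-to-date config"
```

With this split, routing keeps working in failsafe mode while the node still reports not-ready, so a rolling update would stall rather than proceed on old config.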