Although flannel starts renewing the lease an hour before expiration, the lease can still be lost, e.g. if the VM is suspended. Flannel should try to get the same subnet assignment if it's still available, but fall back to a new lease and signal that fact.
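To make the requested behaviour concrete, here is a minimal sketch of "reuse the old subnet if it's free, otherwise take a new one and flag the change". It is not flannel's code: `acquire_lease`, `LeaseResult`, and the in-memory `free` set are hypothetical stand-ins for whatever the real subnet manager and its backing store would provide.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class LeaseResult:
    subnet: str
    changed: bool  # True when the host held a subnet before and got a different one


def acquire_lease(previous: Optional[str], free: Set[str]) -> LeaseResult:
    """Prefer the previously held subnet; otherwise fall back to any free one."""
    if previous is not None and previous in free:
        free.remove(previous)
        return LeaseResult(previous, changed=False)
    if not free:
        raise RuntimeError("no free subnets available")
    subnet = min(free)  # deterministic pick, purely for the sketch
    free.remove(subnet)
    # Signal the change only if there was a previous subnet to lose.
    return LeaseResult(subnet, changed=previous is not None)
```

A caller would check `changed` and, when it is set, do whatever is needed to tell the container runtime that the subnet moved.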
Is there any work under way for this? It'd be incredibly useful, as right now if a machine loses a lease and gets a new one, any containers on the machine are left with no network connectivity.
One implementation idea for this is in #610.
Also see #520 for some good questions about how flannel handles this at the moment.
When fixing this, we should make sure this failure scenario is discussed clearly in the docs.
FWIW, the system design that we've converged on for Cloud Foundry is that hosts are preferentially assigned their prior lease, even if it "expired." And if a new host appears, it is assigned a lease in the following priority order:
This is meant to minimize the probability that a lease is "stolen" from a live, but partitioned, container host. But if that does occur, once the partition heals and the "victim" host re-connects, it will discover that its lease is no longer valid. In this case, the victim host falls into a special, noisy failure mode which will (1) prevent any new workloads from being scheduled and (2) trigger the orchestration system to evacuate any existing workloads. Once the evacuation is complete, the host will clean up any leftover networking state (e.g. remove the VXLAN device), acquire a new lease for itself and begin accepting new workloads.
We think this is the right plan. Feedback welcome.
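The recovery sequence described above for a host whose lease was taken can be sketched as a small state machine. This is only an illustration of the steps as written in this comment, not the Cloud Foundry implementation; the `Host` class, its method names, and the state names are all hypothetical.

```python
from enum import Enum, auto


class HostState(Enum):
    HEALTHY = auto()   # lease valid, accepting workloads
    DRAINING = auto()  # lease lost, waiting for workloads to be evacuated


class Host:
    def __init__(self, lease, workloads):
        self.lease = lease
        self.workloads = set(workloads)
        self.state = HostState.HEALTHY

    def accepts_new_workloads(self):
        return self.state is HostState.HEALTHY

    def on_reconnect(self, lease_still_valid):
        # After the partition heals, the host checks its lease.
        if not lease_still_valid:
            self.lease = None
            self.state = HostState.DRAINING  # (1) stop scheduling, (2) evacuate

    def on_workload_evacuated(self, name):
        self.workloads.discard(name)

    def try_recover(self, new_lease):
        # Only once evacuation is complete: clean up, take a new lease, resume.
        if self.state is not HostState.DRAINING or self.workloads:
            return False
        # Leftover network state (e.g. the VXLAN device) would be removed here.
        self.lease = new_lease
        self.state = HostState.HEALTHY
        return True
```

The key property is that a new lease is never taken while old workloads are still running on the host.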
This is now fixed in v0.8.0