I've been looking for information on how MicroK8s handles failure scenarios such as network partitioning (as per the CAP theorem), but haven't found anything solid.
So far the best information I've found is from the dqlite FAQ:
How does dqlite behave during conflict situations?
Does Raft select a winning WAL to write and any others in-flight writes are aborted?
There can't be a conflict situation. Raft's model is that only the leader can append new log entries, which translated to dqlite means that only the leader can write new WAL frames. So this means that any attempt to perform a write transaction on a non-leader node will fail with an ErrNotLeader error (and in this case clients are supposed to retry against whoever is the new leader).
When not enough nodes are available, are writes hung until consensus?
Yes, however, there's a (configurable) timeout. This is a consequence of Raft sitting in the CP spectrum of the CAP theorem: in case of a network partition, it chooses consistency and sacrifices availability.
This would suggest that MicroK8s also sacrifices availability in favor of consistency. However, the dqlite FAQ also mentions a configurable timeout, and I haven't found any further information on it.
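To make the retry behaviour from the FAQ concrete, here is a minimal Python sketch of a client that retries a write against the leader. It is a toy model, not dqlite's client API: `NotLeaderError`, `Node`, `write_with_retry` and the `leader_hint` attribute are all hypothetical names, and the idea that the error carries a hint about the current leader is an assumption made for illustration.

```python
class NotLeaderError(Exception):
    """Hypothetical stand-in for dqlite's ErrNotLeader."""

    def __init__(self, leader_hint=None):
        super().__init__("not leader")
        # Assumption: the error tells the client who the leader is.
        self.leader_hint = leader_hint


class Node:
    """Toy node: only the current leader accepts writes."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster  # shared state: {"leader": name, "log": [...]}

    def write(self, entry):
        if self.cluster["leader"] != self.name:
            raise NotLeaderError(leader_hint=self.cluster["leader"])
        self.cluster["log"].append(entry)
        return len(self.cluster["log"])  # index of the appended entry


def write_with_retry(nodes, entry, start, max_attempts=3):
    """Try a write, following leader hints until it lands on the leader."""
    target = start
    for _ in range(max_attempts):
        try:
            return nodes[target].write(entry)
        except NotLeaderError as err:
            if err.leader_hint is None:
                raise  # no known leader, e.g. no quorum
            target = err.leader_hint
    raise RuntimeError("no leader found")


cluster = {"leader": "b", "log": []}
nodes = {name: Node(name, cluster) for name in ("a", "b", "c")}
index = write_with_retry(nodes, "create pod", start="a")
```

In this sketch the first attempt on node `a` is rejected, and the retry on node `b` appends the entry, so only the leader ever touches the log.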
The question
Assuming we had a 6-node HA MicroK8s cluster that suffered an unfortunate long-term network issue splitting the cluster in half (3 nodes can reach each other on each side of the network split), what would happen to the MicroK8s cluster after network connectivity has been restored?
Hi @MythicManiac
Assuming we had a 6-node HA MicroK8s cluster that suffered an unfortunate long-term network issue splitting the cluster in half (3 nodes can reach each other on each side of the network split), what would happen to the MicroK8s cluster after network connectivity has been restored?
The short answer. As soon as the cluster gets split, one part will freeze while the other continues working. When the two parts reconnect, the frozen part will resume working using the datastore state of the part that continued working.
The long answer. By default MicroK8s uses dqlite to store the Kubernetes state. In dqlite there is a leader node that acts as the "gatekeeper" of all datastore requests and ensures the consistency of the data. The leader along with two more nodes (the voters) maintain a copy of the datastore. If the leader is unreachable, the remaining two nodes have quorum to elect a new leader. When a split like the one you describe happens, one of the voter nodes will be separated from the other two. The part holding two voters keeps quorum and continues working; the part holding the single voter cannot elect a leader and freezes.
When the two parts reconnect, the frozen one will be behind in the datastore log, so it will update to the most up-to-date copy.
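The quorum reasoning above can be sketched in a few lines of Python. This is only an illustration of the majority rule, not MicroK8s or dqlite code: the node names, the choice of which three nodes are voters, and the helpers `has_quorum`, `partition_status` and `catch_up` are all hypothetical.

```python
def has_quorum(reachable_voters, total_voters):
    """A strict majority of voters is needed to elect a leader and commit writes."""
    return reachable_voters > total_voters // 2


def partition_status(partition, voters):
    """Return 'active' if this side of the split can still reach a voter majority."""
    reachable = len(set(partition) & set(voters))
    return "active" if has_quorum(reachable, len(voters)) else "frozen"


def catch_up(follower_log, leader_log):
    """Append the entries the frozen side missed.

    Assumes follower_log is a prefix of leader_log, which holds in this model
    because the frozen side could not commit any writes during the split.
    """
    return follower_log + leader_log[len(follower_log):]


# Assumed layout: 6 nodes, 3 of them voters, split 3/3.
voters = {"n1", "n2", "n3"}
side_a = ["n1", "n2", "n4"]  # two voters: keeps quorum
side_b = ["n3", "n5", "n6"]  # one voter: loses quorum

print(partition_status(side_a, voters))  # active
print(partition_status(side_b, voters))  # frozen

# After the network heals, the frozen side replays what it missed.
log_before_split = ["w1", "w2"]
log_side_a = log_before_split + ["w3", "w4"]
print(catch_up(log_before_split, log_side_a))  # ['w1', 'w2', 'w3', 'w4']
```

Because a strict majority is required, two disjoint sides can never both report `active`, whichever way the voters end up distributed.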
Thanks for the clarification @ktsakalozos. The way you explained it sounds like the design prevents more than a single active partition at a time even in failure scenarios outside of my example. Would this be a fair assumption to make?
So that would mean failures are limited to either taking down the entire cluster or a portion of it, but partitioning should not be a concern.