Microk8s: Question: How does HA microk8s handle or prevent partitioning?

Created on 11 Nov 2020 · 2 Comments · Source: ubuntu/microk8s

I've been looking for information on how microk8s handles failure scenarios such as network partitioning (as per the CAP theorem), but so far I haven't found anything solid.

So far the best information I've found is from the dqlite FAQ:

How does dqlite behave during conflict situations?
Does Raft select a winning WAL to write and any others in-flight writes are aborted?

There can’t be a conflict situation. Raft’s model is that only the leader can append new log entries, which translated to dqlite means that only the leader can write new WAL frames. So this means that any attempt to perform a write transaction on a non-leader node will fail with an ErrNotLeader error (and in this case clients are supposed to retry against whoever is the new leader).

When not enough nodes are available, are writes hung until consensus?
Yes, however, there’s a (configurable) timeout. This is a consequence of Raft sitting in the CP spectrum of the CAP theorem: in case of a network partition, it chooses consistency and sacrifices availability.
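The retry behaviour described in the first FAQ answer can be illustrated with a minimal sketch. This is not the dqlite client API: `NotLeaderError`, `Node`, `write_with_retry`, and the idea that the error carries a hint about the current leader are all hypothetical names and assumptions made for illustration only.

```python
class NotLeaderError(Exception):
    """Hypothetical stand-in for the ErrNotLeader error mentioned in the FAQ."""

    def __init__(self, leader_hint):
        super().__init__("not leader")
        # Assumption of this sketch: the error tells the client who the leader is.
        self.leader_hint = leader_hint


class Node:
    """Toy node that only accepts writes while it is the cluster leader."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster  # shared dict holding the current leader's name
        self.log = []

    def write(self, entry):
        if self.cluster["leader"] != self.name:
            raise NotLeaderError(self.cluster["leader"])
        self.log.append(entry)
        return self.name


def write_with_retry(nodes, start, entry, max_attempts=3):
    """Try a write, redirecting to the hinted leader after each rejection."""
    target = start
    for _ in range(max_attempts):
        try:
            return nodes[target].write(entry)
        except NotLeaderError as err:
            target = err.leader_hint
    raise RuntimeError("no leader accepted the write")


cluster = {"leader": "n2"}
nodes = {name: Node(name, cluster) for name in ("n1", "n2", "n3")}

# The client first contacts n1, is rejected, and retries against n2.
accepted_by = write_with_retry(nodes, "n1", "set x=1")
print(accepted_by)
```

Only the leader's log receives the entry; the rejected node never appends anything, which is why no conflicting writes can arise in this model.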

This suggests that microk8s would also sacrifice availability in favor of consistency. However, the dqlite documentation also mentions a configurable timeout, and I haven't found any further information on that.

The question
Assume we have a 6-node HA microk8s cluster that suffers a long-term network issue splitting the cluster in half (on each side of the split, 3 nodes can still reach each other). What happens to the microk8s cluster after network connectivity is restored?

MythicManiac

❤1 👍1

Most helpful comment

Hi @MythicManiac

Assuming we had a 6 node HA microk8s cluster, that suffered an unfortunate long-term network issue splitting the cluster in half (3 nodes can reach each other on both sides of the network split), what would happen to the microk8s cluster after network connectivity has been restored to normal?

The short answer: As soon as the cluster gets split, one part will freeze while the other continues working. When the two parts reconnect, the frozen part will resume working using the datastore state of the part that kept running.

The long answer: By default, MicroK8s uses dqlite to store the Kubernetes state. In dqlite, a leader node acts as the "gatekeeper" for all datastore requests and ensures the consistency of the data. The leader and two more nodes (three voters in total) each maintain a copy of the datastore. If the leader is unreachable, the remaining two voters still have quorum and can elect a new leader. When a split like the one you describe happens, one of the voter nodes will be separated from the other two.

The part of the cluster with one voter will freeze, because it cannot contact a majority of the voters.
The part of the cluster with the two voters will continue working. Since a majority of the voters is reachable, a new leader can be elected if needed. Another node will be promoted to voter so that there are three voters again.

When the two parts reconnect, the frozen one will be behind in the datastore log, so it will catch up to the most up-to-date copy.
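The quorum reasoning above can be sketched in a few lines. This is only an illustration of the majority rule, not MicroK8s or dqlite code; the node names, the choice of which nodes are voters, and the helper names `has_quorum` and `partition_status` are assumptions made for the example.

```python
def has_quorum(reachable_voters, total_voters):
    """A side has quorum only if it can reach a strict majority of the voters."""
    return reachable_voters > total_voters // 2


def partition_status(voters, side_a, side_b):
    """Label each side of a split as 'active' or 'frozen' by voter majority."""
    status = {}
    for label, side in (("A", side_a), ("B", side_b)):
        reachable = len(voters & side)
        status[label] = "active" if has_quorum(reachable, len(voters)) else "frozen"
    return status


# Hypothetical 6-node cluster with 3 voters, split 3/3 by the network.
voters = {"n1", "n2", "n3"}
side_a = {"n1", "n4", "n5"}  # contains one voter
side_b = {"n2", "n3", "n6"}  # contains two voters

status = partition_status(voters, side_a, side_b)
print(status)  # {'A': 'frozen', 'B': 'active'}
```

Side A sees only one of three voters and cannot commit writes, while side B sees two of three and keeps going, matching the description above.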

ktsakalozos on 12 Nov 2020

👍3

All 2 comments

Thanks for the clarification @ktsakalozos. The way you explained it sounds like the design prevents more than a single active partition at a time even in failure scenarios outside of my example. Would this be a fair assumption to make?

So that would mean failures are limited to either taking down the entire cluster or a portion of it, but partitioning should not be a concern.
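The arithmetic behind this assumption can be checked with a small sketch. It only covers the majority rule itself (two disjoint groups of voters can never both hold a strict majority) and says nothing about other failure modes; `majority` and `max_active_sides` are hypothetical helper names.

```python
def majority(total_voters):
    """Smallest number of voters that forms a strict majority."""
    return total_voters // 2 + 1


def max_active_sides(total_voters):
    """Largest number of sides that can hold quorum across all two-way splits."""
    worst = 0
    for side_a_size in range(total_voters + 1):
        side_b_size = total_voters - side_a_size
        active = int(side_a_size >= majority(total_voters)) + int(
            side_b_size >= majority(total_voters)
        )
        worst = max(worst, active)
    return worst


# Check every two-way split for voter counts 1 through 7.
results = {n: max_active_sides(n) for n in range(1, 8)}
print(results)
```

For every voter count checked, no split leaves more than one side with quorum, which is consistent with the "single active partition" reading of the explanation.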

MythicManiac on 12 Nov 2020
