Etcd: feature request: support recovery from multiple nodes

Created on 30 Dec 2019 · 5 comments · Source: etcd-io/etcd

etcd currently provides --force-new-cluster to support recovering a cluster from a single node.

I'm wondering if it could support recovery from multiple nodes?
For example, when 3 nodes go down in a 5-node cluster, we could recover from the remaining two nodes. The advantage is that you retain as much of the committed data as possible. And sometimes data loss can be avoided entirely: for instance, when 2 nodes go down in a 4-node cluster, at least one of the two remaining nodes must contain the latest committed data.
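To make the arithmetic behind that claim explicit, here is a small Go sketch (illustrative only, not etcd code) of the majority-intersection argument:

```go
package main

import "fmt"

// quorum returns the majority size for an n-node cluster.
func quorum(n int) int { return n/2 + 1 }

func main() {
	// 4-node cluster: quorum(4) == 3, so a committed entry is on at
	// least 3 nodes; if 2 nodes fail, the 2 survivors must include at
	// least 3 + 2 - 4 = 1 node holding every committed entry.
	fmt.Println(quorum(4)) // 3
	// 5-node cluster with 3 nodes down: 3 + 2 - 5 = 0, so the two
	// survivors may or may not hold the latest committed entries.
	fmt.Println(quorum(5)) // 3
}
```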

I've thought about it roughly; it seems we could honor --initial-cluster together with --force-new-cluster, deleting only the members that are not included in --initial-cluster and keeping all uncommitted raft log entries.
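As a minimal sketch of what the membership-filtering step might look like (the Member type, filterMembers, and the node names are hypothetical placeholders, not etcd's actual types):

```go
package main

import "fmt"

// Member is an illustrative stand-in for an etcd cluster member.
type Member struct {
	ID   uint64
	Name string
}

// filterMembers keeps only the members named in --initial-cluster,
// mirroring the proposed behavior: drop everyone else, keep the log.
func filterMembers(all []Member, keep map[string]bool) []Member {
	var kept []Member
	for _, m := range all {
		if keep[m.Name] {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	all := []Member{{1, "node1"}, {2, "node2"}, {3, "node3"}, {4, "node4"}, {5, "node5"}}
	survivors := map[string]bool{"node1": true, "node2": true}
	fmt.Println(filterMembers(all, survivors)) // [{1 node1} {2 node2}]
}
```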

Please consider it, thank you! (If it can work, I'd like to help implement this feature.)

Label: stale

All 5 comments

How does this scenario differ from recovering from just one node? If the data on the two nodes is consistent, then either node alone holds the full data set, so you could discard one of them and use the single-node procedure. If the two remaining nodes are inconsistent with one another, then you'll have to manually decide which one to trust - and at that point you're essentially recovering from just one node anyway. Since the raft log index increases with every write, there shouldn't ever be a situation where two nodes in a broken cluster hold independently valid writes that would need to be merged together, right?

Am I overlooking another possible scenario?

> If the two remaining nodes are inconsistent with one another, then you'll have to manually decide which one to trust - and at that point you're essentially recovering from just one node anyway.

I was thinking about this scenario. If the cluster is allowed to start with the 2 nodes, there is no need to manually determine which node has the latest data, because raft will automatically elect the node with the more up-to-date log as the new leader.
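For context, this "node with more data wins" behavior is Raft's election restriction (§5.4.1 of the Raft paper): a node grants its vote only to a candidate whose log is at least as up-to-date as its own. A self-contained Go sketch of that comparison:

```go
package main

import "fmt"

// logUpToDate reports whether a candidate's log is at least as
// up-to-date as the voter's: higher last term wins; on equal terms,
// the longer log wins.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	// A voter whose last entry is (term 2, index 7) grants its vote
	// to a candidate at (term 2, index 9) but not (term 2, index 6).
	fmt.Println(logUpToDate(2, 9, 2, 7)) // true
	fmt.Println(logUpToDate(2, 6, 2, 7)) // false
}
```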

Ok. This should be the only way that situation occurs, right?

Given a 3-node cluster, a client writes to 1, and 2 agrees, giving quorum. The write is then committed on 1 and 2. Say 2 goes down permanently before the write has reached 3, and 1 and 3 also happen to go down at the same moment (or at least before 3 is up to date). Maybe power fails on all three nodes, and node 2's disk is also, coincidentally, damaged. What needs to happen is that 3 trusts the committed write once 1 and 3 are brought back up and resume heartbeating together, right? I need to review Raft, but I think:

  • If 1 is the leader, then 3 should trust 1's log when they come back up, and the cluster is just degraded.
  • If 2 is the leader, then 1 and 3 would hold a leader election within the degraded cluster.
  • If 3 is the leader, then this situation wouldn't arise.
  • If the cluster had more than 3 nodes, then 1 and 3 are a minority.

    • If neither was the leader, it's time for disaster recovery.

    • If 1 was the leader, then there needs to be a procedure to shrink the cluster to only the remaining live nodes; at that point quorum could be reached again, and the new, smaller cluster resumes following the same old leader (see the sketch after this list).

That last sub-bullet is the place where this procedure would be useful, right? Am I going astray with my thinking here?
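To illustrate that last sub-bullet, a small Go sketch (cluster sizes chosen for illustration) of why shrinking the member list restores quorum:

```go
package main

import "fmt"

// quorum returns the majority size for an n-node cluster.
func quorum(n int) int { return n/2 + 1 }

// hasQuorum reports whether `live` nodes can reach quorum in a
// cluster configured with `size` members.
func hasQuorum(size, live int) bool { return live >= quorum(size) }

func main() {
	// 5-node cluster with only 2 nodes alive: no quorum.
	fmt.Println(hasQuorum(5, 2)) // false
	// After forcibly shrinking the membership to the 2 live nodes,
	// quorum(2) == 2, so the pair can elect a leader and resume.
	fmt.Println(hasQuorum(2, 2)) // true
}
```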

The main point of my proposal is that if only single-node recovery is available, we must pick, out of the two remaining nodes, the one that contains the most data to recover from. But if we can support recovery from multiple nodes, then we can start the cluster without making that judgment and let raft automatically elect the appropriate leader.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
