etcd currently provides --force-new-cluster to support recovering a cluster from a single node.
I'm wondering if it can support recovery from multiple nodes?
For example, when 3 nodes are down in a 5-node cluster, we could recover from the remaining two nodes. The advantage is that as much of the committed data as possible is retained. Sometimes data loss can be avoided completely, for instance when 2 nodes are down in a 4-node cluster (at least one of the remaining two nodes must contain the latest committed data).
I have only thought about it roughly, but it seems we could combine --initial-cluster with --force-new-cluster: delete only the members that are not listed in --initial-cluster, and keep all uncommitted raft log entries.
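The arithmetic behind the two examples can be sketched as follows. This is a minimal illustration, not etcd code: the helper names quorum and guaranteedHolders are made up here, and the only assumption is that a committed entry has been persisted on at least a majority of members.

```go
package main

import "fmt"

// quorum returns the majority size for a cluster of n voting members.
func quorum(n int) int {
	return n/2 + 1
}

// guaranteedHolders returns the minimum number of surviving members that
// are certain to hold any given committed entry, assuming the entry was
// persisted on at least a quorum before `failed` members were lost.
// (Hypothetical helper for illustration only.)
func guaranteedHolders(clusterSize, failed int) int {
	h := quorum(clusterSize) - failed
	if h < 0 {
		return 0
	}
	return h
}

func main() {
	// 5-node cluster, 3 down: no survivor is guaranteed to have every commit.
	fmt.Println(guaranteedHolders(5, 3))
	// 4-node cluster, 2 down: at least one survivor has every commit.
	fmt.Println(guaranteedHolders(4, 2))
}
```

In the 5-node case the result is 0, so some committed data may be lost no matter how many of the two survivors are used; in the 4-node case it is 1, which is why starting from both survivors could avoid data loss.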
Please consider it, thank you! (If it can work, I'd like to help implement this feature.)
How does this scenario differ from recovering using just one node? If the data on the two nodes is consistent, then either node on its own holds that same data, so you could just discard one of the nodes and use the one-node procedure. If the two remaining nodes are inconsistent with one another, then you'll have to manually decide which one to trust - and at that point, you're essentially recovering just one node anyway. With the counter increasing with every write, there shouldn't ever be a situation where two nodes in a broken cluster have independently-valid writes that would need to be merged together, right?
Am I overlooking another possible scenario?
If the two remaining nodes are inconsistent with one another, then you'll have to manually decide which one to trust - and at that point, you're essentially recovering just one node anyway.
I was thinking about this scenario. If starting with 2 nodes were allowed, there would be no need to manually determine which node has the latest data, because raft will automatically elect the node with the more up-to-date log as the new leader.
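For reference, the rule a raft voter applies when deciding whether a candidate's log is at least as current as its own can be sketched like this. It follows the comparison described in the Raft paper (last log term first, then last log index); the type and function names are illustrative and are not taken from etcd's source.

```go
package main

import "fmt"

// logPosition identifies the last entry in a member's raft log.
// (Illustrative type, not an etcd API.)
type logPosition struct {
	Term  uint64
	Index uint64
}

// isUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: a higher last term wins, and with equal
// terms the longer (or equally long) log wins.
func isUpToDate(candidate, voter logPosition) bool {
	if candidate.Term != voter.Term {
		return candidate.Term > voter.Term
	}
	return candidate.Index >= voter.Index
}

func main() {
	a := logPosition{Term: 3, Index: 10}
	b := logPosition{Term: 2, Index: 15}

	fmt.Println(isUpToDate(a, b)) // a has the higher last term, so b can vote for a
	fmt.Println(isUpToDate(b, a)) // b is behind on term, so a refuses to vote for b
}
```

Note that the comparison is by term before length, so the elected node is the one with the most current log rather than simply the one with the most entries.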
Ok. This should be the only way that situation occurs, right?
Given a 3-node cluster, a client writes to node 1, and node 2 acknowledges the write, giving quorum. The write is then committed on 1 and 2. Say 2 goes down permanently before the write has reached 3, and 1 and 3 also happen to go down at the same moment (or, at least, before 3 is up-to-date). Maybe power fails on all three nodes, and node 2's disk is also coincidentally damaged. What needs to happen is that 3 should trust the "approved" write when 1 and 3 are brought back up and resume heartbeating together? I need to review Raft, but I think:
That last sub-bullet is the place where this procedure would be useful, right? Am I going astray with my thinking here?
My main point is that if only single-node recovery is available, we must pick, from the remaining two nodes, the one that contains as much data as possible and recover from it. But if recovery from multiple nodes were supported, we could start both without making that judgment and let raft automatically elect the appropriate leader.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.