Longhorn: Feature Request: Pause replica rebuild for server maintenance

Created on 3 Oct 2018  ·  22 Comments  ·  Source: longhorn/longhorn

This is a follow up to https://forums.rancher.com/t/best-practice-when-rebooting-longhorn-hosts/11899

Issue:
When performing planned maintenance on a Longhorn node, as soon as the system is shut down, Longhorn will rebuild any replicas on the host onto a remaining host. In the case where the number of replicas is equal to the number of hosts, Longhorn will end up with two replicas on one host, and one on the other host. When the rebooted host comes back, Longhorn does not relocate the duplicated replica back to the rebooted host. The user has to manually delete the replica from the host that has two, and then Longhorn will rebuild it on the rebooted host.

Proposed solution:
Add an option similar to the CEPH command ceph osd set noout so that replicas won't be rebuilt if a server is shut down for maintenance. As Sheng pointed out, the replicas on the rebooted node will be out of sync and will still need to be rebuilt, but at least it will not have to replicate the data to another node, only to have it be deleted to get moved back.

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#stopping-w-out-rebalancing
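To make the request concrete, here is a toy model of the behaviour described above. It is only a sketch under stated assumptions: `place_replicas` and the `rebuild_paused` flag are hypothetical names for illustration, and the "least loaded host" rule is an assumption, not Longhorn's actual scheduler.

```python
from collections import Counter


def place_replicas(replicas, live_hosts, down_host, rebuild_paused):
    """Return the replica -> host placement after `down_host` goes offline.

    replicas: dict mapping replica name -> host it currently lives on.
    rebuild_paused: hypothetical maintenance flag (similar in spirit to
    `ceph osd set noout`); when True, nothing is rescheduled.
    """
    placement = dict(replicas)
    for name, host in replicas.items():
        if host != down_host:
            continue
        if rebuild_paused:
            # Leave the (now stale) replica assigned to its original host.
            continue
        # Assumed rule: rebuild on the live host holding the fewest replicas.
        load = Counter(h for h in placement.values() if h in live_hosts)
        placement[name] = min(live_hosts, key=lambda h: (load[h], h))
    return placement


volume = {"r1": "host-a", "r2": "host-b", "r3": "host-c"}
live = ["host-a", "host-b"]

# Without a pause, host-a ends up holding two replicas of the same volume.
print(place_replicas(volume, live, "host-c", rebuild_paused=False))
# With the pause, the layout is untouched while host-c reboots.
print(place_replicas(volume, live, "host-c", rebuild_paused=True))
```

In the paused case the replica on the rebooted host would still need to be resynchronised afterwards, but no data is copied to another node first.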

area/manager enhancement highlight priority/1 require/automation-e2e require/doc

All 22 comments

Similar issues here, three replicas on three nodes isn't fun. Why does Longhorn allow multiple replicas of the same volume to be scheduled on the same node at all? I'd very much prefer an option to stop this from happening over a maintenance mode that temporarily disables rescheduling.

@rbq Yeah, we are planning to add this support. In the meantime, you can temporarily use Update Replica Count to set the count to n - 1 while doing the maintenance, which instructs Longhorn not to rebuild during that period. After maintenance is done, you can increase the number back to its normal level, which prompts Longhorn to rebuild on the node.
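The reasoning behind this workaround can be sketched as follows. This assumes, for illustration only, that a rebuild is triggered whenever the number of healthy replicas falls below the desired count; `needs_rebuild` is a hypothetical helper, not a Longhorn function.

```python
def needs_rebuild(healthy_replicas: int, desired_replicas: int) -> bool:
    """Assumed trigger: rebuild only while the volume has fewer healthy
    replicas than the configured replica count."""
    return healthy_replicas < desired_replicas


# Three nodes, three replicas, one node taken down for maintenance.
print(needs_rebuild(healthy_replicas=2, desired_replicas=3))  # rebuild wanted
# Replica count lowered to n - 1 before the maintenance: nothing to rebuild.
print(needs_rebuild(healthy_replicas=2, desired_replicas=2))
# Count restored afterwards: the rebuild is requested again.
print(needs_rebuild(healthy_replicas=2, desired_replicas=3))
```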

586 should help with this.

Consider using this feature for maintenance in Longhorn v1.0.0.

We're planning to introduce a global setting replica-concurrent-rebuild-limit to limit the number of replica rebuilds happening at the same time in the cluster.

  1. The default limit will be 10.
  2. The limit can be set to 0, so no replica rebuild will happen after that.
  3. It will not affect rebuilds already in progress: setting the limit to 0 will not remove any replicas that are currently rebuilding.
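The three rules above can be sketched as a small admission check. This is a minimal single-process illustration; the class and method names are hypothetical and this is not the longhorn-manager implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RebuildLimiter:
    """Toy model of the proposed replica-concurrent-rebuild-limit setting."""

    limit: int = 10  # default from the proposal
    in_progress: set = field(default_factory=set)

    def try_start(self, replica: str) -> bool:
        # A limit of 0 rejects every new rebuild, because 0 >= 0 always holds.
        if len(self.in_progress) >= self.limit:
            return False
        self.in_progress.add(replica)
        return True

    def finish(self, replica: str) -> None:
        self.in_progress.discard(replica)


limiter = RebuildLimiter(limit=3)
print([limiter.try_start(f"vol-{i}") for i in range(1, 6)])  # 3 of 5 admitted
limiter.finish("vol-1")
print(limiter.try_start("vol-4"))  # a slot is free again

limiter.limit = 0  # pause new rebuilds
print(sorted(limiter.in_progress))  # running rebuilds are left alone
print(limiter.try_start("vol-5"))   # nothing new starts
```

Lowering the limit only affects admission of new rebuilds; the set of running rebuilds is never trimmed.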

Pre-merged Checklist

  • [x] Does the PR include the explanation for the fix or the feature?
  • [x] Is the backend code merged?
  • [x] Is the reproduce steps/test steps documented?
  • [x] If labeled: area/ui Has the UI issue filed or ready to be merged?
  • [x] If labeled: require-automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
  • [x] if labeled: require-automation-engine Has the engine integration test been merged?
  • [x] if labeled: require-doc Has the necessary document PR submitted or merged?

Test steps are the same as the test case skeleton. No additional manual test required.

Verified on master - 05/18/2020

Steps:

Scenario#1

  • In a custom cluster - 5 worker nodes, cordon W1
  • Create 5 volumes with 4 replicas each. Use in 5 different workloads and write data into the volume.
  • Set Replica Rebuild Concurrent Limit to 3 in the Settings page
  • Uncordon W1.
  • Delete W2.
  • All the volumes will be in degraded state
  • 3 out of the 5 volumes are picked for replica rebuild.
  • When 1 finishes, 4th volume is picked for replica rebuild.
  • Data in all the volumes is intact

Scenario#2

  • Set Replica Rebuild Concurrent Limit to 2 in the Settings page
  • After scenario 1, Delete two worker nodes. Number of worker nodes now: 3
  • Volumes will be in degraded state. 2 replicas would have failed on all the volumes.
  • Add 2 more custom worker nodes in the cluster. Wait for the nodes to register.
  • On all the nodes, 1 replica is in Rebuilding process. Expected: only 2 replicas on 2 different nodes should be in Rebuilding process
  • When the first replica is rebuilt on all the worker nodes, 1 replica each on 2 worker nodes is picked for rebuilding.
  • When 1 replica finishes rebuilding, now total - 4 replicas on 4 different nodes are in rebuilding state at the same time.

Attaching logs for info.
longhorn-support-bundle_c628da37-00af-427b-9bfa-a1716632defe_2020-05-19T03-35-45Z.zip

@sowmyav27 on scenario 2, what does "On all the nodes, 1 replica is in Rebuilding process." mean? We have 4 nodes now; does that mean all four nodes are rebuilding (instead of the expected 2)? Also, I don't understand "replicas on 2 worker nodes are picked for rebuilding." or "When 1 finishes, the remaining 3 - now total - 4 replicas on 4 different nodes are in rebuilding state at the same time."

@sowmyav27 nvm, I've reproduced the issue. It seems the Kubernetes lister cache on each node is slightly delayed, which results in a race condition when calculating the total number. Let me see if it works if we change it to a per-node limit.

@sowmyav27 It turns out it's hard to guarantee that the concurrent rebuild count never exceeds the setting, because multiple rebuilding processes might start counting the replicas in the system at the same time. The updated version is better than the previous one, but I am still able to get a slightly higher rebuild count sometimes.
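The race described here is a check-then-act problem, and it can be reproduced deterministically in a toy model. In this sketch, which is an assumption-laden illustration and not the longhorn-manager code, every node decides from the same stale snapshot of the rebuild count; the function names are hypothetical.

```python
def admit_with_stale_cache(limit, observed_count, candidates):
    """Each node checks the same cached count before any other node's
    newly started rebuild becomes visible, so all checks pass together."""
    return [replica for replica in candidates if observed_count < limit]


def admit_serialized(limit, current_count, candidates):
    """Reference behaviour: decisions are made one at a time and each one
    sees the rebuilds admitted before it."""
    admitted = []
    for replica in candidates:
        if current_count + len(admitted) < limit:
            admitted.append(replica)
    return admitted


candidates = ["r-node1", "r-node2", "r-node3", "r-node4"]
print(len(admit_with_stale_cache(2, 0, candidates)))  # 4, exceeds the limit of 2
print(len(admit_serialized(2, 0, candidates)))        # 2, respects the limit
```

The stale-cache case matches what scenario 2 observed: four replicas rebuilding on four nodes with the limit set to 2.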

We can focus on validating the case where it is set to 0. In other cases, we shouldn't see a big difference between the limit and the actual rebuild number. The actual number should match or be only slightly higher.

@sowmyav27 We will need to redo this feature. For now, we will only give the user the option to disable replica rebuild across the cluster. We need to go back to the whiteboard for the concurrent limit.

@shubb30 and @rbq , we are thinking this might duplicate the upcoming feature "rebuild replica with existing data". That feature will have a parameter ReplicaRebuildWaitTime which delays the rebuild, so if you need to bring the node down for a short period of time, it won't trigger a rebuild.

Please let us know what you think, or what your use case is if this doesn't fit it.

Thanks,
Bo

Hi Bo,

Unfortunately, after using Longhorn for a while, we found that it was not a good fit for us, so we have switched to CEPH for volume storage.

That being said, if the new parameter ReplicaRebuildWaitTime is what it sounds like, which is a time that you enter (e.g. 10 minutes), I still don't think that is what is needed. As I said in the original post, CEPH has an option to turn rebuilding off and on, which is what I believe is needed. I may not know how long the maintenance may take if I have 10 or 20 nodes that I need to work on. I would need to keep track of how long the maintenance is taking, and if I am nearing the end of my "wait time" and am not finished, then I would need to extend that time somehow. Also, if I finished the maintenance much sooner than the window, my cluster would still be in the "no rebuild" mode for longer than needed.

For me, I would still prefer a "Turn off rebuilding" and "Turn on rebuilding" switch so that I can decide exactly when I want the normal functionality to resume.

@shubb30 I am wondering what the reasons are that Longhorn doesn't fit your use case well. It would be great if you could share them with us. We'd like to collect more feedback to improve Longhorn continuously.

@shubb30 , thank you for your response. And we're sorry that currently Longhorn doesn't fit your use case, and thank you for your feedback.

[鈥 in the coming feature, we will have a parameter ReplicaRebuildWaitTime which will delay the rebuild if you need to bring the node down for a short period of time [鈥 Please let us know what you think or what your use case if it doesn't fit your use case.

To me the proposed ReplicaRebuildWaitTime delay actually sounds more useful than an explicit maintenance mode. I'd like to automate node updates as far as possible, including automatic restarts using kured. From what I understand I could make Longhorn wait for the host to come back up for a reasonable timespan (say 30 minutes) and otherwise make sure the affected replicas find a new home. Sounds just perfect to me!
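The wait-time behaviour as understood in this comment could look roughly like the sketch below. Only the parameter name ReplicaRebuildWaitTime comes from the thread; the function and its exact comparison are assumptions for illustration.

```python
from datetime import datetime, timedelta


def should_start_rebuild(node_down_since: datetime, now: datetime,
                         wait_time: timedelta) -> bool:
    """Assumed semantics: hold off rebuilding elsewhere until the node has
    been down for at least `wait_time`."""
    return now - node_down_since >= wait_time


down_at = datetime(2020, 1, 1, 12, 0)
wait = timedelta(minutes=30)  # the "say 30 minutes" example from the comment

# Node rebooted by kured and back within the window: no rebuild elsewhere.
print(should_start_rebuild(down_at, down_at + timedelta(minutes=10), wait))
# Node still down after the window: replicas should find a new home.
print(should_start_rebuild(down_at, down_at + timedelta(minutes=45), wait))
```

A fixed window like this suits automated reboots of known duration; it does not cover open-ended maintenance, which is the case the explicit on/off switch addresses.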

@rbq , thanks for your response. An update about this feature: we will still add this explicit "stop the rebuild" feature, and you can choose whichever one fits your use case best. Thanks.

Note:

This 'Disable Replica Rebuild' setting pauses all rebuilds across the cluster, so the eviction and data locality features won't work while it is enabled. Restoring a disaster recovery volume and replicas that are already rebuilding still work as expected.

A couple of test cases I have run:

Set 'Disable Replica Rebuild' to true for all the cases.

Eviction case:

  • Create a volume with 1 replica and attach it to node-1, and enable eviction on node-1. Eviction won't happen.

Data locality case:

  • Create a volume with 1 replica and data locality best-effort (replica on node-2), and attach it to node-1. Data locality won't happen.

Rebuilding replica case:

  • While a rebuild is in progress, set 'Disable Replica Rebuild' to true; the rebuild will still finish.

Restore DR volume:

  • Create a volume with 1 replica and attach it to node-1, write some data to it and take backup. Then delete the volume and create DR volume from the backup, and restore the DR volume. The restoration works fine.
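The behaviour covered by the note and test cases above can be summarised as a small decision function. This is a sketch only: the operation names are hypothetical labels for the cases described in this thread, and the function is not part of Longhorn.

```python
# Operations the thread reports as not triggering while the setting is on.
BLOCKED_WHEN_DISABLED = {
    "node_down_rebuild",
    "replica_deleted_rebuild",
    "replica_count_increase",
    "eviction",
    "data_locality",
}
# Operations the thread reports as still working.
STILL_ALLOWED = {"restore_backup", "restore_dr_volume"}


def is_allowed(operation: str, disable_replica_rebuild: bool,
               already_in_progress: bool = False) -> bool:
    """Return whether `operation` proceeds under 'Disable Replica Rebuild'."""
    if operation not in BLOCKED_WHEN_DISABLED | STILL_ALLOWED:
        raise ValueError(f"unknown operation: {operation}")
    if not disable_replica_rebuild:
        return True
    if already_in_progress:
        # A rebuild that started before the setting was enabled finishes.
        return True
    return operation in STILL_ALLOWED


print(is_allowed("eviction", disable_replica_rebuild=True))
print(is_allowed("restore_dr_volume", disable_replica_rebuild=True))
print(is_allowed("node_down_rebuild", True, already_in_progress=True))
```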

Pre-merged Checklist

  • [x] Does the PR include the explanation for the fix or the feature?

  • [x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
    The PR is at https://github.com/longhorn/longhorn-manager/pull/710
    Add 'Disable Replica Rebuild' to default setting PR: https://github.com/longhorn/longhorn-manager/pull/718

  • [x] Is the reproduce steps/test steps documented?

  • [x] Which areas/issues this PR might have potential impacts on?
    Area eviction, data locality, rebuilding related features.
    Issues

  • [x] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

  • [x] If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at

  • [x] if labeled: require/doc Has the necessary document PR submitted or merged?
    The Doc issue/PR is at https://github.com/longhorn/website/pull/206

  • [x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
    The automation skeleton PR is at https://github.com/longhorn/longhorn-tests/pull/434
    The automation test case PR is at

  • [x] if labeled: require/automation-engine Has the engine integration test been merged?
    The engine automation PR is at

  • [x] if labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

Validated with longhorn-master 10/19/2020

Validation - Pass

Validated the feature.

  1. With 'Disable Replica Rebuild' enabled, the rebuild doesn't trigger in case of a node shutdown/reboot, deletion of a replica, an increased replica count, eviction of a node, or data locality.
  2. With 'Disable Replica Rebuild' enabled, restoring a backup and a DR volume still works, and the rebuild happens.