This is a follow-up to https://forums.rancher.com/t/best-practice-when-rebooting-longhorn-hosts/11899
Issue:
When performing planned maintenance on a Longhorn node, as soon as the system is shut down, Longhorn will rebuild any replicas on the host onto a remaining host. In the case where the number of replicas is equal to the number of hosts, Longhorn will end up with two replicas on one host, and one on the other host. When the rebooted host comes back, Longhorn does not relocate the duplicated replica back to the rebooted host. The user has to manually delete the replica from the host that has two, and then Longhorn will rebuild it on the rebooted host.
Proposed solution:
Add an option similar to the CEPH command ceph osd set noout so that replicas won't be rebuilt if a server is shut down for maintenance. As Sheng pointed out, the replicas on the rebooted node will be out of sync and will still need to be rebuilt, but at least it will not have to replicate the data to another node, only to have it be deleted to get moved back.
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#stopping-w-out-rebalancing
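To make the scenario above concrete, here is a minimal sketch that models one volume with three replicas on three hosts. It is only an illustration: the function `simulate_reboot`, its parameters, and the "least-loaded surviving node" placement rule are assumptions made for this example, not Longhorn's actual scheduler.

```python
from collections import Counter


def simulate_reboot(nodes, replicas, down_node, rebuild_enabled=True):
    """Return the replica count per node after `down_node` has rebooted.

    `replicas` lists the node of each replica of a single volume.
    Hypothetical model: placement picks the least-loaded surviving node.
    """
    # Replicas on the node that goes down are considered failed.
    healthy = [n for n in replicas if n != down_node]
    failed = len(replicas) - len(healthy)
    survivors = [n for n in nodes if n != down_node]

    if rebuild_enabled:
        # Immediate rebuild: every failed replica is recreated on a survivor.
        for _ in range(failed):
            load = Counter(healthy)
            target = min(survivors, key=lambda n: load[n])
            healthy.append(target)
    else:
        # Rebuild paused (noout-style): the replica is rebuilt in place
        # once the node is back, so the layout does not change.
        healthy.extend([down_node] * failed)
    return Counter(healthy)


nodes = ["host-a", "host-b", "host-c"]
replicas = ["host-a", "host-b", "host-c"]
with_rebuild = simulate_reboot(nodes, replicas, "host-c")
paused = simulate_reboot(nodes, replicas, "host-c", rebuild_enabled=False)
```

With rebuilding enabled, one surviving host ends up holding two replicas and the rebooted host holds none. With rebuilding paused, the original one-replica-per-host layout is kept.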
Similar issues here, three replicas on three nodes isn't fun. Why does Longhorn allow multiple replicas of the same volume to be scheduled on the same node at all? I'd very much prefer an option to stop this from happening over a maintenance mode that temporarily disables rescheduling.
@rbq Yeah, we are planning to add this support. In the meantime, you can temporarily use Update Replica Count to set the count to n - 1 while doing the maintenance, which will instruct Longhorn to stop rebuilding during that period. After maintenance is done, you can increase the number back to the normal level, which prompts Longhorn to rebuild the replica on the node.
Consider using this feature for maintenance in Longhorn v1.0.0.
We're planning to introduce a global setting replica-concurrent-rebuild-limit to limit the number of rebuilds happening at the same time in the cluster.
Pre-merged Checklist
Test steps are the same as the test case skeleton. No additional manual test required.
Filed https://github.com/longhorn/longhorn/issues/1348 for setting doc.
Verified on master - 05/18/2020
Steps:
Scenario#1
Set Replica Rebuild Concurrent Limit to 3 in the Settings page
Scenario#2
Set Replica Rebuild Concurrent Limit to 2 in the Settings page
Rebuilding process. Expected: only 2 replicas on 2 different nodes should be in Rebuilding process
Attaching logs for info.
longhorn-support-bundle_c628da37-00af-427b-9bfa-a1716632defe_2020-05-19T03-35-45Z.zip
@sowmyav27 On scenario 2, what does "On all the nodes, 1 replica is in Rebuilding process." mean? We have 4 nodes now, so does that mean all four nodes are rebuilding (instead of the expected 2)? Also, I don't understand "replicas on 2 worker nodes are picked for rebuilding." or "When 1 finishes, the remaining 3 - now total - 4 replicas on 4 different nodes are in rebuilding state at the same time."
@sowmyav27 nvm, I've reproduced the issue. It seems the Kubernetes lister cache on each node is delayed a bit, which results in a race condition while calculating the total number. Let me see if it works if we change it to a per-node limit.
@sowmyav27 It turns out it's hard to guarantee that the concurrent rebuild count never exceeds the setting, because multiple rebuilding processes might start counting the replicas in the system at the same time. The updated version is better than the previous one, but I am still able to get a slightly higher rebuild count sometimes.
We can focus on validating the case where it is set to 0. In other cases, we shouldn't see a big difference between the limit and the actual rebuild number: the actual number should match the limit or be only slightly higher.
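The race described above is a check-then-act problem: each controller compares a cached count against the limit and then starts a rebuild, so several can pass the check before any cache reflects the new rebuilds. A minimal, deterministic sketch of that effect follows. The function `start_rebuilds` and the `refresh_every` parameter are hypothetical names for this illustration and do not correspond to longhorn-manager code.

```python
def start_rebuilds(limit, requests, refresh_every):
    """Count rebuilds started when every check reads a cached total.

    `refresh_every` is how many requests are handled between cache
    refreshes; 1 means every check sees the true count.
    """
    running = 0  # true number of rebuilds in progress
    cached = 0   # what the (possibly stale) cache reports
    for i in range(requests):
        if i % refresh_every == 0:
            cached = running  # cache catches up with reality
        if cached < limit:    # check against a possibly stale value
            running += 1      # act: start another rebuild
    return running


fresh = start_rebuilds(limit=2, requests=4, refresh_every=1)
stale = start_rebuilds(limit=2, requests=4, refresh_every=4)
disabled = start_rebuilds(limit=0, requests=4, refresh_every=4)
```

With a fresh cache the limit of 2 holds. With a stale cache all 4 requests pass the check, which matches the four concurrent rebuilds seen in scenario 2. A limit of 0 is safe either way, because no cached count can be below 0, which is why validating the 0 case is reliable.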
@sowmyav27 We will need to redo this feature. For now, we will only give the user the option to disable replica rebuild across the cluster. We need to go back to the whiteboard for the concurrent limit.
@shubb30 and @rbq, we are thinking this might duplicate the upcoming feature "rebuild replica with existing data". That feature will have a parameter ReplicaRebuildWaitTime, which delays the rebuild, so bringing a node down for a short period of time won't trigger one.
Please let us know what you think, or what your use case is if this doesn't fit it.
Thanks,
Bo
Hi Bo,
Unfortunately, after using Longhorn for a while, we found that it was not a good fit for us, so we have switched to CEPH for volume storage.
That being said, if the new parameter ReplicaRebuildWaitTime is what it sounds like, which is a time that you enter (e.g. 10 minutes), I still don't think that is what is needed. As I said in the original post, CEPH has an option to turn rebuilding off and on, which is what I believe is needed. I may not know how long the maintenance will take if I have 10 or 20 nodes to work on. I would need to keep track of how long the maintenance is taking, and if I am nearing the end of my "wait time" and am not finished, I would need to extend that time somehow. Also, if I finish the maintenance much sooner than the window, my cluster will still be in "no rebuild" mode for longer than needed.
For me, I would still prefer a "Turn off rebuilding" and "Turn on rebuilding" switch so that I can decide exactly when I want the normal functionality to resume.
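The difference between the two approaches discussed here can be sketched as a single decision function. This is only an illustration under stated assumptions: `should_rebuild`, `wait_time`, and `rebuild_disabled` are hypothetical names modelling a ReplicaRebuildWaitTime-style delay and an on/off switch, not Longhorn's implementation.

```python
from datetime import datetime, timedelta


def should_rebuild(now, failed_at, wait_time=None, rebuild_disabled=False):
    """Decide whether a failed replica may be rebuilt elsewhere.

    Hypothetical model of the two options in this thread:
    - rebuild_disabled: explicit switch, blocks rebuilds until turned off
    - wait_time: delay after the failure before a rebuild may start
    """
    if rebuild_disabled:
        return False  # the operator decides when rebuilding resumes
    if wait_time is None:
        return True   # no delay configured: rebuild immediately
    return now - failed_at >= wait_time


t0 = datetime(2020, 1, 1, 12, 0)
wait = timedelta(minutes=10)
during_window = should_rebuild(t0 + timedelta(minutes=5), t0, wait_time=wait)
after_window = should_rebuild(t0 + timedelta(minutes=15), t0, wait_time=wait)
switched_off = should_rebuild(t0 + timedelta(hours=3), t0, rebuild_disabled=True)
```

In this model the delay expires on its own, which suits unattended reboots, while the switch holds for as long as the maintenance actually takes.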
@shubb30 I am wondering why Longhorn doesn't fit your use case well. It would be great if you could share the reasons with us. We'd like to collect more feedback to improve Longhorn continuously.
@shubb30 , thank you for your response. And we're sorry that currently Longhorn doesn't fit your use case, and thank you for your feedback.
[…] in the coming feature, we will have a parameter ReplicaRebuildWaitTime which will delay the rebuild if you need to bring the node down for a short period of time […] Please let us know what you think or what your use case if it doesn't fit your use case.
To me the proposed ReplicaRebuildWaitTime delay actually sounds more useful than an explicit maintenance mode. I'd like to automate node updates as far as possible, including automatic restarts using kured. From what I understand I could make Longhorn wait for the host to come back up for a reasonable timespan (say 30 minutes) and otherwise make sure the affected replicas find a new home. Sounds just perfect to me!
@rbq, thanks for your response. An update on this feature: we will still add the explicit option to stop rebuilding, and you can choose whichever fits your use case best. Thanks.
Note:
This 'Disable Replica Rebuild' setting pauses all rebuilds across the cluster, so the eviction and data locality features won't work while it is enabled. Restoring a disaster recovery volume and replicas that are already rebuilding still work as expected.
A couple of test cases I have done:
Set 'Disable Replica Rebuild' to true for all the cases.
Eviction case:
Data locality case:
best-effort (replica on node-2), and attach it to node-1. Data locality won't happen.
Rebuilding replica case:
Restore DR volume:
[x] Does the PR include the explanation for the fix or the feature?
[x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
The PR is at https://github.com/longhorn/longhorn-manager/pull/710
Add 'Disable Replica Rebuild' to default setting PR: https://github.com/longhorn/longhorn-manager/pull/718
[x] Is the reproduce steps/test steps documented?
[x] Which areas/issues this PR might have potential impacts on?
Area eviction, data locality, rebuilding related features.
Issues
[x] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at
[x] If labeled: area/ui Has the UI issue filed or ready to be merged?
The UI issue/PR is at
[x] if labeled: require/doc Has the necessary document PR submitted or merged?
The Doc issue/PR is at https://github.com/longhorn/website/pull/206
[x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
The automation skeleton PR is at https://github.com/longhorn/longhorn-tests/pull/434
The automation test case PR is at
[x] if labeled: require/automation-engine Has the engine integration test been merged?
The engine automation PR is at
[x] if labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan is at
Validated with longhorn-master 10/19/2020
Validation - Pass
Validated the feature.