Describe the bug
When the node is back from the network disconnections within a short period(5~6mins), all pods on the node won't be evicted. Then some unexpected issues will be triggered.
Scenario 1 (reproduce/always):
The attached volume will become faulted, then get salvaged and reattached if the setting auto-salvage is enabled.
Scenario 2 (reproduce/often):
The volume engines on the recovered node won't be updated. This means the volumes on the node cannot be attached/detached/deleted successfully and get stuck there.
To Reproduce
Steps for Scenario 1:
auto-salvage.Faulted.Steps for Scenario 2:
Attached or Attaching state rather than becoming Healthy. Attaching state.Expected behavior
For Scenario 1:
The volume should be detached and reattached automatically and the replicas on the running nodes should be fine.
For Scenario 2:
The volumes on the recovered node should work fine.
Environment:
Additional context
When the node is disconnected, the related engine manager will wrongly mark all replicas for all running engine processes as ERR due to the Ping-timeout error. After the engine manager pod is back, longhorn managers will get those wrong engine processes status then lead to the volumes becoming Faulted.
Scenario 2 may be the cause of the unexpected validation failures mentioned in #1327, #1536, #1451
For Scenario 1, we can directly delete the related pod if the instance manager object becomes Unknown. Then longhorn manager pods won't accept wrong engine/replica info from the instance manager pod on the back node.
Scenario 2 is caused by the unexpected deadlock in the engine controller after the node is back:
https://github.com/longhorn/longhorn-manager/blob/3879d122665f08841979516bfeb5ada6bcdaec14/controller/engine_controller.go#L313
Here is how the deadlock gets triggered:
vol1-e-89a70feb becomes Unknown due to the network disconnection.time="2020-07-07T08:03:50Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"shuo-longhorn-cluster-1-worker-1\", UID:\"74ffe5f5-bdbe-43cf-9ee1-042d11c9253f\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"408815\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node shuo-longhorn-cluster-1-worker-1 not ready: NodeStatusUnknown"
time="2020-07-07T08:03:50Z" level=debug msg="Instance handler updated instance vol1-e-89a70feb state, old state running, new state unknown"
md5-d2c4f1e2baaca332e7e8aa4234a80f92
time="2020-07-07T08:03:50Z" level=debug msg="Call stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Prepare to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Succeed to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Complete stopMonitoring for engine vol1-e-89a70feb"
md5-010eaf310489763d368b77019708fd56
time="2020-07-07T08:03:55Z" level=debug msg="Engine monitor vol1-e-89a70feb starts a loop"
md5-066db99ed5fd9b3e65d20ad8811cca2c
time="2020-07-07T08:03:55Z" level=info msg="stop engine vol1-e-89a70feb monitoring because the engine is no longer running on node shuo-longhorn-cluster-1-worker-1"
......
time="2020-07-07T08:03:56Z" level=debug msg="Prepare to send signal to the channel in stop for engine vol1-e-89a70feb"
md5-f8d0306ba5cb53ca6b3a3ae1102fd690
E0707 08:04:49.921708 1 engine_controller.go:614] failed to update engine vol1-e-89a70feb to stop monitoring in stopMonitoring func: failed to reset engine status for vol1-e-89a70feb: Operation cannot be fulfilled on engines.longhorn.io "vol1-e-89a70feb": the object has been modified; please apply your changes to the latest version and try again
......
time="2020-07-07T08:04:52Z" level=debug msg="Call stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:04:52Z" level=debug msg="Prepare to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"
[Updated v3]The solution for the deadlock:
wait.Until, we don't need to use the stop channel inside the engine monitor goroutine. Instead, the sync loop can return a boolean to indicate if the engine monitor needs to be stopped. (Using channel between callee (engine monitor sync loop fund) and the caller (wait.Until) is weird.)@shuo-wu great analysis :)
Steps for Scenario 1:
1. Launch a Longhorn system.
2. Disable the setting `auto-salvage`.
3. Create and attach a volume.
4. Disconnect the node that volume attaches to for 100 seconds.
5. Wait for the node back.
6. All replicas become failed and the volume will become `Faulted`.
This is might be expected? What caused the issue?
Steps for Scenario 1: 1. Launch a Longhorn system. 2. Disable the setting `auto-salvage`. 3. Create and attach a volume. 4. Disconnect the node that volume attaches to for 100 seconds. 5. Wait for the node back. 6. All replicas become failed and the volume will become `Faulted`.This is might be expected? What caused the issue?
This is the observation. The expectation for the result of step 6 is that the replicas on other nodes should still work fine without the auto salvage if there is no data writing during the disconnection.
Let's consider a corner case to see why the salvage is unreasonable:
ERR and the volume will be reattached only after the salvage. If the auto salvage is disabled by users, the volume will get stuck in Faulted state.ERR during the reattachment.[x] Does the PR include the explanation for the fix or the feature?
[x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
The PR is at
https://github.com/longhorn/longhorn-manager/pull/617
https://github.com/longhorn/longhorn-manager/pull/619
[x] Is the reproduce steps/test steps documented?
[x] Which areas/issues this PR might have potential impacts on?
Area: node down cases
Issues: attachment/detachment stuck
[x] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at
[x] If labeled: area/ui Has the UI issue filed or ready to be merged?
The UI issue/PR is at
[x] if labeled: require/doc Has the necessary document PR submitted or merged?
The Doc issue/PR is at
[x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
The automation skeleton PR is at
The automation test case PR is at
[x] if labeled: require/automation-engine Has the engine integration test been merged?
The engine automation PR is at
[x] if labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan is at
https://github.com/longhorn/longhorn-tests/wiki/Manual-Test-Plan#node-disconnection-test
Verified on Master - 07-10-2020
Validation - Passed
Scenarios:
Most helpful comment
Scenario 2 is caused by the unexpected deadlock in the engine controller after the node is back:
https://github.com/longhorn/longhorn-manager/blob/3879d122665f08841979516bfeb5ada6bcdaec14/controller/engine_controller.go#L313
Here is how the deadlock gets triggered:
vol1-e-89a70febbecomesUnknowndue to the network disconnection.[Updated v3]The solution for the deadlock:
wait.Until, we don't need to use the stop channel inside the engine monitor goroutine. Instead, the sync loop can return a boolean to indicate if the engine monitor needs to be stopped. (Using channel between callee (engine monitor sync loop fund) and the caller (wait.Until) is weird.)