Longhorn: [BUG]Sometimes unexpected behaviors after node is recovered from the disconnection

Created on 2 Jul 2020 · 7Comments · Source: longhorn/longhorn

Describe the bug
When the node is back from the network disconnections within a short period(5~6mins), all pods on the node won't be evicted. Then some unexpected issues will be triggered.
Scenario 1 (reproduce/always):
The attached volume will become faulted, then get salvaged and reattached if the setting auto-salvage is enabled.
Scenario 2 (reproduce/often):
The volume engines on the recovered node won't be updated. This means the volumes on the node cannot be attached/detached/deleted successfully and get stuck there.

To Reproduce
Steps for Scenario 1:

Launch a Longhorn system.
Disable the setting auto-salvage.
Create and attach a volume.
Disconnect the node that volume attaches to for 100 seconds.
Wait for the node back.
All replicas become failed and the volume will become Faulted.

Steps for Scenario 2:

Launch a Longhorn system.
Create a volume.
Attach the volume then disconnect the node that volume attaches to for 100 seconds.
Wait for the node back and the volume update.
Repeat step 2~4 until the volume gets stuck in Attached or Attaching state rather than becoming Healthy.
Create and attach more volumes to the recovered node. All volumes get stuck in Attaching state.
(If the result in step 5 is not shown after multiple retries (3 times for example), maybe you can remove the current Longhorn and redo the reproducing steps from the very beginning. I haven't found a way to always reproduce it)

Expected behavior
For Scenario 1:
The volume should be detached and reattached automatically and the replicas on the running nodes should be fine.
For Scenario 2:
The volumes on the recovered node should work fine.

Environment:

Longhorn version: v1.0.0
Kubernetes version: v1.17
Node OS type and version: Ubuntu 18.04

Additional context

The script used to disconnect a node:
network_down.sh.zip

When the node is disconnected, the related engine manager will wrongly mark all replicas for all running engine processes as ERR due to the Ping-timeout error. After the engine manager pod is back, longhorn managers will get those wrong engine processes status then lead to the volumes becoming Faulted.
Scenario 2 may be the cause of the unexpected validation failures mentioned in #1327, #1536, #1451

aremanager bug priorit1 reproducrare requirmanual-test-plan

Source

shuo-wu

Most helpful comment

Scenario 2 is caused by the unexpected deadlock in the engine controller after the node is back:
https://github.com/longhorn/longhorn-manager/blob/3879d122665f08841979516bfeb5ada6bcdaec14/controller/engine_controller.go#L313

Here is how the deadlock gets triggered:

The engine object vol1-e-89a70feb becomes Unknown due to the network disconnection.

time="2020-07-07T08:03:50Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"shuo-longhorn-cluster-1-worker-1\", UID:\"74ffe5f5-bdbe-43cf-9ee1-042d11c9253f\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"408815\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node shuo-longhorn-cluster-1-worker-1 not ready: NodeStatusUnknown"

time="2020-07-07T08:03:50Z" level=debug msg="Instance handler updated instance vol1-e-89a70feb state, old state running, new state unknown"



md5-d2c4f1e2baaca332e7e8aa4234a80f92



time="2020-07-07T08:03:50Z" level=debug msg="Call stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Prepare to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Succeed to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Complete stopMonitoring for engine vol1-e-89a70feb"



md5-010eaf310489763d368b77019708fd56



time="2020-07-07T08:03:55Z" level=debug msg="Engine monitor vol1-e-89a70feb starts a loop"



md5-066db99ed5fd9b3e65d20ad8811cca2c



time="2020-07-07T08:03:55Z" level=info msg="stop engine vol1-e-89a70feb monitoring because the engine is no longer running on node shuo-longhorn-cluster-1-worker-1"
......
time="2020-07-07T08:03:56Z" level=debug msg="Prepare to send signal to the channel in stop for engine vol1-e-89a70feb"



md5-f8d0306ba5cb53ca6b3a3ae1102fd690



E0707 08:04:49.921708       1 engine_controller.go:614] failed to update engine vol1-e-89a70feb to stop monitoring in stopMonitoring func: failed to reset engine status for vol1-e-89a70feb: Operation cannot be fulfilled on engines.longhorn.io "vol1-e-89a70feb": the object has been modified; please apply your changes to the latest version and try again
......
time="2020-07-07T08:04:52Z" level=debug msg="Call stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:04:52Z" level=debug msg="Prepare to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"

Other engine controller worker goroutines will be blocked by the unreleased lock when handling other engine objects. As a result, all engines belong to this node cannot be updated.

[Updated v3]The solution for the deadlock:

By removing the function wait.Until, we don't need to use the stop channel inside the engine monitor goroutine. Instead, the sync loop can return a boolean to indicate if the engine monitor needs to be stopped. (Using channel between callee (engine monitor sync loop fund) and the caller (wait.Until) is weird.)
Since there is only one goroutine using the channel, it's fine to directly close the channel after the signaling.

shuo-wu on 6 Jul 2020

👍2

All 7 comments

For Scenario 1, we can directly delete the related pod if the instance manager object becomes Unknown. Then longhorn manager pods won't accept wrong engine/replica info from the instance manager pod on the back node.

shuo-wu on 6 Jul 2020

Here is how the deadlock gets triggered:

The engine object vol1-e-89a70feb becomes Unknown due to the network disconnection.

time="2020-07-07T08:03:50Z" level=info msg="Event(v1.ObjectReference{Kind:\"Node\", Namespace:\"longhorn-system\", Name:\"shuo-longhorn-cluster-1-worker-1\", UID:\"74ffe5f5-bdbe-43cf-9ee1-042d11c9253f\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"408815\", FieldPath:\"\"}): type: 'Warning' reason: 'Ready' Kubernetes node shuo-longhorn-cluster-1-worker-1 not ready: NodeStatusUnknown"

time="2020-07-07T08:03:50Z" level=debug msg="Instance handler updated instance vol1-e-89a70feb state, old state running, new state unknown"



md5-d2c4f1e2baaca332e7e8aa4234a80f92



time="2020-07-07T08:03:50Z" level=debug msg="Call stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Prepare to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Succeed to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:03:50Z" level=debug msg="Complete stopMonitoring for engine vol1-e-89a70feb"



md5-010eaf310489763d368b77019708fd56



time="2020-07-07T08:03:55Z" level=debug msg="Engine monitor vol1-e-89a70feb starts a loop"



md5-066db99ed5fd9b3e65d20ad8811cca2c



time="2020-07-07T08:03:55Z" level=info msg="stop engine vol1-e-89a70feb monitoring because the engine is no longer running on node shuo-longhorn-cluster-1-worker-1"
......
time="2020-07-07T08:03:56Z" level=debug msg="Prepare to send signal to the channel in stop for engine vol1-e-89a70feb"



md5-f8d0306ba5cb53ca6b3a3ae1102fd690



E0707 08:04:49.921708       1 engine_controller.go:614] failed to update engine vol1-e-89a70feb to stop monitoring in stopMonitoring func: failed to reset engine status for vol1-e-89a70feb: Operation cannot be fulfilled on engines.longhorn.io "vol1-e-89a70feb": the object has been modified; please apply your changes to the latest version and try again
......
time="2020-07-07T08:04:52Z" level=debug msg="Call stopMonitoring for engine vol1-e-89a70feb"
time="2020-07-07T08:04:52Z" level=debug msg="Prepare to send signal to the channel in stopMonitoring for engine vol1-e-89a70feb"

Other engine controller worker goroutines will be blocked by the unreleased lock when handling other engine objects. As a result, all engines belong to this node cannot be updated.

[Updated v3]The solution for the deadlock:

By removing the function wait.Until, we don't need to use the stop channel inside the engine monitor goroutine. Instead, the sync loop can return a boolean to indicate if the engine monitor needs to be stopped. (Using channel between callee (engine monitor sync loop fund) and the caller (wait.Until) is weird.)
Since there is only one goroutine using the channel, it's fine to directly close the channel after the signaling.

shuo-wu on 6 Jul 2020

👍2

@shuo-wu great analysis :)

joshimoo on 8 Jul 2020

Steps for Scenario 1:

1. Launch a Longhorn system.
2. Disable the setting `auto-salvage`.
3. Create and attach a volume.
4. Disconnect the node that volume attaches to for 100 seconds.
5. Wait for the node back.
6. All replicas become failed and the volume will become `Faulted`.

This is might be expected? What caused the issue?

yasker on 8 Jul 2020

Steps for Scenario 1:

1. Launch a Longhorn system.
2. Disable the setting `auto-salvage`.
3. Create and attach a volume.
4. Disconnect the node that volume attaches to for 100 seconds.
5. Wait for the node back.
6. All replicas become failed and the volume will become `Faulted`.

This is might be expected? What caused the issue?

This is the observation. The expectation for the result of step 6 is that the replicas on other nodes should still work fine without the auto salvage if there is no data writing during the disconnection.
Let's consider a corner case to see why the salvage is unreasonable:

When data is being written to a volume, the related node is disconnected.
Since there is tiny inconsistency during the data write, some replicas complete the current data write and others are not. The replicas succeed to write data should be still the available replicas after the node reconnection.
But the current result is that all replicas will be blindly marked as ERR and the volume will be reattached only after the salvage. If the auto salvage is disabled by users, the volume will get stuck in Faulted state.
I think here is the correct behavior: the engine will be reattached automatically without salvage, and only the replicas not containing the lastest data will be marked as ERR during the reattachment.

shuo-wu on 9 Jul 2020

👍1

Pre-merged Checklist

[x] Does the PR include the explanation for the fix or the feature?
[x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
The PR is at
https://github.com/longhorn/longhorn-manager/pull/617
https://github.com/longhorn/longhorn-manager/pull/619
[x] Is the reproduce steps/test steps documented?
[x] Which areas/issues this PR might have potential impacts on?
Area: node down cases
Issues: attachment/detachment stuck
[x] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at
[x] If labeled: area/ui Has the UI issue filed or ready to be merged?
The UI issue/PR is at
[x] if labeled: require/doc Has the necessary document PR submitted or merged?
The Doc issue/PR is at
[x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
The automation skeleton PR is at
The automation test case PR is at
[x] if labeled: require/automation-engine Has the engine integration test been merged?
The engine automation PR is at
[x] if labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan is at
https://github.com/longhorn/longhorn-tests/wiki/Manual-Test-Plan#node-disconnection-test

shuo-wu on 9 Jul 2020

Verified on Master - 07-10-2020

Validation - Passed

Scenarios:

Disconnect the node for 100 secs while restore is in progress. - Restoring volume got connected to other healthy node and completed the restore.
Disconnect the node for 5 secs while restore is in progress. - Restoring volume got connected to other healthy node and completed the restore.
Reboot the node while restore is in progress. - Restoring volume got connected to other healthy node and completed the restore.
Poweroff the node while restore is in progress. - Restoring volume got connected to other healthy node and completed the restore.
Scenario 2 from https://github.com/longhorn/longhorn/issues/1545#issue-649882585 - Volume didn't stuck in attaching state.