Longhorn: manager: Failing to set the volume status after all replicas have failed may leave the volume unattachable

Created on 27 May 2019 · 3 Comments · Source: longhorn/longhorn

v0.5.0.

The error message when attaching the volume:

logs/longhorn-manager-k4qqc/longhorn-manager.log:2019-05-27T09:23:06.347103712Z time="2019-05-27T09:23:06Z" level=warning msg="Error syncing Longhorn volume longhorn-system/pvc-499dbb42-78b8-11e9-9f61-000c299a5b45: fail to sync longhorn-system/pvc-499dbb42-78b8-11e9-9f61-000c299a5b45: fail to reconcile volume state for pvc-499dbb42-78b8-11e9-9f61-000c299a5b45: no healthy replica for starting"

In this case it was caused by the following timeout, but it can also have other causes:

logs/longhorn-manager-k4qqc/longhorn-manager.log:2019-05-27T09:23:06.346648747Z E0527 09:23:06.346487       1 volume_controller.go:200] fail to sync longhorn-system/pvc-499dbb42-78b8-11e9-9f61-000c299a5b45: Timeout: request did not complete within requested timeout 30s
logs/longhorn-manager-k4qqc/longhorn-manager.log:2019-05-27T09:23:06.346702385Z time="2019-05-27T09:23:06Z" level=warning msg="Dropping Longhorn volume longhorn-system/pvc-499dbb42-78b8-11e9-9f61-000c299a5b45 out of the queue: fail to sync longhorn-system/pvc-499dbb42-78b8-11e9-9f61-000c299a5b45: Timeout: request did not complete within requested timeout 30s"

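The root cause is that the volume's faulted state can only be set once, at the moment all replicas fail. If that status update is lost (here, because of the timeout), the volume never becomes faulted. We should instead reconcile to the faulted state whenever we detect that all the replicas have failed.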
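
The proposed behavior can be illustrated with a minimal Go sketch. It is not the actual longhorn-manager code: the types `Volume` and `Replica`, the constants, and the function `reconcileRobustness` are hypothetical names made up for illustration. The only point is that the check runs on every sync (level-triggered) rather than once at the moment of failure, so a lost status update gets repaired on the next pass.

```go
package main

import "fmt"

// Robustness is a hypothetical, simplified stand-in for the volume's
// status.robustness field. Only "faulted" is taken from the issue text.
type Robustness string

const (
	RobustnessUnset   Robustness = ""
	RobustnessFaulted Robustness = "faulted"
)

// Replica and Volume are illustrative types, not Longhorn's real structs.
type Replica struct {
	Name   string
	Failed bool
}

type Volume struct {
	Name       string
	Robustness Robustness
	Replicas   []Replica
}

// allReplicasFailed reports whether the volume has at least one replica
// and every one of them has failed.
func allReplicasFailed(v *Volume) bool {
	if len(v.Replicas) == 0 {
		return false
	}
	for _, r := range v.Replicas {
		if !r.Failed {
			return false
		}
	}
	return true
}

// reconcileRobustness is meant to be called on every sync. It derives the
// faulted state from the current replica state instead of relying on a
// one-time transition, and returns true if it changed the volume.
func reconcileRobustness(v *Volume) bool {
	if allReplicasFailed(v) && v.Robustness != RobustnessFaulted {
		v.Robustness = RobustnessFaulted
		return true
	}
	return false
}

func main() {
	// All replicas have failed, but the earlier status update was lost,
	// so robustness is still unset.
	v := &Volume{
		Name: "pvc-example",
		Replicas: []Replica{
			{Name: "r1", Failed: true},
			{Name: "r2", Failed: true},
		},
	}
	fmt.Println(reconcileRobustness(v), v.Robustness) // first sync repairs the state
	fmt.Println(reconcileRobustness(v), v.Robustness) // later syncs are no-ops
}
```

Because the state is recomputed from the replicas each time, clearing the robustness field by hand would also be corrected on the next sync in this sketch.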

area/manager bug

yasker

👍2

All 3 comments

You can use the following command to forcefully mark the volume as faulted, after which it can be salvaged:

kubectl -n longhorn-system patch lhv <volume_name> --type="merge" -p '{"status":{"robustness":"faulted"}}'

<volume_name> is the Longhorn volume name, like pvc-499dbb42-78b8-11e9-9f61-000c299a5b45
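
For anyone scripting this workaround, the merge-patch body passed to `-p` can be generated rather than typed by hand. The following is only a sketch using Go's standard `encoding/json`; the type and function names (`statusPatch`, `robustnessPatch`) are hypothetical, and it only builds the JSON payload, it does not talk to the cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statusPatch mirrors the shape of the merge patch from the kubectl
// command above. The type names are illustrative only.
type statusPatch struct {
	Status robustnessStatus `json:"status"`
}

type robustnessStatus struct {
	Robustness string `json:"robustness"`
}

// robustnessPatch returns the JSON merge-patch body that sets
// status.robustness to the given value.
func robustnessPatch(robustness string) ([]byte, error) {
	return json.Marshal(statusPatch{Status: robustnessStatus{Robustness: robustness}})
}

func main() {
	body, err := robustnessPatch("faulted")
	if err != nil {
		panic(err)
	}
	// The printed string can be passed as the -p argument of kubectl patch.
	fmt.Println(string(body))
}
```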

yasker on 26 Jun 2019

👍3

Steps to reproduce it:

1. Create a volume and attach it to a node, then wait until it is healthy.
2. Delete all related replica pods quickly.
3. Check the volume's health.
4. Clear the robustness field with the following command:
kubectl -n longhorn-system patch lhv <volume-name> --type="merge" -p '{"status":{"robustness":""}}'

Without the fix: the volume is in the unknown state.
With the fix: the volume is still in the faulted state.