Longhorn: [FEATURE] Improve node failure handling

Created on 13 Mar 2020 · 17 Comments · Source: longhorn/longhorn

Current status: ~~https://github.com/longhorn/longhorn/blob/master/docs/node-failure.md~~ moved to https://longhorn.io/docs/0.8.0/users-guide/node-failure/

We require users to manually remove the workload in order to get the volume reattached. That is not ideal, even given the limitations of Kubernetes.

We need to figure out a better way to make the workload recovery automatic.

area/kubernetes area/manager enhancement highlight priority/1 require/LEP require/doc require/manual-test-plan
yasker

All 17 comments

Just for information: I read the problem description and tried it out by deleting the VolumeAttachment:

$ kubectl delete volumeattachment csi-08d9842e.......

After this, the volume can be attached to the new pod and the pod starts. However, the old pod is still listed in 'Terminating' status.

Maybe there is a way to delete the VolumeAttachment automatically after a while in such a scenario?
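
The manual workaround above could be scripted roughly as follows. This is only a sketch: the helper name `attachments_on_node` is hypothetical, the node name `node1` is an assumption, and the `kubectl` pipeline is shown in comments because it needs a live cluster.

```shell
#!/bin/sh
# attachments_on_node NODE
# Reads "NAME NODE" rows on stdin and prints the names of the
# VolumeAttachments that are bound to NODE.
# (Hypothetical helper for illustration, not part of Longhorn or kubectl.)
attachments_on_node() {
    awk -v node="$1" '$2 == node { print $1 }'
}

# Assumed usage against a live cluster, where node1 is the failed node:
#   kubectl get volumeattachment --no-headers \
#       -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#     | attachments_on_node node1 \
#     | while read -r name; do kubectl delete volumeattachment "$name"; done
```

Deleting a VolumeAttachment is only safe when the node is really down, since nothing else guarantees the old pod has stopped writing to the volume.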

rsoika on 13 Apr 2020

Thanks @rsoika . If deleting the volumeattachment works in this scenario, we can do that. It's better than deleting a pod from Longhorn. We will investigate.

yasker on 13 Apr 2020

I have now reproduced the same behavior with a Ceph cluster (Octopus release). I think this is a general Kubernetes problem and not a Longhorn problem.

I'll describe the problem once more here for documentation, since the doc link is no longer working.

Environment:

Kubernetes Cluster 1.18.1.

3 worker nodes (node1,node2,node3)

How to reproduce:

A pod (e.g. a PostgreSQL database) is deployed on node1. The pod defines a PersistentVolumeClaim for storage (hosted on Longhorn)

Hard shutdown of node1

Kubernetes (after 5 minutes) recognizes the lost node and starts rescheduling the pod to node2

The pod on node2 does not start because the volume is still attached to the pod on node1, which is stuck in Terminating status

Solution: manually delete the VolumeAttachment or kill the terminating pod.
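
To spot the stuck pods from the reproduction above, one option is to look for pods on the failed node that already carry a deletion timestamp. A minimal sketch, assuming the failed node is named `node1`; the helper `terminating_pods_on_node` is hypothetical and the `kubectl` call is left in comments because it needs a live cluster.

```shell
#!/bin/sh
# terminating_pods_on_node NODE
# Reads "NAMESPACE NAME NODE DELETION_TIMESTAMP" rows on stdin and prints
# "NAMESPACE/NAME" for every pod on NODE whose deletion timestamp is set,
# i.e. pods marked for deletion that have not yet been removed.
# kubectl prints "<none>" in a custom column when the field is absent.
# (Hypothetical helper for illustration.)
terminating_pods_on_node() {
    awk -v node="$1" '$3 == node && $4 != "<none>" { print $1 "/" $2 }'
}

# Assumed usage against a live cluster:
#   kubectl get pods --all-namespaces --no-headers \
#       -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName,DELETED:.metadata.deletionTimestamp \
#     | terminating_pods_on_node node1
```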

rsoika on 17 Apr 2020

Thanks for the update @rsoika . I've updated the doc link (since we've moved our docs to https://longhorn.io/docs/).

yasker on 17 Apr 2020

Btw, we're totally aiming to solve this within Longhorn's framework if Kubernetes cannot solve it by itself.

yasker on 17 Apr 2020

👍2

It looks like this issue has been under discussion for nearly 2 years:

https://github.com/kubernetes/kubernetes/issues/65392
https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/issues/181

I am not a Kubernetes expert, just a 'normal' user who doesn't understand enough to give further help. But there still seems to be no solution.

rsoika on 17 Apr 2020

As you may have seen, I have continued the discussion about the 'node-fencing' topic in the kubernetes-sigs project. It seems that there will be no solution in Kubernetes in the near future. If you find a way to address this in Longhorn, I think this would be a unique feature compared to other solutions.

rsoika on 28 Apr 2020

👍1

@rsoika We've decided to add support for auto-detaching the volume from a failed Deployment pod in the next release, which will be tracked at https://github.com/longhorn/longhorn/issues/820 , since this scenario is relatively straightforward and doesn't require extra permissions from Kubernetes.

We will start investigating other options after the GA.

yasker on 4 May 2020

👍1

We also want to implement automatic recovery for StatefulSets.

yasker on 25 Jul 2020

Pre-merged Checklist

[x] Does the PR include the explanation for the fix or the feature?

[x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
The PR is at https://github.com/longhorn/longhorn-manager/pull/640

[x] Are the reproduce steps/test steps documented?

[x] Which areas/issues this PR might have potential impacts on?
Area restore/recovery. Might have conflicts in a large cluster (> 30 nodes)

[x] If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at

[x] If labeled: area/ui Has the UI issue filed or ready to be merged?
The UI issue/PR is at

[x] if labeled: require/doc Has the necessary document PR submitted or merged?
The Doc issue/PR is at https://github.com/longhorn/website/pull/183

[x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
The automation skeleton PR is at
The automation test case PR is at

[x] if labeled: require/automation-engine Has the engine integration test been merged?
The engine automation PR is at

[x] if labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan PR is at https://github.com/longhorn/longhorn-tests/pull/395

longhorn-io-github-bot on 4 Aug 2020

The fix for StatefulSet
To reproduce

Set up a cluster of 3 nodes.

Install Longhorn and set Default Replica Count = 2

Create a StatefulSet. Example:

kubectl create -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/statefulset.yaml

Find the node that hosts one pod of the StatefulSet. Power off the node.

After about 5 minutes, kubectl shows that the pod is stuck in Terminating state forever.

PhanLe1010 on 5 Aug 2020

What are the validation steps?

yasker on 7 Aug 2020

It is in the manual test plan here https://github.com/longhorn/longhorn-tests/wiki/Manual-Test-Plan#improve-node-failure-handling

I will modify it and create a PR to move it to longhorn-test repo after we agree on the discussion https://github.com/longhorn/longhorn-manager/pull/640#issuecomment-670236555

PhanLe1010 on 7 Aug 2020

I updated manual test plan at the PR https://github.com/longhorn/longhorn-tests/pull/395

PhanLe1010 on 18 Aug 2020

This issue has been in Review for 21 days. What's blocking this issue moving to Merged?

yasker on 25 Aug 2020

I think this one should be good to merge. I just re-requested reviews from Joshua and Shuo.

PhanLe1010 on 25 Aug 2020

Verified on Longhorn master - 08-27-2020

Validation - Pass

Scenarios Tested:

The default value for the setting Pod Deletion Policy When Node is Down is do-nothing.

The setting doesn't accept an empty value.

The setting accepts do-nothing, delete-statefulset-pod, delete-deployment-pod and delete-both-statefulset-and-deployment-pod. Any other input throws an error:

level=error msg="Error in request: fail to set settings with invalid node-down-pod-deletion-policy: value dummy of settings node-down-pod-deletion-policy is invalid: value dummy is not a valid choice, available choices [do-nothing delete-statefulset-pod delete-deployment-pod delete-both-statefulset-and-deployment-pod]"
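
The accepted values can be illustrated with a small sketch. This is not Longhorn's validation code: both function names are hypothetical, and the mapping from policy to workload kind is an assumption inferred only from the option names listed above.

```shell
#!/bin/sh
# is_valid_policy VALUE
# Returns 0 if VALUE is one of the four choices named in the error message,
# 1 otherwise (including the empty string). Hypothetical helper.
is_valid_policy() {
    case "$1" in
        do-nothing|delete-statefulset-pod|delete-deployment-pod|delete-both-statefulset-and-deployment-pod)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# should_delete_pod POLICY KIND
# Returns 0 if a pod of workload KIND (StatefulSet or Deployment) would be
# deleted under POLICY. Assumed semantics, inferred from the option names.
should_delete_pod() {
    case "$1:$2" in
        delete-statefulset-pod:StatefulSet) return 0 ;;
        delete-deployment-pod:Deployment) return 0 ;;
        delete-both-statefulset-and-deployment-pod:StatefulSet) return 0 ;;
        delete-both-statefulset-and-deployment-pod:Deployment) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_valid_policy dummy` fails, which corresponds to the rejected value in the log line above, and `should_delete_pod do-nothing StatefulSet` fails because the default policy leaves every pod alone.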

The test cases from https://longhorn.github.io/longhorn-tests/manual/pre-release/node/improve-node-failure-handling/

Pods with each of the scaling/upgrade policies "Start upgraded pods only when old ones are manually deleted", "Rolling: stop old pods, then start new" and "Kill ALL pods, then start new". With all of these policies, pod behavior matches what is described at https://longhorn.github.io/longhorn-tests/manual/pre-release/node/improve-node-failure-handling/

No impact on pods that don't have a Longhorn volume attached.

khushboo-rancher on 28 Aug 2020
