Current status: https://github.com/longhorn/longhorn/blob/master/docs/node-failure.md has moved to https://longhorn.io/docs/0.8.0/users-guide/node-failure/
We require users to manually remove the workload in order to get the volume reattached. That's not ideal, even given the limitations of Kubernetes.
We need to figure out a better way to make the workload recovery automatic.
Just for information: I read this problem description and tried it out.
Deleting the volumeattachment:
$ kubectl delete volumeattachment csi-08d9842e.......
This leads to a situation where the volume can be attached to the new pod and the pod starts. However, the old pod is still listed in 'Terminating' status.
Maybe there is a way to delete the volumeattachment automatically after a while in such a scenario?
Thanks @rsoika . If deleting the volumeattachment works in this scenario, we can do that. It's better than deleting a pod from Longhorn. We will investigate.
I have now reproduced the same behavior with a Ceph cluster (Octopus release). I think this is a general Kubernetes problem and not a Longhorn problem.
I'll describe the problem once more here for documentation, since the doc link above no longer works.
Environment:
How to reproduce:
Solution: manually delete the volumeattachment or kill the terminating pod.
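For anyone who wants to script the manual workaround above, here is a minimal sketch, not anything Longhorn ships. It assumes you already know the name of the failed node and have confirmed that the node is really down; as far as I understand, removing the attachment while the node can still write to the volume is unsafe. The function names are made up for illustration; only standard `kubectl get` and `kubectl delete` flags are used.

```shell
# Print the names of all VolumeAttachment objects whose spec.nodeName is $1.
attachments_on_node() {
  kubectl get volumeattachment --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName |
    awk -v node="$1" '$2 == node { print $1 }'
}

# Delete every VolumeAttachment bound to node $1.
# Assumption: the caller has verified that the node is down.
delete_attachments_on_node() {
  attachments_on_node "$1" | while read -r name; do
    kubectl delete volumeattachment "$name"
  done
}

# Alternative from the thread: force-remove the pod stuck in Terminating.
# $1 is the pod name, $2 its namespace.
force_delete_pod() {
  kubectl delete pod "$1" --namespace "$2" --grace-period=0 --force
}
```

Calling `delete_attachments_on_node` with the failed node's name removes every attachment on that node, so it is worth running `attachments_on_node` first and checking the list.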
Thanks for the update @rsoika . I've updated the doc link (since we've moved our docs to https://longhorn.io/docs/).
Btw, we're definitely aiming to solve this within Longhorn's framework if Kubernetes cannot solve it by itself.
It looks like this issue has been under discussion for nearly two years:
https://github.com/kubernetes/kubernetes/issues/65392
https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/issues/181
I am not a Kubernetes expert, just a 'normal' user who doesn't understand enough to give further help. But there still seems to be no solution.
As you may have seen, I have continued the discussion about the 'node-fencing' topic in the kubernetes-sigs project. It seems that there will be no solution in Kubernetes in the near future. If you find a way to address this in Longhorn, I think it would be a unique feature compared to other solutions.
@rsoika We've decided to add support for automatically detaching the volume from a failed Deployment pod in the next release, which will be tracked at https://github.com/longhorn/longhorn/issues/820 , since this scenario is relatively straightforward and doesn't require extra permissions from Kubernetes.
We will start investigating other options after the GA.
We also want to implement automatic recovery for StatefulSets.
[x] Does the PR include the explanation for the fix or the feature?
[x] Is the backend code merged (Manager, Engine, Instance Manager, BackupStore etc)?
The PR is at https://github.com/longhorn/longhorn-manager/pull/640
[x] Are the reproduce steps/test steps documented?
[x] Which areas/issues this PR might have potential impacts on?
Area restore/recovery. Might have conflicts in a large cluster (> 30 nodes)
[x] If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
The compatibility issue is filed at
[x] If labeled: area/ui Has the UI issue been filed, or is it ready to be merged?
The UI issue/PR is at
[x] If labeled: require/doc Has the necessary document PR been submitted or merged?
The Doc issue/PR is at https://github.com/longhorn/website/pull/183
[x] If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
The automation skeleton PR is at
The automation test case PR is at
[x] if labeled: require/automation-engine Has the engine integration test been merged?
The engine automation PR is at
[x] if labeled: require/manual-test-plan Has the manual test plan been documented?
The updated manual test plan PR is at https://github.com/longhorn/longhorn-tests/pull/395
The fix for StatefulSet
To reproduce
Default Replica Count = 2
$ kubectl create -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/statefulset.yaml
kubectl shows that the pod is stuck in the Terminating state forever.
What are the validation steps?
It is in the manual test plan here https://github.com/longhorn/longhorn-tests/wiki/Manual-Test-Plan#improve-node-failure-handling
I will modify it and create a PR to move it to the longhorn-tests repo once we reach agreement in the discussion at https://github.com/longhorn/longhorn-manager/pull/640#issuecomment-670236555
I updated the manual test plan in the PR https://github.com/longhorn/longhorn-tests/pull/395
This issue has been in Review for 21 days. What's blocking this issue moving to Merged?
I think this one should be good to merge. I just re-requested reviews from Joshua and Shuo.
Verified on Longhorn master - 08-27-2020
Validation - Pass
Scenarios Tested:
Pod Deletion Policy When Node is Down is do-nothing.
The accepted values are do-nothing, delete-statefulset-pod, delete-deployment-pod and delete-both-statefulset-and-deployment-pod; anything other than these inputs throws an error:
level=error msg="Error in request: fail to set settings with invalid node-down-pod-deletion-policy: value dummy of settings node-down-pod-deletion-policy is invalid: value dummy is not a valid choice, available choices [do-nothing delete-statefulset-pod delete-deployment-pod delete-both-statefulset-and-deployment-pod]"
Scaling/upgrade policy set to "Start upgraded pods only when old ones are manually deleted", "Rolling: stop old pods, then start new" and "Kill ALL pods, then start new". With all of these policies, pod behavior matches what is described at https://longhorn.github.io/longhorn-tests/manual/pre-release/node/improve-node-failure-handling/
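If you set node-down-pod-deletion-policy from a script, a small pre-check against the four choices listed in the error message above can catch typos before the request is sent. This is only a sketch: the function name is hypothetical, the check runs purely on the client side, and it is not how Longhorn itself validates the setting.

```shell
# Hypothetical helper, not Longhorn code: succeeds only for the four
# choices that appear in the "available choices" list of the error message.
is_valid_node_down_policy() {
  case "$1" in
    do-nothing|delete-statefulset-pod|delete-deployment-pod|delete-both-statefulset-and-deployment-pod)
      return 0 ;;
    *)
      # Report the rejected value on stderr and signal failure.
      echo "value $1 is not a valid choice" >&2
      return 1 ;;
  esac
}
```

For example, `is_valid_node_down_policy delete-deployment-pod` returns success, while `is_valid_node_down_policy dummy` prints a message to stderr and returns a non-zero status.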