Longhorn: how to recover from a node crash

Created on 6 Dec 2018 · 4 Comments · Source: longhorn/longhorn

I have the following setup:

  • node1: all roles
  • node2: all roles
  • node3: etcd+control

I installed Rancher and tested it with a StatefulSet and a Deployment plus volume on node2. Rancher created volumes that were mounted on node2 and had replicas on node1.
After node2 was shut down, the deployments tried to recreate themselves, but without success. Whenever I try to start a deployment with the volumes, it is stuck in ContainerCreating (without the volumes it works fine).

On the Nodes tab it shows node2 as down:

[screenshot]

On the Volume tab it shows the volumes as "Healthy":

[screenshot]

In the volume view it shows all replicas as running:

[screenshot]

My question is: what is the expected behaviour, and can I fix the volumes manually?

Most helpful comment

@XzenTorXz It's a bug. We are aware of it, see https://github.com/rancher/longhorn/issues/199, and Kubernetes SIG Storage is also aware of some unreliability in migrating workloads with storage. We're working on a fix, which will also involve some upstream Kubernetes work.

In the meantime, if you didn't hit the Kubernetes bug in https://github.com/rancher/longhorn/issues/199, you can stop the workload and manually detach the volume. Starting the workload again should then allow it to continue. If you did hit that bug, you need to delete the csi-attacher-0 pod manually so that its StatefulSet recreates it, which restores Kubernetes' ability to attach the volume.
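As a rough sketch of the workaround steps above, the Python helpers below only build the `kubectl` command lines and do not run anything. The workload name `my-app`, the `default` namespace, and the `longhorn-system` namespace for csi-attacher-0 are assumptions for illustration; adjust them to your cluster. Detaching the volume stays a manual step (for example from the Longhorn UI) and is not covered here.

```python
import shlex


def stop_workload(kind, name, namespace):
    """Command to scale a workload to zero so nothing uses the volume."""
    return ["kubectl", "scale", f"{kind}/{name}", "--replicas=0", "-n", namespace]


def start_workload(kind, name, namespace, replicas=1):
    """Command to scale the workload back up after the volume is detached."""
    return ["kubectl", "scale", f"{kind}/{name}", f"--replicas={replicas}", "-n", namespace]


def delete_csi_attacher(namespace="longhorn-system"):
    """Command to delete the csi-attacher-0 pod so its StatefulSet recreates it.

    The namespace is an assumption; use the one Longhorn runs in.
    """
    return ["kubectl", "delete", "pod", "csi-attacher-0", "-n", namespace]


if __name__ == "__main__":
    # Hypothetical workload: a Deployment called "my-app" in "default".
    steps = [
        stop_workload("deployment", "my-app", "default"),
        # Manual step here: detach the volume before starting the workload again.
        start_workload("deployment", "my-app", "default"),
    ]
    for argv in steps:
        print(shlex.join(argv))
    # Only needed if you hit the bug referenced above:
    print(shlex.join(delete_csi_attacher()))
```

Printing the commands first, rather than executing them, makes it easy to review each step before running it by hand against a degraded cluster.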

The error code you see is a symptom of what Kubernetes got wrong. The Kubernetes LB is supposed to skip the node that's down, but it still sometimes routes requests to it, hence the status code 502. We have some ideas on how to mitigate this, and they will be part of the fix for the issue.

All 4 comments

Also, the volume view shows the error

Request failed with status code 502

from time to time; I'm not sure if this is related.

I had to delete node2 in Kubernetes, along with all replicas of the volumes hosted on node2; then I could delete node2 in Longhorn. After that, it reattached everything just fine.
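The order of those steps can be written down as a small checklist. This is only a sketch of the sequence described in the comment above: the one `kubectl` command is built but not executed, and the two Longhorn steps are left as manual actions because the comment does not say how they were performed.

```python
def node_removal_plan(node):
    """Ordered (description, command) steps; command is None for manual steps."""
    return [
        (f"Delete {node} in Kubernetes", ["kubectl", "delete", "node", node]),
        (f"Delete all volume replicas hosted on {node} in Longhorn", None),
        (f"Delete {node} in Longhorn", None),
    ]


if __name__ == "__main__":
    for description, command in node_removal_plan("node2"):
        suffix = " ".join(command) if command else "(manual step)"
        print(f"{description}: {suffix}")
```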
