Longhorn: [BUG] When a volume is degraded and rebuilt, the filesystem in every mounted pod becomes read-only

Created on 17 Jun 2020  ·  16 Comments  ·  Source: longhorn/longhorn

Describe the bug
When I ran a performance test against MinIO, some Longhorn volumes became degraded and were rebuilt. I stopped the test, and after all volumes had finished rebuilding, every MinIO pod logged errors.

Expected behavior
All MinIO pods should work normally.

Log
[Two log screenshots; images not preserved]

Environment:

All 16 comments

There is one (relatively) common case that leads to the filesystem becoming read-only:

  1. The volume (engine) crashes unexpectedly.
  2. The volume cannot be recovered automatically, or it is recovered automatically (reattached and remounted) but the related workload is not restarted.

Here is the doc. I am not sure whether the error you reported is related to it. You launched a performance test for MinIO, which may lead to node pressure and unexpected eviction of Longhorn workloads.

I have also found that from time to time filesystems in Longhorn get mounted read-only due to node reboots or some other Longhorn problems... it would be cool if Longhorn detected such a problem and fixed it by killing the corresponding pod.

Killing the pod is too aggressive. Maybe Longhorn can record an event as a warning if the case is detected.

A corresponding Prometheus metric would also help a lot

@runningman84 Prometheus metrics are being tracked at #1180 (just realized you raised that issue as well 😄)

As @shuo-wu said, we do try to recover from the read-only situation. But there are limitations so far, as you can see in the doc:

  1. You need to use ext4 as the filesystem inside the pod.
  2. You need to set up a liveness check for the pod in order to get it restarted automatically.

IMO, killing the pod is a bit too aggressive. If you've configured the liveness check, Kubernetes should restart the pod automatically when a read-only mount happens.
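As a rough illustration of the liveness check mentioned above, the following Python sketch reports a failure when the volume's mount point is on a read-only filesystem. The script and the `/data` path are assumptions for illustration, not something Longhorn ships. It relies on `os.statvfs` and the `ST_RDONLY` flag, and is meant to be run inside the container as an `exec` liveness probe, where a non-zero exit code marks the container as unhealthy.

```python
import os
import sys

# Hypothetical mount path of the Longhorn volume inside the container.
DEFAULT_MOUNT_PATH = "/data"


def flags_read_only(f_flag: int) -> bool:
    """Return True if the statvfs flags carry the read-only bit."""
    return bool(f_flag & os.ST_RDONLY)


def is_read_only(path: str) -> bool:
    """Return True if the filesystem containing `path` is read-only."""
    return flags_read_only(os.statvfs(path).f_flag)


def main(argv: list) -> int:
    """Return the exit code for the probe: 0 is healthy, 1 is unhealthy."""
    path = argv[1] if len(argv) > 1 else DEFAULT_MOUNT_PATH
    try:
        read_only = is_read_only(path)
    except OSError as exc:
        # A path that cannot be inspected is treated as unhealthy.
        print(f"cannot stat {path}: {exc}", file=sys.stderr)
        return 1
    if read_only:
        print(f"{path} is read-only", file=sys.stderr)
        return 1
    return 0
```

To use it as a probe command, the script would end with `sys.exit(main(sys.argv))` and be referenced from the pod's `livenessProbe.exec.command`.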

Btw, normally a node reboot shouldn't cause a read-only mount, since the engine runs on the same node as the workload. Both of them will be recreated, and it should be fine after the node is back up.

A read-only mount is more likely to happen when the CPU or network cannot keep up with the speed required by the storage. We added the GuaranteedEngineCPU option in v1.0.0 to help with the CPU starvation issue. See here for details.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I'm also experiencing this issue with v1.0.1. My cluster doesn't seem to have any resource pressure.

After the problem occurs, Longhorn devices seem to be mounted twice, once ro and once rw.

# mount | grep aadc81dc
/dev/longhorn/pvc-aadc81dc-38e6-403a-be00-1f9c1373c122 on /var/lib/kubelet/pods/54e931ec-f854-4b57-bcff-af725eafa932/volumes/kubernetes.io~csi/pvc-aadc81dc-38e6-403a-be00-1f9c1373c122/mount type ext4 (ro,relatime)
/dev/longhorn/pvc-aadc81dc-38e6-403a-be00-1f9c1373c122 on /var/lib/kubelet/pods/54e931ec-f854-4b57-bcff-af725eafa932/volumes/kubernetes.io~csi/pvc-aadc81dc-38e6-403a-be00-1f9c1373c122/mount type ext4 (rw,relatime)
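To spot this kind of stacked mount across a whole node, one option is to parse the `mount` output and report every mount point that appears more than once. The following is a minimal Python sketch under that assumption. The function name is made up for illustration, and the regular expression only covers the `DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)` line format shown above.

```python
import re
from collections import defaultdict

# Matches lines of the form: DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)
MOUNT_LINE = re.compile(
    r"^(?P<dev>\S+) on (?P<mp>.+) type (?P<fs>\S+) \((?P<opts>[^)]*)\)$"
)


def stacked_mounts(mount_output: str) -> dict:
    """Map each mount point listed more than once to its ro/rw modes, in order."""
    modes = defaultdict(list)
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if not match:
            continue  # skip prompts, blank lines, and unknown formats
        options = match.group("opts").split(",")
        modes[match.group("mp")].append("ro" if "ro" in options else "rw")
    return {mp: seen for mp, seen in modes.items() if len(seen) > 1}
```

Feeding it the two lines above yields a single entry whose modes are `["ro", "rw"]`, which matches the situation described: an older read-only mount with a newer read-write mount on the same path.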

I can see some timeouts and some errors:

2020-08-27T03:32:52.223419922Z [pvc-de5f204e-3f2a-41c5-b15b-588c42bbd654-e-4afeeac9] time="2020-08-27T03:32:52Z" level=error msg="Write timeout on replica 10.42.1.33:10001 seq= 29908 size= 4 (kB)"
2020-08-27T03:32:52.223462799Z time="2020-08-27T03:32:52Z" level=error msg="Retry  1 on replica 10.42.1.33:10001 seq= 29908 size= 4 (kB)"
...
2020-08-28T07:02:37.290019768Z tgtd: bs_longhorn_request(91) fail to write at 13022289920 for 28672
2020-08-28T07:02:37.290191901Z tgtd: bs_longhorn_request(150) io error 0x560b35a664e0 2a -14 28672 13022289920, Success
2020-08-28T07:02:37.290199347Z response_process: Receive error for response 3 of seq 599217
2020-08-28T07:02:37.290205772Z response_process: Receive error for response 3 of seq 599209
2020-08-28T07:02:37.290211984Z response_process: Receive error for response 3 of seq 599212
2020-08-28T07:02:37.290218135Z tgtd: bs_longhorn_request(91) fail to write at 12884910080 for 4096
2020-08-28T07:02:37.290224225Z tgtd: bs_longhorn_request(150) io error 0x560b359ae400 2a -14 4096 12884910080, Success
2020-08-28T07:02:37.290230436Z response_process: Receive error for response 3 of seq 599205
2020-08-28T07:02:37.290245026Z tgtd: bs_longhorn_request(91) fail to write at 13019717632 for 4096
2020-08-28T07:02:37.290253088Z response_process: Receive error for response 3 of seq 599211
2020-08-28T07:02:37.290259400Z tgtd: bs_longhorn_request(150) io error 0x560b35a3e7a0 2a -14 4096 13019717632, Success
2020-08-28T07:02:37.290265778Z response_process: Receive error for response 3 of seq 599219
2020-08-28T07:02:37.290272196Z tgtd: bs_longhorn_request(91) fail to write at 12885131264 for 4096
2020-08-28T07:02:37.290278494Z tgtd: bs_longhorn_request(150) io error 0x560b359aec40 2a -14 4096 12885131264, Success
2020-08-28T07:02:37.290284801Z tgtd: bs_longhorn_request(91) fail to write at 12885098496 for 4096
2020-08-28T07:02:37.290290997Z tgtd: bs_longhorn_request(150) io error 0x560b359ae980 2a -14 4096 12885098496, Success
2020-08-28T07:02:37.290297229Z tgtd: bs_longhorn_request(91) fail to write at 12885274624 for 49152
2020-08-28T07:02:37.290303584Z tgtd: bs_longhorn_request(150) io error 0x560b359b7980 2a -14 49152 12885274624, Success
2020-08-28T07:02:37.290309832Z tgtd: bs_longhorn_request(91) fail to write at 13019234304 for 4096
2020-08-28T07:02:37.290316049Z tgtd: bs_longhorn_request(150) io error 0x560b359c26c0 2a -14 4096 13019234304, Success
..
2020-08-28T07:02:44.676124473Z time="2020-08-28T07:02:36Z" level=info msg="Ignore set replica tcp://10.42.1.33:10030 to mode ERR due to it's ERR"
2020-08-28T07:02:44.676136754Z time="2020-08-28T07:02:36Z" level=error msg="Setting replica tcp://10.42.0.167:10045 to ERR due to: r/w timeout"
2020-08-28T07:02:44.676148963Z time="2020-08-28T07:02:36Z" level=info msg="Ignore set replica tcp://10.42.0.167:10045 to mode ERR due to it's ERR"
2020-08-28T07:02:44.676161355Z time="2020-08-28T07:02:36Z" level=error msg="I/O error: tcp://10.42.1.33:10030: r/w timeout; tcp://10.42.0.167:10045: r/w timeout"
..
2020-08-28T07:02:44.677091317Z time="2020-08-28T07:02:36Z" level=error msg="I/O error: No backend available"
2020-08-28T07:02:44.677103559Z time="2020-08-28T07:02:36Z" level=error msg="I/O error: No backend available"
2020-08-28T07:02:44.677115727Z time="2020-08-28T07:02:36Z" level=error msg="I/O error: No backend available"

But in the end it gets synchronized:

2020-08-28T09:20:09.600157775Z [pvc-aadc81dc-38e6-403a-be00-1f9c1373c122-e-d7155fd2] time="2020-08-28T09:20:09Z" level=info msg="Get backend tcp://10.42.0.167:10045 revision counter 4870313"
2020-08-28T09:20:09.600229889Z time="2020-08-28T09:20:09Z" level=info msg="Set revision counter of 10.42.1.33:10030 to : 4870313"
2020-08-28T09:20:09.608348372Z [pvc-aadc81dc-38e6-403a-be00-1f9c1373c122-e-d7155fd2] time="2020-08-28T09:20:09Z" level=info msg="Set backend tcp://10.42.1.33:10030 revision counter to 4870313"
2020-08-28T09:20:09.608410556Z time="2020-08-28T09:20:09Z" level=info msg="WO replica tcp://10.42.1.33:10030's chain verified, update mode to RW"
2020-08-28T09:20:09.610017710Z [pvc-aadc81dc-38e6-403a-be00-1f9c1373c122-e-d7155fd2] time="2020-08-28T09:20:09Z" level=info msg="Set replica tcp://10.42.1.33:10030 to mode RW"

A livenessProbe is not a good solution for me because I already have one for monitoring HTTP. I could create a script to accomplish both things, but it would complicate things. I think Longhorn should be able to recover.
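For reference, a combined script of the kind described here does not have to be large. The Python sketch below composes an HTTP check and a write test into one exit code, in the style of an `exec` probe. It is only an illustration under stated assumptions: the helper names are invented, and any URL or directory passed in is specific to the workload.

```python
import http.client
import os
import tempfile
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; refused
        # connections and timeouts land here as well.
        return False


def fs_writable(directory: str) -> bool:
    """Return True if a temporary file can be created and synced in `directory`."""
    try:
        with tempfile.TemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
            os.fsync(handle.fileno())
        return True
    except OSError:
        # A read-only filesystem fails here with errno EROFS.
        return False


def probe_exit_code(checks) -> int:
    """Return 0 only if every check passes, otherwise 1."""
    return 0 if all(check() for check in checks) else 1
```

A probe script could then end with something like `sys.exit(probe_exit_code([lambda: http_ok("http://127.0.0.1:8080/healthz"), lambda: fs_writable("/data")]))`, where both the URL and the path are placeholders. The trade-off is that the HTTP probe type is replaced by an `exec` probe, so the container image needs a Python interpreter.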

@sergiocharpineljr Yes, a disconnection happened. Since you said there seems to be no resource pressure, it might have been caused by a temporary network outage.

The reason you saw two mounts is that Longhorn tried to recover the volume: when the volume came back, it remounted the volume at the same location, in the hope that the workload would restart and pick up the correct mountpoint again.

Now thinking about it, maybe killing a StatefulSet or Deployment pod would be an easier path to recovering the workload, compared to requiring the user to set a liveness probe and piling up mounts on the disk. cc @shuo-wu

I have the same issue with MinIO. In the read-only case MinIO keeps running, and images go missing in the software that uses the MinIO storage.
MongoDB and MySQL crash instantly.

Are there solutions or ideas already?

Having this issue as well. It always occurs when or shortly after Longhorn components are automatically restarted. Our VM provider shifts VMs on a daily basis, which causes the Longhorn cluster components to lose quorum.
That's why the read-only issue comes up once or twice a week.

We are already looking into other storage providers since this is a big issue for us.

@derzufall Hi, it's a big issue for us as well. Which storage provider do you use? And which others are you checking out?

In the v1.1.0 release we've switched to deleting the pod, which Kubernetes will then recreate automatically (#1719). The current approach of using the liveness probe indeed has quite a few shortcomings.

Btw, @derzufall there is no quorum in Longhorn. As long as there is one healthy replica, Longhorn will work. By shifting VMs, do you mean the VMs are live-migrated from one node to another? If so, yes, that would cause a network outage and Longhorn would likely be disconnected, but it is designed to recover automatically afterwards. To recover gracefully in v1.0.x, though, a liveness probe that restarts the pod automatically is needed.

Hey @yasker, thank you for clearing that up! Yes, I meant the VMs are moved between nodes. Then the CSI workloads of Longhorn might lose their leader, and the CSI pod crashes. After that we occasionally see the read-only problem. About once a week, some of the workloads see a read-only filesystem. This is an operational nightmare...

A liveness probe doesn't seem to cover this problem, though. Our GitLab instance dies if the filesystem becomes read-only. Kubernetes restarts it instantly, but the read-only issue persists. We have to scale the workload down to 0 and then back up again to fix this.

Hoping 1.1.0 will make sure that the system recovers automatically.

PS: we use ext4 filesystems.

Let's say the new Longhorn version recovers by scaling down and up automatically and the pod comes back after a couple of seconds. Can we say it's stable enough to deploy a production service? Would we still have downtime x times a day (in my case Nextcloud, GitLab, etc. crash each day)?

@derzufall There are a few features in the upcoming v1.1.0 that should help in your live migration case:

  1. The data locality feature can always keep a replica local to the workload, so even after a live migration the workload should still have a local replica and can continue to function through a small network outage.
  2. The existing replica rebuild feature should pick up the remaining replicas (which might be considered errored out) and quickly rebuild them.
  3. Finally, if the volume still fails for some reason, Longhorn can delete the workload pod (as long as it's controlled by a Deployment or StatefulSet) to allow Kubernetes to recreate it, which redoes the attach/mount operations to ensure the workload is up and running correctly with the volume.
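The condition in the last item (the pod must be controlled by a Deployment or StatefulSet) can be checked from the pod's `metadata.ownerReferences`. The Python sketch below shows one way to express that test on a pod manifest loaded as a dictionary. It is an illustration only, not Longhorn's actual implementation, and the function name is made up. Note that a Deployment manages its pods through a ReplicaSet, so `ReplicaSet` is the owner kind that appears on such pods.

```python
# Owner kinds whose controllers recreate a deleted pod. Pods of a
# Deployment are owned by a ReplicaSet, so that kind is listed here.
RECREATING_KINDS = {"ReplicaSet", "StatefulSet"}


def will_be_recreated(pod: dict) -> bool:
    """Return True if the pod has a controlling owner of a recreating kind."""
    owners = pod.get("metadata", {}).get("ownerReferences") or []
    return any(
        ref.get("controller") is True and ref.get("kind") in RECREATING_KINDS
        for ref in owners
    )
```

A bare pod with no owner references returns `False` here, which corresponds to the case where deleting the pod would not bring the workload back.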

@DarianAnjuhal You're right. It's not the full answer since the expectation is zero downtime. In the future, we should expect the Longhorn system to tell us it's short on CPU, or memory, or network, or even disk IO. In that way, we should know the cause of any outage and find ways to mitigate it. I've filed https://github.com/longhorn/longhorn/issues/1930 to track that effort.
