Microk8s: Gracefully recover from out of date or corrupted dqlite member

Created on 17 Sep 2020  ·  11 Comments  ·  Source: ubuntu/microk8s

Hello,

As always, thanks for your hard work on microk8s!

I have been running microk8s in a 3 node HA test-bed on the latest snap (from latest/edge) and, after a few days of operation, one of the dqlite members has fallen out of sync. We have dbctl for taking backups and restoring DB state, which is great; however, I haven't been able to pinpoint why or when a node falls out of sync, so I can't be sure under which circumstances this happens. I have set up various telemetry tools, and I can tell that there is no significant IO, memory or CPU pressure on the nodes; it looks as though the cluster just collapses. I have attached a microk8s inspect from an impacted node. The issue appears to point solely back to dqlite, as the apiserver and kubelet no longer seem to be able to communicate with dqlite, causing all manner of failures. I'm happy to dig into this separately if it occurs again, and will raise a separate issue, hopefully with more data about what caused the initial failure.

However, the more interesting thing here, which I would like to focus on, is that a rolling restart of the cluster, to try and get the dqlite cluster healthy, failed. I've done this in the past when this issue has cropped up (the failure of a single node's dqlite killing the control plane); however, this time it didn't work. Each node would time out trying to connect to port 19001, and eventually the apiserver would fail to start with a context cancelled error, i.e. a timeout. An strace of the process shows an attempt to connect over the network to the cluster port of another node (let's call this node node1). Inspecting node1 shows it is crashing with SIGSEGV. I dug through the DB directory, and the modification dates on all files are two days old, whilst the files under /var/snap/microk8s/current/var/kubernetes/backend on the other two nodes (let's call them node2 and node3) are newer, with similar dqlite files in place. All nodes have the correct cluster keys and configuration per the info.yaml and cluster.yaml; only the data differs. Clearing the data on node1 (the crashing node) and node3, and restoring it from the contents of the /var/snap/microk8s/current/var/kubernetes/backend directory on node2, making sure to preserve the info.yaml, allowed me to "repair" the cluster and start it back up, after which all services were available again.
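The manual repair described above (clear the stale backend directory, copy in the data from the healthy node, keep the local info.yaml) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a documented microk8s procedure: `restore_backend` is a hypothetical helper, the healthy node's files are assumed to have already been copied to a local directory, and "preserve the info.yaml" is read here as keeping the copy that already exists on the node being repaired.

```shell
#!/usr/bin/env bash
# Sketch only: restore one node's dqlite backend directory from a copy of a
# healthy node's backend directory, keeping the local info.yaml.
# restore_backend is a hypothetical helper, not part of microk8s.
#
# Usage: restore_backend HEALTHY_COPY_DIR BACKEND_DIR
#   BACKEND_DIR would be /var/snap/microk8s/current/var/kubernetes/backend
restore_backend() {
    local healthy_copy="$1"
    local backend="$2"
    local backup
    backup="${backend%/}.bak.$(date +%Y%m%d%H%M%S)"

    [ -d "$healthy_copy" ] || { echo "missing source: $healthy_copy" >&2; return 1; }
    [ -f "$backend/info.yaml" ] || { echo "no info.yaml in $backend" >&2; return 1; }

    # Keep a full copy of the broken state before touching anything.
    cp -a "$backend" "$backup" || return 1

    # Remove everything except this node's own info.yaml.
    find "$backend" -mindepth 1 -maxdepth 1 ! -name info.yaml \
        -exec rm -rf {} + || return 1

    # Copy the healthy data in, skipping the healthy node's info.yaml.
    # (cp -t is a GNU coreutils option.)
    find "$healthy_copy" -mindepth 1 -maxdepth 1 ! -name info.yaml \
        -exec cp -a -t "$backend" {} + || return 1

    # Print where the pre-restore backup was written.
    echo "$backup"
}
```

The sketch assumes microk8s is stopped on the node before the function runs (for example with `microk8s stop`) and started again afterwards; that ordering is an assumption and is not spelled out in the post.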

I can make a copy of the DB available for reference; it was too large to upload to this issue. However, I think the feature which is missing here is some level of consistency checking on the database files. Crashing with SIGSEGV suggests to me that any form of corruption here will prevent the cluster from starting up after a cluster-wide failure, which I think is unexpected behaviour for a HA configuration. Additionally, some documentation for recovery would be really useful for operators of microk8s clusters, as restoring dqlite clusters is not documented anywhere I could find. I'd be happy to contribute that documentation if you point me in the right direction. I essentially had to reverse-engineer the way dqlite works in order to repair the cluster, so it's fairly fresh in my mind.
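As a very rough illustration of the kind of pre-start sanity check meant here, the sketch below only verifies that each snapshot file in the backend directory has a companion metadata file and that neither is empty. It assumes the `snapshot-*` and `snapshot-*.meta` naming that appears in the file names quoted later in this thread. `check_snapshot_pairs` is a hypothetical helper: it does not parse dqlite's on-disk format and would not detect corrupted contents, only obviously incomplete snapshot pairs.

```shell
#!/usr/bin/env bash
# Sketch only: flag snapshot files without a .meta companion (and the
# reverse), plus empty snapshot files. This is not a real integrity check.
#
# Usage: check_snapshot_pairs BACKEND_DIR
# Returns 0 if nothing suspicious was found, 1 otherwise.
check_snapshot_pairs() {
    local backend="$1"
    local status=0
    local f
    for f in "$backend"/snapshot-*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        case "$f" in
            *.meta)
                [ -e "${f%.meta}" ] || { echo "orphan metadata: $f"; status=1; }
                ;;
            *)
                [ -e "$f.meta" ] || { echo "missing metadata for: $f"; status=1; }
                ;;
        esac
        [ -s "$f" ] || { echo "empty file: $f"; status=1; }
    done
    return "$status"
}
```

A check like this could run before dqlite is started, so that an incomplete snapshot is reported instead of being loaded.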

Inspect from one of the nodes failing to start -
sig-segv-inspection.tar.gz

Hope this is all useful, happy to provide any further information, or access to the environment.

All 11 comments

Thank you @devec0 for the detailed description. Could you add a topic under https://discuss.kubernetes.io/ and tag it with "documentation" and "microk8s"? This is where our docs are served from. Looking forward to reading more on this.

cc @freeekanayaka

No problem, I'll likely get to writing some docs up over the weekend.
For now, I've made sure I am on snap revision 1711 for all three nodes, and have telemetry deployed so I can get further metrics on system resource usage etc. if it ends up being a factor if/when the cluster falls apart again.

Thanks!

OK, great news!

Well, not for my cluster, or the show I was watching on Netflix last night... I had a power outage which suddenly cut power so my cluster was taken down very abruptly.

When I powered my microk8s cluster back on, I managed to perfectly reproduce this issue. One node is lagging from a replication perspective, the datastore is potentially corrupted on that node, and none of the nodes can recover from this situation.

The result is that the cluster is not able to start back up. I've attached a full microk8s inspect from all three hosts, and linked to tarballs of the datastore from each node, because they're too big to attach.

I hope this helps in reproducing the issue. I'll try a little later to see if my process for fixing it works again, and as I step through it, I'll document on the https://discuss.kubernetes.io/ discourse and tag with microk8s and documentation so anyone who needs to rebuild a dqlite-backed microk8s has a place to start from.
melchior-power-outage-inspection-20200922_212930.tar.gz
balthasar-power-outage-inspection-report-20200922_213013.tar.gz
casper-power-outage-inspection-report-20200922_213008.tar.gz

Datastore captures:
https://ec0.io/microk8s/balthasar-datastore-power-outage.tar.bz2
https://ec0.io/microk8s/casper-datastore-power-outage.tar.bz2
https://ec0.io/microk8s/melchior-datastore-power-outage.tar.bz2

I have written up this post on the kubernetes discourse, but was not able to add the correct tags ("documentation" and "microk8s"). Are those tags potentially restricted to certain users?

@devec0 thanks for the detailed report. I'll have a look at the tarballs you linked.

Unfortunately the tarballs don't contain the dqlite database directory. Is the cluster still in the same broken state? @ktsakalozos what directory is microk8s using to store dqlite data? We'd need to get a tarball of that.

I guess it's at /var/snap/microk8s/current/var/kubernetes/backend.

Sorry, my bad, I just noticed that there are 2 sets of tarballs that @devec0 provided, and the backend directory is in the second set. Should have looked more carefully at the beginning :) Looking now.

It seems that the node with address 172.16.10.21 has a corrupted snapshot. Removing the snapshot-2043-2843325-354658372 and snapshot-2043-2843325-354658372.meta files in the backend directory seems to at least get rid of that problem.

Similarly, the node with address 172.16.10.23 has some corrupted state; running the following in the backend directory:

rm 0000000002720400-0000000002721086 0000000002721087-0000000002721087 0000000002721088-0000000002721696 0000000002721697-0000000002722322
rm snapshot-1930-2722419-296124261 snapshot-1930-2722419-296124261.meta

seems to do the trick.
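The `rm` commands above delete the suspect files outright. A more cautious variant, sketched below, moves them into a separate directory so they can still be inspected or put back. `quarantine_files` is a hypothetical helper and the quarantine path in the usage comment is a placeholder, not something taken from this thread.

```shell
#!/usr/bin/env bash
# Sketch only: move named files out of the dqlite backend directory into a
# quarantine directory instead of deleting them.
# quarantine_files is a hypothetical helper, not part of microk8s or dqlite.
#
# Usage: quarantine_files BACKEND_DIR QUARANTINE_DIR FILE...
quarantine_files() {
    local backend="$1" quarantine="$2"
    shift 2
    local name
    mkdir -p "$quarantine" || return 1
    for name in "$@"; do
        if [ -e "$backend/$name" ]; then
            mv -- "$backend/$name" "$quarantine/" || return 1
            echo "moved $name"
        else
            echo "skipped $name (not found)" >&2
        fi
    done
}

# Example for the snapshot pair named above (placeholder quarantine path):
# quarantine_files /var/snap/microk8s/current/var/kubernetes/backend \
#     /root/dqlite-quarantine \
#     snapshot-1930-2722419-296124261 snapshot-1930-2722419-296124261.meta
```

Moving the files rather than removing them also preserves them for anyone who later wants to investigate the corruption.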

The second corruption, on node 172.16.10.23, is something I've seen before. We don't have a reproducer yet, but it's something I'll be working on, and at least I have an idea of what the problem could be.

The first corruption, on node 172.16.10.21, is more puzzling: it seems to be really bad snapshot data. I'll have to investigate further, since it's the first time we've seen it.

Can we close this issue? Thanks to @devec0 for detailing the steps to recover a failing node.

My understanding was that there were some additional corruptions that @freeekanayaka noticed in the dumps I provided, and the hope was that identifying those, and potentially being able to recover from them, would allow this to be closed out. I have still been seeing periodic corruption similar to that initially reported, and have been following the recovery steps I posted on the discourse to correct it, but I think having microk8s/dqlite detect those corruptions and roll back problematic snapshots or checkpoints would be the ideal outcome here.
