There are quite a few issues where users have ruined their clusters following a master upgrade. Many of them simply wipe the cluster and start from scratch, which is not always possible in a production scenario.
The current backup/restore documentation for etcd-manager is very light, and fails to explain how to perform a restore when the etcd cluster has "changed" (e.g. because a node was removed and another added).
I've tried for hours to understand how this all fits together, but I'm simply stuck. Now that we realize we're using kops without being able to perform a restore if something goes wrong with a master, we are extremely worried: in practice it means we can't touch the masters at all.
I would argue that with "native" etcd there is at least some documentation on how to add and remove nodes from a cluster, so basic disaster recovery is fairly straightforward. Etcd-manager abstracts all of this away, and I would therefore argue that it's extremely important to document the options one has for managing the etcd cluster "through" etcd-manager.
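For reference, the "native" etcd member-management workflow mentioned above looks roughly like this. This is an illustrative sketch using the etcdctl v3 API; the endpoint, member ID, member name, and peer URL are all placeholders and would differ in a real cluster:

```sh
# List current members to find the hex ID of the failed node
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 member list

# Remove the failed member by its ID (taken from the list output)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 member remove 8e9e05c52164694d

# Register the replacement member before starting etcd on the new node
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 member add etcd-c \
  --peer-urls=https://10.0.3.10:2380
```

With etcd-manager none of this is exposed directly, which is exactly why the restore procedure needs documenting.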
I would be happy to submit a PR with enhanced documentation, but I'm simply not able to figure this out.
Thanks @trondhindenes for raising this issue. We are facing the same situation. Our clusters are stilling running Kops/kubernetes 1.10 version because we don't have a proper rollback strategy for upgrade of masters. Our Etcd databases are also running as 2.x in the production. Even we tested to upgrade them to 3.x successfully in other environments multiple times. We are still afraid to do that in production without a detailed document of how to restore them properly.
Thanks in advance for anyone who can share their knowledge or submit a PR with enhanced documentation.
@nickychow I ended up writing this, but I don't know if it works for you. https://t.co/HE0SnoGfZH
We're in almost the same situation as you, @trondhindenes. Trying to upgrade from 1.10.13 with kops 1.12.2, we have a cluster with three master nodes, but one of them is out of the etcd "ring" and we are not able to recover from this bad scenario.
Thanks for the documentation @trondhindenes, you made my day with this small guide; my masters are back thanks to this blog post.
@trondhindenes Thanks heaps for your great work. That's definitely going to help a lot.
@jjuarez I've made a small addition to the blog post detailing how to remove some stale etcd records which will corrupt the kubernetes service if left unchecked. I'm pretty sure that was added after you saw the guide, so feel free to revisit. It's at the very end of the blog post.
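For context on what "stale etcd records" likely means here: the kube-apiserver endpoint reconciler builds the default `kubernetes` service endpoints from lease keys under `/registry/masterleases/`, so leases left behind by removed masters can point the service at dead IPs. A hedged sketch of inspecting and cleaning these up (the endpoint port and the stale IP are assumptions; follow the blog post for the authoritative steps):

```sh
# List the apiserver lease keys; each key corresponds to a master's IP
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:4001 \
  get /registry/masterleases/ --prefix --keys-only

# Delete a lease belonging to a master that no longer exists (IP is a placeholder)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:4001 \
  del /registry/masterleases/10.0.1.5
```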
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
For those who want the updated url: https://www.hindenes.com/#/post/49956e54-04a7-45b7-84ea-fd6c2cbb4e15
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
correct url for the post is: https://hindenes.com/2019-08-09-kops-restore
For those who are using the nginx-ingress controller and getting 404 errors for sites exposed through ingress after restoring an etcd backup, follow the additional steps below after the etcd restore.
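(The original steps were not captured in this thread. A common remediation in this situation is to restart the ingress controller pods so they re-sync endpoints from the restored state; the sketch below assumes a typical nginx-ingress install, and the namespace and label selector are assumptions that may not match your deployment:)

```sh
# Restart the ingress controller pods so they rebuild their
# endpoint/configuration state from the restored cluster
kubectl delete pod -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Watch the replacement pods come back up
kubectl get pods -n ingress-nginx -w
```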