There are quite a few issues where users have ruined their clusters following a master upgrade. Many of them simply wipe the cluster and start from scratch, which is not always possible in a production scenario.
The current backup/restore documentation for etcd-manager is very light, and fails to explain how to perform a restore when the etcd cluster has "changed" (e.g. because a node was removed and another added).
I've tried for hours to understand how this all fits together, but I'm simply stuck. Now that we realize we're using kops without being able to perform a restore if something goes wrong with a master, we are extremely worried: in practice it means we can't touch the masters at all.
I would argue that with "native" etcd there is at least some documentation on how to add and remove nodes from a cluster, so basic disaster recovery is fairly straightforward. Etcd-manager abstracts all of this away, and I would therefore argue that it's extremely important to document the options one has for managing the etcd cluster "through" etcd-manager.
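For reference, the "native" etcd member-management workflow mentioned above looks roughly like this. This is an illustrative sketch using the etcdctl v3 API; the endpoint, member ID, member name, and peer URL are all placeholders and would differ in a real cluster:

```sh
# List current members to find the hex ID of the failed node
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 member list

# Remove the failed member by its ID (taken from the list output)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 member remove 8e9e05c52164694d

# Register the replacement member before starting etcd on the new node
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 member add etcd-c \
  --peer-urls=https://10.0.3.10:2380
```

With etcd-manager none of this is exposed directly, which is exactly why the restore procedure needs documenting.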
I would be happy to submit a PR with enhanced documentation, but I'm simply not able to figure this out.
Thanks @trondhindenes for raising this issue. We are facing the same situation. Our clusters are stilling running Kops/kubernetes 1.10 version because we don't have a proper rollback strategy for upgrade of masters. Our Etcd databases are also running as 2.x in the production. Even we tested to upgrade them to 3.x successfully in other environments multiple times. We are still afraid to do that in production without a detailed document of how to restore them properly.
Thanks in advance for anyone who can share their knowledge or submit a PR with enhanced documentation.
@nickychow I ended up writing this, but I don't know if it works for you. https://t.co/HE0SnoGfZH
We're in almost the same situation as you, @trondhindenes. Trying to upgrade from 1.10.13 with kops 1.12.2, we have a cluster with three master nodes, but one of them is out of the etcd "ring" and we are not able to recover from this bad scenario.
Thanks for the documentation @trondhindenes, you made my day with this small guide; my masters are back thanks to this blog post.
@trondhindenes Thanks heaps for your great work. That's definitely going to help a lot.
@jjuarez I've made a small addition to the blog post detailing how to remove some stale etcd records which will corrupt the kubernetes service if left unchecked. I'm pretty sure that was added after you saw the guide, so feel free to revisit. It's at the very end of the blog post.
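For context on what "stale etcd records" likely means here: the kube-apiserver endpoint reconciler builds the default `kubernetes` service endpoints from lease keys under `/registry/masterleases/`, so leases left behind by removed masters can point the service at dead IPs. A hedged sketch of inspecting and cleaning these up (the endpoint port and the stale IP are assumptions; follow the blog post for the authoritative steps):

```sh
# List the apiserver lease keys; each key corresponds to a master's IP
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:4001 \
  get /registry/masterleases/ --prefix --keys-only

# Delete a lease belonging to a master that no longer exists (IP is a placeholder)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:4001 \
  del /registry/masterleases/10.0.1.5
```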
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
For those who want the updated url: https://www.hindenes.com/#/post/49956e54-04a7-45b7-84ea-fd6c2cbb4e15
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
correct url for the post is: https://hindenes.com/2019-08-09-kops-restore
For those who are using the nginx-ingress controller and getting 404 errors for sites exposed through ingress after restoring an etcd backup, follow the additional steps below after the etcd restore.
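(The original steps were not captured in this thread. A common remediation in this situation is to restart the ingress controller pods so they re-sync endpoints from the restored state; the sketch below assumes a typical nginx-ingress install, and the namespace and label selector are assumptions that may not match your deployment:)

```sh
# Restart the ingress controller pods so they rebuild their
# endpoint/configuration state from the restored cluster
kubectl delete pod -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Watch the replacement pods come back up
kubectl get pods -n ingress-nginx -w
```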