Kops: Improve backup/restore documentation w/ etcd-manager

Created on 23 May 2019 · 31 comments · Source: kubernetes/kops

The current backup/restore documentation is outdated and no longer applicable with etcd-manager.
As discussed yesterday at KubeCon, the restore documentation should be updated to provide better information on how to do a restore with etcd-manager. The etcd-manager repository does include some basic information on the restore procedure, but we should probably include more, including troubleshooting steps.

As mentioned at KubeCon, I'd like to work on this documentation. @justinsb Do you think these docs fit best in the etcd-manager repo (and then link to them from the kops docs), or should they go in the kops repository? From a technical point of view they make more sense in the etcd-manager repository, though from a user perspective maybe they belong here in the kops repository.

Ideas for additional documentation:

  • Include regular steps for recovery (etcd-manager-ctl, listing backups, restoring both the events and main clusters); a sketch follows this list
  • Fixing leadership token issues
  • Fixing issues with old masters in master leases in etcd (kube-apiserver not working from the cluster, flannel not working)
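For the first item, a minimal sketch of those recovery steps, assuming a kops cluster with etcd-manager and the example S3 backup store used later in this thread (adjust the bucket path to your own state store):

# List the available backups for the main and events clusters
etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main list-backups
etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events list-backups

# Queue a restore for each cluster, using backup names from the listings above
etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main restore-backup <main-backup-name>
etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events restore-backup <events-backup-name>

# The restore does not start immediately: etcd-manager has to pick up the queued
# command, so restart it on the masters (or roll the masters) afterwards.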


All 31 comments

Thanks so much for opening this! It was great meeting you this week!

I agree, I'm torn between the two locations; maybe most of it belongs in etcd-manager, but a core subset belongs here? I.e. the basics here, then link to the full docs for more details?

Or I guess, in your outage, you probably would look here then appreciate a link to more docs, so that's how I would focus on building them. Does that make sense?

Looking forward to reviewing these! Remember, it doesn't need to be perfect all at once, incremental changes can be helpful too, especially whatever was able to save you (others may run into the same thing)!

@mikesplain Yeah, that makes sense! So the basic backup/restore procedure here, and then link for more troubleshooting steps etc to the etcd-manager repo.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

kopeio/etcd-manager#201 is not clear to me; I am new to bazel and etcd and unable to build this.

@pracucci Can you help us with some guidance on how to do a backup and restore using 'etcd-manager-ctl'?

@vijay-veeranki-moj The binaries are located here (in assets with the latest releases): https://github.com/kopeio/etcd-manager/releases

Docs for how to do a backup/restore are here: https://github.com/kubernetes/kops/blob/master/docs/etcd/backup-restore.md

Edit: I see the documentation doesn't say yet that the binaries are now being shipped with releases. I'll try and get a PR through tonight to fix that.
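In the meantime, a hedged sketch of grabbing the CLI from a release (the version and asset name below are placeholders; pick the actual ones from the release assets):

# Download etcd-manager-ctl from the release assets and put it on the PATH
wget https://github.com/kopeio/etcd-manager/releases/download/<version>/etcd-manager-ctl-linux-amd64
chmod +x etcd-manager-ctl-linux-amd64
sudo mv etcd-manager-ctl-linux-amd64 /usr/local/bin/etcd-manager-ctl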

The following is from the docs. It would be nice if the "restart etcd" part was clarified a bit. Right now we just roll all masters and hope for the best.

Note that this does not start the restore immediately; you need to restart etcd on all masters (or roll your masters quickly). A new etcd cluster will be created and the backup will be restored onto this new cluster. Please note that this process might take a short while, depending on the size of your cluster.

https://github.com/kopeio/etcd-manager instead of "restart etcd on all masters" says "bounce the first etcd-manager process (control-C and relaunch)".
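In practice, a hedged sketch of the two ways of doing that (the state-store path and container ID are placeholders; note etcd-manager runs as a static pod, so deleting its mirror pod with kubectl alone generally doesn't restart the process):

# Option 1: roll all masters quickly so the queued restore command is picked up
kops rolling-update cluster --cloudonly --instance-group-roles master \
  --master-interval=1s --state s3://<your-kops-state-store> --force --yes

# Option 2: SSH to each master and bounce the etcd-manager containers directly
docker ps | grep etcd-manager   # find the etcd-manager main and events containers
docker restart <container-id>   # restart each of them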

It would be nice to have a complete scenario on the backup/restore process on AWS, including the expected log entries at every step.

I am trying to restore the cluster from an S3 backup, using
etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main restore-backup [main backup file],
but I am facing issues, and there is no guidance anywhere related to them.

My scenario to test is:
1) Create a cluster using kops and do some deployments; backups are written to S3.
2) kops delete cluster (does kops also delete the S3 backups created here?)
3) kops create cluster using the same kops config file.
4) Cluster is up and running.
5) Try to restore from the S3 backup.

The issue: when I initially deleted the cluster using kops delete, it deleted the S3 backups, but the bucket is versioned so I still have them.
When I restore from those backups using etcd-manager-ctl, I am getting these errors:

a) unexpected error running etcd cluster reconciliation loop: 
b) (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
c) Connection to the datastore is unauthorized from calico
d) loading OpenAPI spec for "v1beta1.admission.certmanager.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable - from kube-apiserver
e) kube-dns error Pod sandbox changed, it will be killed and re-created.

If we had clear guidance on the steps to do the restore, it would be very helpful. For example, is running etcd-manager-ctl from our workstation OK, or do we need to run it inside the leader container?

a) unexpected error running etcd cluster reconciliation loop:
b) (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
^^ These errors eventually resolved on their own.

c) Connection to the datastore is unauthorized from calico
d) loading OpenAPI spec for "v1beta1.admission.certmanager.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable - from kube-apiserver
e) kube-dns error Pod sandbox changed, it will be killed and re-created.

The errors above were caused by the tokens: I removed the old tokens from kube-system, new ones got created, and I deleted the pods. They are back up and running now.

/remove-lifecycle stale

@vijay-veeranki-moj @olemarkus Added download info & clarified restarting etcd in #7506.

@marcindulak that part of the documentation refers to local development, not restoring a production cluster.

Vijay - I am not sure about the other errors you're getting. It might take a while for the cluster to fix itself after a restore, or it might even require rolling all your nodes to force it to change. Regarding your question ("is running etcd-manager-ctl from our workstation OK, or do we need to run it inside the leader container?") - yes, you can run it locally as long as you have access to the state store. All etcd-manager-ctl does is add a command to a file in S3, which will be picked up by the etcd-manager leader.
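Concretely, a hedged sketch of running it from a workstation (the AWS profile name is just an example; any credentials with access to the backup store bucket will do):

# etcd-manager-ctl only needs read/write access to the backup store; the actual
# restore is performed by the etcd-manager leader once it picks up the queued command.
export AWS_PROFILE=my-cluster-admin
etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main list-backups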

Thanks @dzoeteman

Regarding fixing the below issues, can we get some guidance on them?
Fixing leadership token issues
Fixing issues with old masters in master leases in etcd (kube-apiserver not working from the cluster, flannel not working)

Also, I am facing an issue with all the tokens after the restore from backup, for example the tokens below.

NAME                                             TYPE                                  DATA   AGE
default-token-srztw                              kubernetes.io/service-account-token   3      20d
kube-proxy-token-x7p78                           kubernetes.io/service-account-token   3      20d
kube-dns-token-h6qfk                             kubernetes.io/service-account-token   3      20d
dns-controller-token-vtn85                       kubernetes.io/service-account-token   3      20d
kube-dns-autoscaler-token-dbm5d                  kubernetes.io/service-account-token   3      20d
calico-node-token-6w6p2                          kubernetes.io/service-account-token   3      20d
cronjobber-token-4n82j                           kubernetes.io/service-account-token   3      20d
tiller-token-xnmnx                               kubernetes.io/service-account-token   3      20d
metrics-server-token-t9pv9                       kubernetes.io/service-account-token   3      20d
external-dns-dsd-external-dns-token-w26ts        kubernetes.io/service-account-token   3      20d
external-dns-token-24kjn                         kubernetes.io/service-account-token   3      20d

All the pods related to these tokens are not ready, pending, or crash-looping.

To fix this I am deleting the tokens and restarting the pods. Do you know why this is caused?
Am I doing it right, or is there another procedure to fix this issue?
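For completeness, a hedged sketch of the workaround I am using (the field selector and the label selectors are examples, not commands verified here):

# List the stale service-account token secrets in kube-system
kubectl -n kube-system get secrets --field-selector type=kubernetes.io/service-account-token

# Delete the stale ones so they are re-issued by the restored control plane
kubectl -n kube-system delete secret kube-dns-token-h6qfk calico-node-token-6w6p2

# Restart the pods that mount them (example label selectors)
kubectl -n kube-system delete pod -l k8s-app=kube-dns
kubectl -n kube-system delete pod -l k8s-app=calico-node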


Thanks for the clarifications @dzoeteman. I can verify the documentation additions & improvements you are suggesting in #7506, as these were the ones I found out after researching etcd restoration over the last few days.

1) After submitting the restore command with etcd-manager-ctl, I first tried to trigger the restoration by using
kubectl get pods -n kube-system --no-headers=true | awk '/etcd-manager/{print $1}'| xargs kubectl delete -n kube-system pod
which did not work.

2) I then continued trying the restore process with

kops rolling-update cluster --cloudonly --instance-group-roles master --master-interval=1s --state s3://blah-blah --force --yes

which did work, as it quickly rolled the masters. However, after the restoration I encountered several issues: many containers from the restored (previous) state did not schedule and stayed in Pending state. So even though I managed to trigger the restore command, I still haven't managed to make use of it.

3) I have also tried to

a) Create a cluster
b) Install Several Packages on it
c) Keep etcd-main & etcd-events backups from S3
d) Delete Cluster
e) Create new Cluster
f) Restore the old Backup

but there were still many issues during that phase and I cannot mark it as successful.

4) After that I SSH'ed into the K8s nodes and issued a docker restart on the leader of etcd-main & on the leader of etcd-events.
This triggered the restoration process, but I still encountered some problems with the cluster after the restoration finished (the dns-controller pod couldn't be scheduled; I deleted the pod & scaled the deployment down & up, but after that the pod didn't show up at all in kubectl get pods).

I will continue testing how we can reliably restore etcd via etcd-manager, as we need a good DR plan for our clusters; if I have any updates I will post them here.

Hi @angeloskaltsikis

Thanks for sharing your experience.

I am performing a similar task, restoring the cluster.

I am facing similar issues. One thing I tried is deleting the old tokens, which get created back automatically, and then restarting the pods, which fixed some of the issues. (Not sure why I should have to do that, though.)

Some guidance here https://hindenes.com/2019-08-09-Kops-Restore/

@angeloskaltsikis Thanks for the review, I will take a look once I get home.
@vijay-veeranki-moj Regarding the leadership token issue and master leases: I will see what I can add in the coming days. I'm unsure whether I should put those docs here in the kops repo or in the etcd-manager repo, though.

Thanks for the tips @vijay-veeranki-moj. I have managed to get a lot of stuff running on a new cluster using a backup from a previous cluster, after deleting some tokens from the secrets. HOWEVER, some pods keep getting errors, mainly because the kubernetes svc in the default namespace includes some master node endpoints from the old cluster. I tried to delete it to resolve the issue, but unfortunately that didn't help, as the service got recreated with exactly the same endpoints. I also tried rolling all my masters & nodes, but that didn't help at all. As a result, several requests to the kubernetes service are timing out or cannot find the attached nodes.

@angeloskaltsikis that's the issue I mentioned regarding master leases. Basically you have to manually remove the old master IP from the master leases in Etcd. This requires a manual Etcd edit. I'll include that in the docs too later when I have some more time.

Edit: for more info: the kubernetes svc gets automatically filled with endpoints based on the master leases, so it doesn't help to remove them from the svc, as they'll just get added again.
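Until then, a hedged sketch of that manual edit (the /registry/masterleases/ prefix is where kube-apiserver keeps these leases; the endpoint port and certificate paths below are assumptions, so check the kube-apiserver manifest on a master for the exact --etcd-* flags before running anything):

# On a master: list the lease keys kube-apiserver has registered, one per master IP
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --cacert=<etcd-ca.crt> --cert=<etcd-client.crt> --key=<etcd-client.key> \
  get /registry/masterleases/ --prefix --keys-only

# Delete the lease belonging to an old master IP that no longer exists
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --cacert=<etcd-ca.crt> --cert=<etcd-client.crt> --key=<etcd-client.key> \
  del /registry/masterleases/<old-master-ip>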

Opened a PR at the etcd-manager repo for troubleshooting steps (kopeio/etcd-manager#251)

Thanks @dzoeteman

One more question please: I am getting the error below after the restore and after restarting the etcd containers.
Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, square/go-jose: error in cryptographic primitive]]

When I delete the existing tokens they get recreated, and once I restart the pods this error is fixed.

The problem is that we need to find all the old tokens across all the namespaces and remove them.

Will the restore process eventually remove the tokens itself, or do I need to wait longer for this to happen?

Why is this issue caused? Is it related to the "fixing leadership token issues" item? Can this be included in the troubleshooting docs, please?

@vijay-veeranki-moj Unfortunately, I haven't seen this error before, so I'm unsure of how to help :/

@vijay-veeranki-moj I can confirm I have faced that error as well, and it got solved by deleting the secrets/tokens. (I will perform the whole procedure one more time and post an update here on exactly which secrets I deleted to solve it.)

I have managed to restore the state of the cluster and get everything running after deleting the master leases (thanks for the link & thanks @dzoeteman for the documentation update on how to do that) and every deployment/statefulset that had a PVC, as the cluster couldn't locate those.
Unfortunately, I am not sure whether we could automate the deletion of tokens to make the cluster restore easier.
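If someone does want to script it, a hedged sketch of such a cleanup (an assumption, not something I have verified; the affected pods still need to be restarted afterwards):

# Delete every service-account token secret so the control plane re-issues them
# with the restored cluster's signing key
kubectl get secrets --all-namespaces \
  --field-selector type=kubernetes.io/service-account-token \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
while read -r ns name; do
  kubectl -n "$ns" delete secret "$name"
done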

It would be nice to have a complete scenario on the backup/restore process on AWS, including the expected log entries at every step.

@marcindulak Is this https://github.com/kopeio/etcd-manager/blob/master/docs/backup-restore.md close to what you're looking for?


Not exactly. The above link provides a conceptual overview, and is a good starting point for a developer who wants to read the code. What I'm looking for is the complete operational scenario: the exact commands to put the etcd cluster in a certain failed state, then the commands executed and their expected outcome in the logs. Something I can give to a system administrator and say: take this document, create a test k8s cluster, verify that the procedure works and create an internal procedure based on this.

I have to add that the discussions in this issue evolved into solving particular cases encountered by various people. I'm looking to have these cases described in the documentation.

@marcindulak Makes a lot of sense. I'll see what I can do, but I hope to get some more feedback on the docs PRs before I work on it, just to make sure I'm on the right path, and also in the right repo (I'm still really unsure about what should go in the kops repo and what should go in the etcd-manager repo).

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
