Kops: etcd backups

Created on 14 Feb 2017 · 31 comments · Source: kubernetes/kops

We should use commands similar to this:

etcd2: etcdctl backup --data-dir=foo --backup-dir=bar, and then push the result up to storage.
etcd3: the command is a bit different: etcdctl --endpoints=127.0.0.1:2379 snapshot save /backup/dir/snapshot.db
Also, for etcd3 you need to set an env var: ETCDCTL_API=3
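As a rough sketch of the two flavors (all paths and the bucket name here are placeholders, not values from the issue):

```shell
#!/bin/sh
# Sketch of the two backup flavors described above.
# Data dirs, snapshot paths, and the S3 bucket are assumptions.

backup_etcd2() {
  # etcd2: copies the data dir into a backup dir, then syncs it to storage
  etcdctl backup --data-dir="$1" --backup-dir="$2"
  aws s3 sync "$2" "$3"
}

backup_etcd3() {
  # etcd3: the v3 API needs ETCDCTL_API=3 and uses `snapshot save`
  ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 snapshot save "$1"
  aws s3 cp "$1" "$2"
}

# Example invocation (commented out; requires a running etcd):
# backup_etcd3 /tmp/snapshot.db s3://my-bucket/etcd-backups/snapshot.db
```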

Label: lifecycle/rotten

Most helpful comment

@chrislovecnm Is there any documentation available yet?

From what I can see one could use the following config:

  etcdClusters:
  - backups:
      backupStore: s3://bucket/cluster.example.com/backups/etcd/main/
    etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    name: main
  - backups:
      backupStore: s3://bucket/cluster.example.com/backups/etcd/events/
    etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    name: events

Can an existing cluster be updated with that? And does it need anything else? How often does it backup per default?

All 31 comments

This isn't really a kops-specific thing either, if we can do this in a non-kops-specific way.

@justinsb do you want to create a command under kops? If so, I think there should also be a command to restore the backup.

@justinsb, I'd be happy to help with this. Do you have a particular design in mind for it? I'm wondering if we want the ability to schedule a regular backup and upload to S3/Cloud Storage?

@robinpercy can the etcd operator be set up for just backups? We should have an addon for this; someone must have an operator. We should have update and rolling-update call the operator as well, when the masters are touched.

@chrislovecnm good call. I'll have a look at what the operator (and others) provide. +1 for calling it before rolling updates.

Looking at the operator, it appears you need to create the etcd cluster via the operator; we would need to dig through the code. We should just make this simple: a cronjob or a pod that calls backup and stores the data on a PVC. An optional addon.

Figured I would chime in here and provide another data point, as I created some backup/restore tooling for etcd3 prior to using kops to manage our k8s clusters.

(none of this applies to etcd2).

The Setup

Similar to kops, we ran a sidecar program that was responsible for bootstrapping the etcd servers, however it would continue running after etcd had started to perform periodic backups.

Backup

Every 10 minutes the sidecar would create a backup via the Snapshot method on the etcd client, and push the resulting data to a versioned S3 bucket.

We ran this every 10 minutes, and set a lifecycle policy to expire old versions of the backups.
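The loop described above might look roughly like this (bucket name, endpoint, and directories are assumptions, not the poster's actual values):

```shell
#!/bin/sh
# Sketch of the 10-minute snapshot-and-upload loop described above.
# BACKUP_DIR, the endpoint, and the S3 prefix are placeholders.

BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
S3_PREFIX="${S3_PREFIX:-s3://my-bucket/etcd-backups}"

backup_once() {
  snap="$BACKUP_DIR/snapshot-$(date +%Y%m%d%H%M%S).db"
  mkdir -p "$BACKUP_DIR"
  # etcd3 snapshot via the v3 API
  ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 snapshot save "$snap"
  # push to a versioned bucket; a lifecycle policy expires old versions
  aws s3 cp "$snap" "$S3_PREFIX/$(basename "$snap")"
}

# The actual loop (commented out so the sketch can be sourced safely):
# while true; do backup_once; sleep 600; done
```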

Restore

Unfortunately the Etcd client doesn't provide a way to restore without using the CLI.

For our purposes we used the code from the snapshot_command.go source as a guideline, but we could have also done it via shelling out to the CLI tool instead.

On instance startup, the sidecar would check if there was an available backup in S3, and if so restore it to the filesystem before starting Etcd.
If there wasn't a backup, it would start in a new cluster configuration.
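The startup decision described here could be sketched as follows (the bucket, object name, and data dir are placeholders; the restore mirrors what snapshot_command.go does via etcdctl):

```shell
#!/bin/sh
# Sketch of the start-up logic: restore from S3 if a backup exists,
# otherwise bootstrap a new cluster. Bucket and paths are assumptions.

fetch_backup() {
  # succeeds (and writes "$1") only if a backup object exists in S3
  aws s3 cp "s3://my-bucket/etcd-backups/latest.db" "$1" 2>/dev/null
}

start_etcd() {
  snap="${TMPDIR:-/tmp}/restore.db"
  if fetch_backup "$snap"; then
    # rebuild the data dir from the snapshot before starting etcd
    ETCDCTL_API=3 etcdctl snapshot restore "$snap" --data-dir=/var/etcd/data
    echo restore
  else
    # no backup found: start with a fresh cluster configuration
    echo new-cluster
  fi
}
```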

After being sidetracked for the last couple of weeks, here's what I'm thinking. For now I'm targeting etcd2, since it's the only option supported by kops, but I think this will be easily adapted (and even simplified for etcd3).

General Approach

  • Install a privileged pod on each master
  • the pod contains 2 containers and an emptyDir volume
  • container 1 is responsible for backing up the "local" etcd data dir to a known location on the emptyDir volume
  • container 2 is responsible for watching that known location and doing something with the backup
  • we will provide a reasonable default implementation of container 2 that ships the etcd backup to S3
  • users can easily customize this behaviour by overriding the container 2 image.

Key considerations:

  • Backups require access to the filesystem in etcd2, they can't be done remotely
  • We don't want to take backups off of minority members during a partition (thus only backing up the leader)
  • Offsite storage mechanisms will be use-case specific and should be easily customized

I'm currently not clear on what's possible using the add-ons mechanism. In particular, can an add-on be defined in a way that reads the master count before deployment (e.g. will set replicas to 1 or 3 depending on cluster config)? Or is this better done with static manifests on the masters?
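A rough sketch of the pod described above; the image names are placeholders, and the second image is the piece users would override for custom offsite storage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-backup
spec:
  hostNetwork: true
  volumes:
  - name: backups
    emptyDir: {}
  containers:
  - name: backup              # container 1: writes etcdctl backups to the shared volume
    image: example/etcd-backup:latest       # placeholder image
    volumeMounts:
    - name: backups
      mountPath: /backups
  - name: shipper             # container 2: watches /backups and uploads to S3
    image: example/etcd-backup-s3:latest    # placeholder; override for custom storage
    volumeMounts:
    - name: backups
      mountPath: /backups
```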

@justinsb @chrislovecnm what do you think of the above design?

Having this as an addon is fine, or we could deploy it as a sidecar with etcd. Do we need two containers? A single loop seems reasonable. Should we back up to a PVC or S3? I am thinking a PVC is better for IAM perms; we do not need yet another bucket.

Do we need backups or can we do snapshots? https://github.com/kubernetes-incubator/external-storage/tree/master/snapshot

How do we monitor that backups are occurring?

@chrislovecnm:

So, the problem I see with a PVC is that we'd need one for each backup pod, and we end up with backups spread across each (based on whichever pod happens to be beside the leader at any given time). The two-container approach is just there so users can easily override that "upload" behavior however they want.

I've avoided the snapshot route due to the conversation here:
https://github.com/kubernetes/kubernetes/issues/40027#issuecomment-288930752 and related kops issue: https://github.com/kubernetes/kops/issues/1506. In my experience etcd2 hasn't been very resilient to minor deltas between member data stores. The etcdctl backup approach does make for a more tedious restore process, but it seems to be the only one endorsed by CoreOS (for etcd2).

I'm open to suggestions about monitoring. I was thinking of exposing a prometheus-compatible endpoint on each backup pod that includes the timestamp, size and location of the latest backup. Then users can collect that however they like.

@edulop91 we would love your help with this. I think a separate controller would be best. Protokube is not HA-aware, and would need to be if we put it in protokube.

The benefit of putting it in protokube is having a restore that could be triggered w/o k8s running.

Let's just start iterating and make it awesome through more iterations.

In case it is useful: kube-aws does what you're discussing using systemd and a handy etcdadm script crafted by @mumoshu and friends. It takes 1-minute backups to S3, automatically 'resets' failed etcd nodes, and automatically recovers a failed etcd cluster from the S3 backup. The differences that might not fit your use case are that kube-aws uses etcd3 and dedicated etcd instances, but the logic might still be relevant.

https://github.com/kubernetes-incubator/kube-aws/blob/master/docs/advanced-topics/etcd-backup-and-restore.md

We merged the alpha version of using kopeio etcd manager. This will be available in kops 1.9.

JFYI, in kube-aws an etcd node gets a reset only when its data seems broken.

When the etcd node was terminated by any transient issue, it just comes back with the same identity (same EBS volume and Elastic IP) and continues the job as usual. No reset in this case.

I find this simpler and more reliable, as we don't need to manipulate etcd cluster membership at all.

automatically 'resets' failed etcd nodes, and automatically recovers a failed etcd cluster from the S3 backup

It was a complement to this info!
Not sure whether kops or etcd-manager does things differently, but hope this helps anyway.

@chrislovecnm Is there any documentation available yet?

From what I can see one could use the following config:

  etcdClusters:
  - backups:
      backupStore: s3://bucket/cluster.example.com/backups/etcd/main/
    etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    name: main
  - backups:
      backupStore: s3://bucket/cluster.example.com/backups/etcd/events/
    etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    name: events

Can an existing cluster be updated with that? And does it need anything else? How often does it backup per default?
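One hypothetical way to roll such a config into an existing cluster would be the usual kops edit/update/rolling-update flow; the cluster name below is a placeholder and this is a sketch, not a confirmed answer to the question:

```shell
# Hypothetical flow for applying a backupStore config to an existing
# cluster; `kops edit` is where the backups/backupStore stanza would be added.

apply_etcd_backup_config() {
  cluster="$1"
  kops edit cluster "$cluster"                    # add backups.backupStore to etcdClusters
  kops update cluster "$cluster" --yes            # apply the spec change
  kops rolling-update cluster "$cluster" --yes    # restart masters so it takes effect
}

# Example (commented out; requires kops and cluster state configured):
# apply_etcd_backup_config cluster.example.com
```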

Backups are being collected every 5 minutes and stored to s3.

Unfortunately my current tests show that only the etcd main cluster is backed up.

I'd rather suggest creating an add-on to enable Ark than snapshotting to S3. It's way easier to manage backups using Ark.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

+1

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


where the hell is this documented?
my main etcd db is being snapshotted to S3, but the events one is not... I have no idea why.

I agree with @marekaf. This should be reopened. It's not complete, as there is no documentation on this functionality.

/remove-lifecycle rotten

I've applied the etcdClusters.backups.backupStore configs, and I can see both main and events for kops 1.11.1 in the S3 bucket. But I only see a JSON file with this content:

{
  "memberCount": 1,
  "etcdVersion": "3.2.24"
}

While my etcd version defined in the master's user-data is

etcdClusters:
  events:
    image: k8s.gcr.io/etcd:2.2.1
    version: 2.2.1

In both events.yaml and main.yaml I can see the command has changed, but there is no tar.gz in the bucket yet. Here is an example of my main.yaml command section:

  - command:
    - /bin/sh
    - -c
    - mkfifo /tmp/pipe; (tee -a /var/log/etcd.log < /tmp/pipe & ) ; exec /etcd-manager
      --backup-store=s3://bucket/cluster.example.com/backups/etcd/main/
      --client-urls=https://__name__:4001 --cluster-name=etcd --containerized=true
      --dns-suffix=.internal.cluster.example.com --etcd-insecure=false --grpc-port=3996
      --insecure=false --peer-urls=https://__name__:2380 --quarantine-client-urls=https://__name__:3994
      --v=6 --volume-name-tag=k8s.io/etcd/main --volume-provider=aws --volume-tag=k8s.io/etcd/main
      --volume-tag=k8s.io/role/master=1 --volume-tag=kubernetes.io/cluster/cluster.example.com=owned
      > /tmp/pipe 2>&1
    image: kopeio/etcd-manager:3.0.20190516
    name: etcd-manager

But the question remains (besides the lack of documentation about this feature): how do we restore the backup?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
