User Stories
As a cluster administrator, I would like to define a replication policy for my backups which will ensure that copies exist in other availability zones or regions. This will allow me to restore a cluster in case of an AZ or region failure.
Non-Goals
Features
_Original Issue Description_
There are a few different dimensions of a DR strategy that may be worth consideration. For AWS deployments, the trade-offs and complexity of running Multi-AZ are fairly negligible if you stay within a single region. As such, the Single-Region/Multi-AZ deployment is extremely common.
An additional requirement is often the ability to restore in another region, with more relaxed RTO/RPO, in case an entire region goes down.
Looking over #101 brought a few things to mind. A large wish list might include cross-AZ and cross-region replication, e.g. us-east-1a -> us-west-2b. Some of these capabilities are certainly available to users today (copying snapshots and S3 data), but they require additional external integrations to function properly. As a user, it would be more convenient if this could be done in a consolidated way.
@jbeda some of what @jrnt30 is describing sounds similar to your idea of "backup targets".
I just was going to post this as a feature request. :)
I just tried to do this from eastus to westus in Azure and started to think about how we could copy the snapshot and create the disk in the correct region. We could possibly have a restore target config? I also like the idea of creating multiple backups to other regions in case a region goes down or a cluster and its resources get deleted.
@jimzim this is definitely something we need to spec out and do! We've been kicking around the idea of a "backup target", which would replace the current Config kind. You could define as many targets as you wish, and when you perform a backup, you would then specify which target to use. There are some UX issues to reason through here...
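For a concrete sketch of what that could look like: the API Velero eventually shipped models each target as a `BackupStorageLocation` (which did replace the old `Config` kind), selected per backup. The bucket name and region below are placeholders.

```sh
# Illustrative only -- define an extra target, then pick it at backup time.
# The BackupStorageLocation kind and --storage-location flag are what Velero
# later shipped; the bucket and region values here are placeholders.
cat <<EOF | kubectl apply -f -
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: us-west-2
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-backups-us-west-2
  config:
    region: us-west-2
EOF

# Choose the target when performing a backup:
velero backup create my-backup --storage-location us-west-2
```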
@ncdc Maybe we can discuss this briefly at KubeCon? I have begun to make this work on Azure, but before I go too much further it would be good to talk about what your planned architecture is.
Sounds great!
This is very much what I'm thinking. We need to think about backup targets, restore sources, and ways to munge stuff with a pipeline. Sounds like we are all thinking similar things.
On Azure, you can create a snapshot into a different resource group than the one that the persistent disk is on, which means the snapshots could be created directly into the AZURE_BACKUP_RESOURCE_GROUP instead of AZURE_RESOURCE_GROUP.
Then, cross-RG restores should be quite simple as the source of the data will always be consistent and there should be no refs to AZURE_RESOURCE_GROUP.
I'm not sure whether being in the same Location is a requirement for this -- I've only tried it on two resource groups that are in the same Azure Location.
The command/output I used to test this:
```sh
az snapshot create --name foo --resource-group Ark_Dev-Kube --source '/subscriptions/xxx/resourceGroups/my-Dev-Kube1/providers/Microsoft.Compute/disks/devkube1-dynamic-pvc-0bbf7e11-9e82-11e7-a717-000d3af4357e'
```

```
DiskSizeGb    Location    Name    ProvisioningState    ResourceGroup    TimeCreated
------------  ----------  ------  -------------------  ---------------  --------------------------------
5             canadaeast  foo     Succeeded            Ark_Dev-Kube     2018-01-09T16:21:58.398476+00:00
```
and the foo snapshot was created in Ark_Dev-Kube even though the disk is in my-Dev-Kube1.
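The cross-RG restore mentioned above should presumably be the mirror image of this; a minimal sketch, assuming the snapshot created above (the disk name is a placeholder, and this half is untested here, unlike the output above):

```sh
# Hypothetical restore: create a managed disk from the cross-RG snapshot,
# back in the cluster's resource group. The disk name is a placeholder.
az disk create \
  --name devkube1-restored-pvc \
  --resource-group my-Dev-Kube1 \
  --source '/subscriptions/xxx/resourceGroups/Ark_Dev-Kube/providers/Microsoft.Compute/snapshots/foo'
```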
For reference, this is the current Ark Backup Replication design.
We've created a document of scenarios that we'll use to inform the design decisions for this project.
We also have a document where we're discussing more detailed changes to the Ark codebase from which we'll generate a list of specific work items.
Members of the [email protected] Google group have comment access to both of these documents, for anyone who would like to share their thoughts.
Hello, any updates on this? I have quite a few customers interested in using Ark for DR of MD PVs.
@knee-berts no major updates here; we're actively working towards a v1.0 release and this issue will be tackled after that.
We'd definitely be interested in hearing details of your customers' needs so we can make sure that what we're planning on implementing lines up!
When restoring a backup, we don't know the new cluster's availability zone in AWS. Since AWS does not support attaching EBS volumes to an EC2 node in a different availability zone, we're forced to create the new cluster in the same availability zone as the original. As we'd like to get rid of this requirement, I'm looking forward to this issue being fixed.
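For context, EBS snapshots are regional rather than zonal, so the raw building blocks for a cross-AZ or cross-region restore already exist in the AWS CLI; a minimal sketch with placeholder IDs:

```sh
# Illustrative only: a snapshot taken from a volume in us-east-1a can seed a
# new volume in any AZ of the region. The snapshot ID is a placeholder.
aws ec2 create-volume \
  --availability-zone us-east-1b \
  --snapshot-id snap-0123456789abcdef0

# For another region, copy the snapshot first, then create the volume there.
aws ec2 copy-snapshot \
  --source-region us-east-1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --region us-west-2
```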
Similar scenario for us, I think, and we are using the following manual workaround:
```sh
# Make a backup on the first cluster
kubectx my-first-cluster
velero backup create my-backup

# Switch to the new cluster and restore the backup
kubectx my-second-cluster
velero restore create --from-backup my-backup

# Find the restored disk name
gcloud config configurations activate my-second-project
gcloud compute disks list

# Move the disk to the necessary zone ($SECOND_CLUSTER_ZONE is a placeholder
# for the new cluster's zone)
gcloud compute disks move restore-xyz --destination-zone "$SECOND_CLUSTER_ZONE"

# Ensure the PV is set to use the Retain reclaim policy, then delete the old resources
kubectl patch pv mongo-volume-mongodb-0 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
kubectl delete statefulset mongodb
kubectl delete pvc mongo-volume-mongodb-0

# Recreate the restored stateful set with references to the new volume
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  selector:
    matchLabels:
      app: mongodb
  serviceName: "mongodb"
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongo
          image: mongo
          command:
            - mongod
            - "--bind_ip"
            - 0.0.0.0
            - "--smallfiles"
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: mongo-volume
              mountPath: /data/db
  volumeClaimTemplates:
    - metadata:
        name: mongo-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
        storageClassName: ""
        volumeName: "mongo-volume-mongodb-0"
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongo-volume-mongodb-0
spec:
  storageClassName: ""
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: "restore-xyz"
    fsType: ext4
EOF
```
Hi, is there any ETA for this? Being able to use a backup to recover from an AZ failure sounds like a basic feature.
https://docs.google.com/document/d/1vGz53OVAPynrgi5sF0xSfKKr32NogQP-xgXA1PB6xMc/edit#heading=h.yuq6zfblfpvs sounded promising
@jujugrrr we have cross-AZ/region backup & restore on our roadmap. If you're interested in contributing in any way (requirements, design work, etc), please let us know!
cc @stephbman
You don't need backup replication to get multi-zone and multi-region support for GCP/GKE: the Kubernetes VolumeSnapshot beta support in Velero v1.4 covers it. See https://github.com/vmware-tanzu/velero/issues/1624#issuecomment-671061689
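For anyone trying that route, a minimal sketch of turning on the beta CSI snapshot support; the `EnableCSI` feature flag is the documented mechanism, while the plugin versions, bucket, and credentials path below are placeholders:

```sh
# Install Velero with the CSI plugin and the EnableCSI feature flag
# (plugin versions, bucket, and credentials path are placeholders).
velero install \
  --provider gcp \
  --plugins velero/velero-plugin-for-gcp:v1.1.0,velero/velero-plugin-for-csi:v0.1.1 \
  --bucket my-velero-bucket \
  --secret-file ./credentials-velero \
  --features=EnableCSI

# The client needs the feature flag as well:
velero client config set features=EnableCSI
```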