Velero: Restic seemingly taking full backups each time despite no data having changed

Created on 3 May 2019 · 16 comments · Source: vmware-tanzu/velero

What steps did you take and what happened:

I am having an issue where Restic backups of volumes take more or less the same time even though essentially no data has changed between two consecutive backups. For example, between tests/experiments I am backing up a namespace with a Nextcloud installation that has around 25 GB of data in a volume, and backups do not seem to be performed incrementally.

I've also noticed by describing a backup with the --details parameter that the name of the pod is included in the Restic backup summary, e.g.:

Restic Backups:
  Completed:
    nextcloud/nextcloud-7d9dcb445b-7wk29: nextcloud-data, nextcloud-html

So I am wondering: if the name of the pod using a volume changes, for example due to a restart, will Restic perform a full backup again instead of an incremental one even if the volume hasn't actually changed? That would be odd and would make incremental backups much less useful.

What did you expect to happen:

I would expect incremental backups of volumes to be very quick if not much data has changed since the last backup.

The output of the following commands will help us better understand what's going on:

  • kubectl logs deployment/velero -n velero

Most recent logs for the latest backup: https://ybin.me/p/d13357ceab3eb47f#sd27nkpJ8QVRE/EizAnFXNchF9H47Aqhs3Rkw7/a0dA=

  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  creationTimestamp: "2019-05-03T16:37:09Z"
  generation: 3
  labels:
    velero.io/storage-location: default
  name: nextcloud-15.0.5-before-upgrade
  namespace: velero
  resourceVersion: "367563"
  selfLink: /apis/velero.io/v1/namespaces/velero/backups/nextcloud-15.0.5-before-upgrade
  uid: b22e48e1-6dc1-11e9-a45c-960000254b90
spec:
  excludedNamespaces: null
  excludedResources: null
  hooks:
    resources: null
  includeClusterResources: null
  includedNamespaces:
  - nextcloud
  includedResources: null
  labelSelector: null
  storageLocation: default
  ttl: 720h0m0s
  volumeSnapshotLocations: null
status:
  completionTimestamp: "2019-05-03T16:54:12Z"
  expiration: "2019-06-02T16:37:09Z"
  phase: Completed
  startTimestamp: "2019-05-03T16:37:09Z"
  validationErrors: null
  version: 1
  volumeSnapshotsAttempted: 0
  volumeSnapshotsCompleted: 0

  • velero backup logs <backupname>

https://ybin.me/p/8bce49708d0abd86#OxyL+CKIrp8TLLBnDRnihlkIBdvh4wtSjD2zwg312HE=

Environment:

  • Velero version (use velero version):
Client:
    Version: 0.11.0
    Git commit: -
Server:
    Version: v0.11.0
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-26T00:04:52Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1-k3s.4", GitCommit:"52f3b42401c93c36467f1fd6d294a3aba26c7def", GitTreeState:"clean", BuildDate:"2019-04-15T22:13+00:00Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version:

Rancher K3S 0.4.0

  • Cloud provider or hardware configuration:

Hetzner Cloud

  • OS (e.g. from /etc/os-release):

Ubuntu 18.04

Labels: P1 - Important · Restic · Restic - GA

All 16 comments

I can confirm that this is the problem. I took two backups while making sure the pod stayed up and running and remained the same pod as for the previous backup, and both backups were definitely incremental and very quick. Then I deleted the pod so it would be recreated with a different name and took another backup; it is now definitely performing a full backup rather than an incremental one.

Is it on the roadmap to remove this limitation? What are the possible approaches? Instead of relying on the temporary pod name, could the deployment + container + mount name be used as the ID? Or maybe completely decouple the Restic volume backup from the pod, and just save a mapping?

We just installed Velero and I was surprised by this limitation. I feel like I'm missing what the use case for Restic is in this situation. Very small PVs?

We will definitely be looking at this issue in upcoming releases. @abh, if your pods are not being rescheduled very often, then most of the time you'll still get the incremental backup behavior.

Some notes/thoughts here:

We currently rely on restic to determine the "parent snapshot" for a new backup. Restic looks for the latest snapshot for the same target directory, taken from the same host.

For velero, the target directory looks something like: /host_pods/<workload-pod-uid>/volumes/<volume-plugin>/<volume-or-pvc-name>; and the host name is always "velero".

The issue here is that the <workload-pod-uid> changes if a new pod gets created, so restic doesn't detect a parent snapshot, even if we're backing up a PVC we've already backed up.
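
To make this concrete, here's a minimal sketch of the failure mode (the repository settings, pod UIDs, and volume plugin are made-up placeholders, not taken from this cluster):

# Hypothetical illustration; pod UIDs and repository settings are made up.
export RESTIC_REPOSITORY=s3:s3.example.com/bucket/restic/nextcloud
export RESTIC_PASSWORD=<repo-password>

# First backup: the pod using the PVC has UID aaaa-1111.
restic backup --host velero /host_pods/aaaa-1111/volumes/<volume-plugin>/nextcloud-data

# The pod is rescheduled; same PVC, but the new pod has UID bbbb-2222.
# The target path differs, so restic finds no parent snapshot and must
# re-read every file (unchanged data is still deduplicated in the repo).
restic backup --host velero /host_pods/bbbb-2222/volumes/<volume-plugin>/nextcloud-data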

I think the easiest fix to make here is:

  • if we're backing up a PVC, instead of backing up /host_pods/<workload-pod-uid>/volumes/<volume-plugin>/<volume-or-pvc-name>, we should set the working directory to /host_pods/<workload-pod-uid>/volumes/<volume-plugin>, and then set the backup target to just <volume-or-pvc-name>. This way, from restic's perspective, the backup target directory won't change over time even if the pod using it changes (see the sketch after this list).

    • when we make this change, it will break the incremental backup chain for all existing PVC backups. we could look at adding code to explicitly specify the --parent snapshot the first time we take a new PVC backup using this scheme, that links back to the last backup of that PVC using the old path.

    • we could also consider moving PVC backups into their own restic repositories (one repo per PVC). The benefit of this is less repository lock contention, and things like restic prune will probably run faster. The downside is that we will end up with many more restic repos to manage. Need to think about whether this is actually worthwhile, or not.

  • For non-PVC pod volumes, I think we want to leave things as-is, since the identity/lifecycle of these volumes really are scoped to a single pod.
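
A minimal sketch of the invocation proposed in the first bullet, using the same placeholder paths as above (note that a follow-up comment below finds restic resolves the target back to an absolute path, so this exact form turned out not to be sufficient):

# Hypothetical sketch of the proposed scheme; paths are placeholders.
cd /host_pods/<workload-pod-uid>/volumes/<volume-plugin>

# The backup target is just the relative volume/PVC directory name, so the
# target argument itself stays stable across pod reschedules.
restic backup --host velero <volume-or-pvc-name>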

I've been using a statefulset for some things so that the names of the pods are stable and Restic does incremental backups, but with normal deployments with multiple replicas this would be a very welcome change :)

I've been looking into this some more and found some more detail in the restic docs:

(from https://restic.readthedocs.io/en/stable/040_backup.html):

Please be aware that when you backup different directories (or the directories to be saved have a variable name component like a time/date), restic always needs to read all files and only afterwards can compute which parts of the files need to be saved. When you backup the same directory again (maybe with new or changed files) restic will find the old snapshot in the repo and by default only reads those files that are new or have been modified since the last snapshot.

So it appears that the issue is not that we're not getting incremental backups, but that restic needs to re-read all files to determine what needs to be backed up, rather than just the ones that have changed. I was able to confirm this behavior - taking a second backup of a PV after the pod using it was rescheduled did result in a slower backup, but the overall size of the restic repo did not change.
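
One way to verify that, given direct access to the repository, is to compare the repository's deduplicated data size before and after the second backup (this assumes RESTIC_REPOSITORY and RESTIC_PASSWORD are set as above):

# Hypothetical check; raw-data mode reports the deduplicated size of the
# blobs stored in the repository. If the second backup only re-read files
# without storing new data, this number barely changes.
restic stats --mode raw-data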

The best solution still seems to be using a backup target path that does not include the pod's UID if we're backing up a PVC.

if we're backing up a PVC, instead of backing up /host_pods/<workload-pod-uid>/volumes/<volume-plugin>/<volume-or-pvc-name>, we should set the working directory to /host_pods/<workload-pod-uid>/volumes/<volume-plugin>, and then set the backup target to just <volume-or-pvc-name>.

I had forgotten how restic deals with paths - this approach is probably not going to work, since restic will derive the absolute path to the directory being backed up. Need to do some more thinking on how to get a path without the pod UID embedded.

Pretty sure we can accomplish this using symlinks but I need to do some more experimentation to confirm for sure.

Tested with restic locally and it doesn't try to expand symlinks.

After creating a symlink to a directory, I found that if I (1) did a cd /to/symlink; (2) did a restic backup ., I got all of the contents of the referenced directory, and the directory itself was recorded as /to/symlink rather than the referenced directory. I believe this gives us what we want (though I still need to confirm restic didn't actually have to read all the files on a second backup).
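
A quick local reproduction of that experiment might look like the following (directory names are made up; note the next comment reports different behavior inside the velero restic pods):

# Hypothetical local experiment; directory names are made up.
mkdir -p /data/real-volume /backups
ln -s /data/real-volume /backups/my-volume

cd /backups/my-volume
restic backup .

# Per the observation above, the snapshot's path is recorded as
# /backups/my-volume rather than /data/real-volume.
restic snapshots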

Hmm, seeing different behavior when doing a POC of this in the velero restic pods. Continuing to look into this.

It looks like if we specify --parent when doing a restic backup, we can avoid a full rescan even if the directory we're in has changed (this assumes we are setting our working dir to the directory we're backing up before running restic backup .). So, we can do something like the following:

  • label PodVolumeBackups with the PVC they're for (if applicable) at creation time
  • when processing PodVolumeBackups for PVCs, look for the most recent PodVolumeBackup for that PVC
  • use that PVB's snapshot ID as the value of the --parent flag when running restic backup for the current PodVolumeBackup (see the sketch after this list).
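
A minimal sketch of that last step, assuming the parent snapshot ID has already been looked up (the paths and snapshot ID are placeholders):

# Hypothetical sketch; paths and snapshot ID are placeholders.
cd /host_pods/<new-pod-uid>/volumes/<volume-plugin>/<volume-or-pvc-name>

# Passing the previous PVC snapshot explicitly lets restic skip re-reading
# unchanged files, even though the absolute path (and with it restic's
# automatic parent detection) changed along with the pod UID.
restic backup . --host velero --parent <parent-snapshot-id>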

@skriss that sounds like a good approach to me. Would it be possible to look up that snapshot ID directly with restic (e.g., by listing restic snapshots and looking for a path that includes the same volume name)?

It's definitely possible; restic supports putting tags on snapshots (we already use this) so we'd probably want to tag snapshots with PVC name/UID and then use that to look up the last snapshot ID.
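
Something along these lines, assuming snapshots get tagged with the PVC UID at backup time (the tag format shown is made up, not Velero's actual tagging scheme):

# Hypothetical tag format; Velero's real tags may differ.
restic backup . --host velero --tag pvc-uid=<pvc-uid>

# Later, look up the most recent snapshot for that PVC to pass as --parent:
restic snapshots --tag pvc-uid=<pvc-uid> --json --latest 1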

That sounds like it might be a better approach than having to go through PVBs to find the latest one for a PVC.
