What steps did you take and what happened:
I am having an issue where backups of volumes with Restic take more or less the same time even though basically no data has changed in between two backups. For example between tests/experiments I am backing up a namespace with a Nextcloud installation that has around 25 GB of data in a volume, and backups do not seem to be performed incrementally.
I've also noticed by describing a backup with the --details parameter that the name of the pod is included in the Restic backup summary, e.g.:
Restic Backups:
Completed:
nextcloud/nextcloud-7d9dcb445b-7wk29: nextcloud-data, nextcloud-html
So I am wondering, if the name of the pod using a volume changes for example due to a restart or similar, will Restic perform a full backup again instead of an incremental backup even if the volume hasn't actually changed? Would be weird and make incremental backups less useful.
What did you expect to happen:
I would expect incremental backups of volumes to be very quick if not much data has changed since the last backup.
The output of the following commands will help us better understand what's going on:
`kubectl logs deployment/velero -n velero`
Most recent logs for the latest backup: https://ybin.me/p/d13357ceab3eb47f#sd27nkpJ8QVRE/EizAnFXNchF9H47Aqhs3Rkw7/a0dA=
`velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`:

apiVersion: velero.io/v1
kind: Backup
metadata:
  creationTimestamp: "2019-05-03T16:37:09Z"
  generation: 3
  labels:
    velero.io/storage-location: default
  name: nextcloud-15.0.5-before-upgrade
  namespace: velero
  resourceVersion: "367563"
  selfLink: /apis/velero.io/v1/namespaces/velero/backups/nextcloud-15.0.5-before-upgrade
  uid: b22e48e1-6dc1-11e9-a45c-960000254b90
spec:
  excludedNamespaces: null
  excludedResources: null
  hooks:
    resources: null
  includeClusterResources: null
  includedNamespaces:
  - nextcloud
  includedResources: null
  labelSelector: null
  storageLocation: default
  ttl: 720h0m0s
  volumeSnapshotLocations: null
status:
  completionTimestamp: "2019-05-03T16:54:12Z"
  expiration: "2019-06-02T16:37:09Z"
  phase: Completed
  startTimestamp: "2019-05-03T16:37:09Z"
  validationErrors: null
  version: 1
  volumeSnapshotsAttempted: 0
  volumeSnapshotsCompleted: 0
`velero backup logs <backupname>`
https://ybin.me/p/8bce49708d0abd86#OxyL+CKIrp8TLLBnDRnihlkIBdvh4wtSjD2zwg312HE=
Environment:
`velero version`:
Client:
  Version: 0.11.0
  Git commit: -
Server:
  Version: v0.11.0
`kubectl version`:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-26T00:04:52Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1-k3s.4", GitCommit:"52f3b42401c93c36467f1fd6d294a3aba26c7def", GitTreeState:"clean", BuildDate:"2019-04-15T22:13+00:00Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes installer & version: Rancher K3S 0.4.0
Cloud provider or hardware configuration: Hetzner Cloud
OS (e.g. from `/etc/os-release`): Ubuntu 18.04
I can confirm that that is the problem. I took two backups making sure the pod was up and running and always the same for both backups as for the previous backup, and both backups were definitely incremental and very quick. Then I deleted the pod so it would be recreated with a different name, and took another backup; it's now definitely performing a full backup rather than incremental.
Is it on the roadmap to remove this limitation? What are possible approaches? Instead of relying on the temporary pod name, use the deployment + container + mount name as the ID? Or maybe completely decouple the restic volume backup from the pod and save a mapping instead? Or something along those lines?
We just installed velero and I was surprised by this limitation. I feel like I'm missing what the use case for restic is in this case? Very small PVs?
We will definitely be looking at this issue in the upcoming releases. @abh if your pods are not being rescheduled very often, then most of the time you'll still get the incremental backup behavior.
Some notes/thoughts here:
We currently rely on restic to determine the "parent snapshot" for a new backup. Restic looks for the most recent snapshot of the same target directory, taken from the same host.
For velero, the target directory looks something like: /host_pods/<workload-pod-uid>/volumes/<volume-plugin>/<volume-or-pvc-name>; and the host name is always "velero".
The issue here is that the <workload-pod-uid> changes if a new pod gets created, so restic doesn't detect a parent snapshot, even if we're backing up a PVC we've already backed up.
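The failed parent lookup can be illustrated by comparing the target paths restic sees for the same PVC before and after a pod restart (the pod UIDs and volume plugin below are made-up examples):

```shell
#!/bin/sh
# Same PVC, same volume plugin, but the pod UID path segment differs after
# a restart (both UIDs here are invented for illustration):
pvc="nextcloud-data"
before="/host_pods/7d9dcb44-aaaa-bbbb-cccc-000000000001/volumes/kubernetes.io~csi/${pvc}"
after="/host_pods/7d9dcb44-aaaa-bbbb-cccc-000000000002/volumes/kubernetes.io~csi/${pvc}"
# restic keys its parent-snapshot lookup on (hostname, absolute target path);
# a changed path means no parent is found, so every file gets re-read:
if [ "$before" = "$after" ]; then
  echo "parent snapshot found: incremental scan"
else
  echo "no parent snapshot: full re-read"
fi
```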
I think the easiest fix to make here is:
- If we're backing up a PVC, instead of backing up `/host_pods/<workload-pod-uid>/volumes/<volume-plugin>/<volume-or-pvc-name>`, we should set the working directory to `/host_pods/<workload-pod-uid>/volumes/<volume-plugin>`, and then set the backup target to just `<volume-or-pvc-name>`. This way, from restic's perspective, the backup target dir won't change over time even if the pod using it changes.
- Specify a `--parent` snapshot the first time we take a new PVC backup using this scheme, that links back to the last backup of that PVC using the old path.
- With smaller per-volume repos, `restic prune` will probably run faster. The downside is that we will end up with many more restic repos to manage. Need to think about whether this is actually worthwhile, or not.

I've been using a statefulset for some things so that the names of the pods are stable and Restic does incremental backups, but with normal deployments with multiple replicas this would be a very welcome change :)
I've been looking into this some more, found some more detail on the restic docs:
(from https://restic.readthedocs.io/en/stable/040_backup.html):
Please be aware that when you backup different directories (or the directories to be saved have a variable name component like a time/date), restic always needs to read all files and only afterwards can compute which parts of the files need to be saved. When you backup the same directory again (maybe with new or changed files) restic will find the old snapshot in the repo and by default only reads those files that are new or have been modified since the last snapshot.
So it appears that the issue is not that we're not getting incremental backups, but that restic needs to re-read all files to determine what needs to be backed up, rather than just the ones that have changed. I was able to confirm this behavior - taking a second backup of a PV after the pod using it was rescheduled did result in a slower backup, but the overall size of the restic repo did not change.
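In other words, the cost is scan time, not storage: restic's content-addressed deduplication recognizes unchanged chunks even during a "full" re-read. A toy sketch of the idea, using `cksum` as a stand-in for restic's content hashing:

```shell
#!/bin/sh
# Toy model: identical content hashes to the same ID, so a second "full"
# backup re-reads the file but adds nothing new to the repository.
tmp=$(mktemp -d)
printf 'unchanged Nextcloud data\n' > "$tmp/file"
first=$(cksum "$tmp/file" | cut -d' ' -f1)    # read during backup 1
second=$(cksum "$tmp/file" | cut -d' ' -f1)   # re-read during backup 2
if [ "$first" = "$second" ]; then
  echo "chunk already in repo: re-read, but not re-uploaded"
fi
rm -rf "$tmp"
```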
The best solution still seems to be using a backup target path that does not include the pod's UID if we're backing up a PVC.
if we're backing up a PVC, instead of backing up `/host_pods/<workload-pod-uid>/volumes/<volume-plugin>/<volume-or-pvc-name>`, we should set the working directory to `/host_pods/<workload-pod-uid>/volumes/<volume-plugin>`, and then set the backup target to just `<volume-or-pvc-name>`.
I had forgotten how restic deals with paths - this approach is probably not going to work, since restic will derive the absolute path to the directory being backed up. Need to do some more thinking on how to get a path without the pod UID embedded.
Pretty sure we can accomplish this using symlinks but I need to do some more experimentation to confirm for sure.
Tested with restic locally and it doesn't try to expand symlinks
After creating a symlink to a directory, I found that if I (1) ran `cd /to/symlink`, then (2) ran `restic backup .`, I got all of the contents of the referenced directory, and the directory itself was recorded as `/to/symlink` rather than the referenced directory. I believe this gives us what we want (though I still need to confirm restic didn't actually have to read all the files on a second backup).
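The logical-vs-physical path distinction behind this experiment can be reproduced with plain shell, no restic needed (directory names and the pod UID are made up):

```shell
#!/bin/sh
set -e
root=$(mktemp -d)
# Layout mimicking the kubelet pod directory (made-up pod UID):
mkdir -p "$root/host_pods/pod-uid-1/volumes/plugin/nextcloud-data"
# A stable symlink that hides the pod UID:
ln -s "$root/host_pods/pod-uid-1/volumes/plugin/nextcloud-data" "$root/pvc-nextcloud-data"
cd "$root/pvc-nextcloud-data"
pwd -L   # logical path: .../pvc-nextcloud-data (what we'd want recorded)
pwd -P   # physical path: still contains pod-uid-1 (what restic may resolve)
```

Whether the backed-up path stays stable therefore hinges on whether restic records the logical working directory or resolves it to the physical one.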
Hmm, seeing different behavior when doing a POC of this in the velero restic pods. Continuing to look into this.
It looks like if we specify --parent when doing a restic backup, we can avoid a full rescan even if the directory we're in has changed (this assumes we are setting our working dir to the directory we're backing up before running restic backup .). So, we can do something like the following:
- label `PodVolumeBackups` with the PVC they're for (if applicable) at creation time
- when processing `PodVolumeBackups` for PVCs, look for the most recent `PodVolumeBackup` for that PVC
- use that backup's snapshot ID as the `--parent` flag when running `restic backup` for the current `PodVolumeBackup`.

@skriss that sounds like a good approach to me. Would it be possible to look up that snapshot ID directly with restic (e.g. if we looked up `restic snapshots`, looking for a path that includes the same volume name)?
It's definitely possible; restic supports putting tags on snapshots (we already use this) so we'd probably want to tag snapshots with PVC name/UID and then use that to look up the last snapshot ID.
That sounds like it might be a better approach than having to go through PVBs to find the latest one for a PVC.
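Sketching the tag-based lookup as restic invocations: `--tag`, `--latest`, `--json`, and `--parent` are existing restic flags, but the `pvc-uid` tag name and the UID value are assumptions for illustration, and the commands are only printed here, not executed:

```shell
#!/bin/sh
pvc_uid="b22e48e1-6dc1-11e9-a45c-960000254b90"  # made-up example UID
# 1. Find the most recent snapshot tagged with this PVC's UID:
lookup="restic snapshots --json --latest 1 --tag pvc-uid=${pvc_uid}"
# 2. Back up, tagging the new snapshot and linking it to that parent:
backup="restic backup . --tag pvc-uid=${pvc_uid} --parent <parent-snapshot-id>"
echo "$lookup"
echo "$backup"
```

Because the tag travels with the snapshot rather than with the path, this lookup would survive pod rescheduling regardless of where the volume is mounted.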