What steps did you take and what happened:
We have a daily schedule that does a restic backup of around 120 volumes to S3.
Only some of the volumes are backed up successfully. For example, out of 120, 67 succeed and the rest fail with the following error:
Fatal: invalid id \"8bc4dae8\": no matching ID found\n
The same happens when I trigger one of the failed backups manually. As far as I can see, there is no difference between a PVC that succeeds and one that fails.
What did you expect to happen:
All volumes are backed up successfully.
The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
Anything else you would like to add:
I can provide more logs if needed
Environment:
Velero version (use velero version): 1.2.0
Velero features (use velero client config get features): -
Kubernetes version (use kubectl version):
@VincenzoDo do you have more than one BackupStorageLocation that you're using, i.e. does velero backup-location get show more than one item?
You may be hitting some combination of https://github.com/vmware-tanzu/velero/issues/2192 and https://github.com/vmware-tanzu/velero/issues/2185, both of which have been fixed in the v1.3 series. I'd recommend starting by upgrading your Velero deployment & daemonset to v1.3.2.
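If you installed with the default manifests, bumping the images in place should look roughly like this (assuming the velero namespace and the standard deployment/daemonset names):

```
kubectl set image deployment/velero velero=velero/velero:v1.3.2 -n velero
kubectl set image daemonset/restic restic=velero/velero:v1.3.2 -n velero
```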
Hello @skriss, no just one:
velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX       ACCESS MODE
default   aws        prometheus-backup   ReadWrite
Ok, I will start with the version upgrade and get back to you, thanks.
I updated both images to the latest version but the problem persists, with the same errors. Any other suggestions? Is there a place where I can see a more detailed error than:
stderr=Fatal: invalid id \"47462d79\": no matching ID found\n: unable to find summary in restic backup command output"
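For context, the only other places I know to look are the per-volume PodVolumeBackup objects and the restic daemon logs; this is roughly what I have been checking (assuming the default velero namespace and the default name=restic pod label):

```
# per-volume status objects created by velero for each restic backup
kubectl -n velero get podvolumebackups -l velero.io/backup-name=<backupname> -o yaml

# logs from the restic daemon pods
kubectl -n velero logs -l name=restic --tail=200
```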
@VincenzoDo Does this https://github.com/vmware-tanzu/velero/issues/2109#issuecomment-562551041 apply to your case here?
Hi, not really. As far as I understood, that comment describes a case where all backups fail, while in my case it's only about 40% of them. I'm off work this week; when I'm back I will try to re-install Velero.
I had this issue with a couple of the pods and it was only resolved with a full reset of velero, removed all crds, bucket etc. I had upgraded from 1.2 to 1.3 previously. So far it hasn't reoccurred.
@rust84 Thank you for confirming.
@VincenzoDo Yeah, try re-installing and tell us if you still see this issue.
So I re-deployed Velero. I used the commands suggested in the docs to delete all deployments and custom resource definitions.
The first backup I took of all statefulsets (and PVCs) had around 20 failures. After seeing a timeout-related error in the logs, I increased the timeout (--restic-timeout=6h).
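For reference, I set it by appending the flag to the server container args, roughly like this (assuming the velero container is the first container in the deployment):

```
kubectl -n velero patch deployment velero --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--restic-timeout=6h"}]'
```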
The second backup I took had only 2 failures (with the usual error):
time="2020-05-24T23:36:27Z" level=error msg="Error backing up item" backup=velero/daily-20200524230041 error="pod volume backup failed: error running restic backup, stderr=Fatal: invalid id \"ce83b646\": no matching ID found\n: unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:182" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:287" name=prometheus-0 namespace= resource=pods
Today it's again up to 23 failures.
Might be related to https://github.com/vmware-tanzu/velero/issues/2539
Should I try a deployment with debug log level? Is there maybe a retry option that can be enabled so that Velero or restic retries the failed backups?
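If debug logging is worth trying, I assume I would enable it the same way as the timeout flag above, by adding it to the server args:

```
kubectl -n velero patch deployment velero --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--log-level=debug"}]'
```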
Upgraded to version 1.4 but still getting errors:
time="2020-06-18T02:10:42Z" level=error msg="Error backing up item" backup=velero/daily-20200617230018 error="pod volume backup failed: error running restic backup, stderr=: unable to find summary in restic backup command output" error.file="/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:179" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:448" name=prometheus-0
@VincenzoDo Can you please tell us what provider you are using to provision volumes?
Can you also please try the troubleshooting steps from https://github.com/vmware-tanzu/velero/issues/2539#issuecomment-641532094, https://github.com/vmware-tanzu/velero/issues/2539#issuecomment-655176937, and https://github.com/vmware-tanzu/velero/issues/2539#issuecomment-655659903.
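If you want to inspect the restic repository directly, here is a sketch (it assumes AWS credentials in the environment, your prometheus-backup bucket with no prefix, and Velero's default layout of one repo per namespace under restic/; the static password below is the well-known key Velero uses for its restic repos):

```
export RESTIC_REPOSITORY=s3:s3.amazonaws.com/prometheus-backup/restic/<namespace>
export RESTIC_PASSWORD=static-passw0rd

restic snapshots   # list the snapshots velero has created for this namespace
restic check       # verify repo integrity; index problems can surface as "no matching ID found"
```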
I am closing this issue due to inactivity. But please re-open or reach out if you have more questions or need help troubleshooting.