Velero: Restic backup fails with Fatal: invalid id "8bcb4ae8": no matching ID found

Created on 6 May 2020 · 12 comments · Source: vmware-tanzu/velero

What steps did you take and what happened:
We have a daily schedule that does restic backup of around 120 volumes to s3.
Only some of the volumes are successfully backed up. Let's say out of 120, 67 succeed and the rest fail with the following error:

Fatal: invalid id "8bc4dae8": no matching ID found

The same happens when I trigger one of the failed backups manually. As far as I can see there is no difference between a PVC that succeeds and one that fails.

What did you expect to happen:
All volumes are backed up successfully.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add:
I can provide more logs if needed

Environment:

  • Velero version (use velero version): 1.2.0
  • Velero features (use velero client config get features): -
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:07:57Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Labels: Needs info, Needs investigation, Restic


All 12 comments

@VincenzoDo do you have more than one BackupStorageLocation that you're using, i.e. does velero backup-location get show more than one item?

You may be hitting some combination of https://github.com/vmware-tanzu/velero/issues/2192 and https://github.com/vmware-tanzu/velero/issues/2185, both of which have been fixed in the v1.3 series. I'd recommend starting by upgrading your Velero deployment & daemonset to v1.3.2.
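For reference, the upgrade amounts to pointing the existing server deployment and restic daemonset at the new image tag. This is a sketch assuming the default resource names (`velero` and `restic`) in the `velero` namespace:

```shell
# Swap the image tag on the Velero server deployment and the restic
# daemonset (default names and namespace assumed).
kubectl set image deployment/velero velero=velero/velero:v1.3.2 --namespace velero
kubectl set image daemonset/restic restic=velero/velero:v1.3.2 --namespace velero
```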

Hello @skriss, no just one:

velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX       ACCESS MODE
default   aws        prometheus-backup   ReadWrite

Ok I will start with the version upgrade and get back to you, thanks

I updated both images to the latest version but the problem persists, same errors. Any other suggestions? Is there a place where I can see a more detailed error than:
stderr=Fatal: invalid id \"47462d79\": no matching ID found\n: unable to find summary in restic backup command output
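For anyone looking for more detail than that one-line error: each per-volume restic backup is tracked by a PodVolumeBackup custom resource, and the restic daemonset pods log the actual restic invocations. A sketch, assuming the default `velero` namespace and the default `name=restic` pod label:

```shell
# List the per-volume backup records; failed ones carry the restic
# stderr in their status.
kubectl -n velero get podvolumebackups
kubectl -n velero get podvolumebackup <name> -o yaml

# Logs from the restic daemonset pods show the underlying restic commands.
kubectl -n velero logs -l name=restic
```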

@VincenzoDo Does this https://github.com/vmware-tanzu/velero/issues/2109#issuecomment-562551041 apply to your case here?

Hi, not really. As far as I understood, that comment says that all backups are failing, while in my case it's only about 40% of them. I'm off work this week; when I'm back I will try to re-install velero...

I had this issue with a couple of the pods and it was only resolved with a full reset of velero, removed all crds, bucket etc. I had upgraded from 1.2 to 1.3 previously. So far it hasn't reoccurred.

@rust84 Thank you for confirming.
@VincenzoDo Yeah, try re-installing and tell us if you still see this issue.

So I re-deployed velero. I used the commands suggested in the docs to delete all deployments and custom resource definitions.

The first backup I took of all statefulsets (and PVCs) had around 20 failures. After seeing a timeout-related error in the logs I increased the timeout (`--restic-timeout=6h`).
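The timeout is a flag on the Velero server container. One way to set it (a sketch, assuming the default `velero` deployment in the `velero` namespace, and that the args array already exists on the first container):

```shell
# Append --restic-timeout=6h to the velero server container's args via a
# JSON patch; the deployment rolls out a new pod with the flag applied.
kubectl -n velero patch deployment/velero --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--restic-timeout=6h"}]'
```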

The second backup I took had only 2 failures (with the usual error):

time="2020-05-24T23:36:27Z" level=error msg="Error backing up item" backup=velero/daily-20200524230041 error="pod volume backup failed: error running restic backup, stderr=Fatal: invalid id \"ce83b646\": no matching ID found\n: unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:182" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:287" name=prometheus-0 namespace= resource=pods

Today it's again up to 23 failures.

Might be related to https://github.com/vmware-tanzu/velero/issues/2539

Should I try a deployment with debug log level? Is there maybe a retry option that can be enabled so that velero or restic retries the failed backups?

Upgraded to version 1.4 but still getting errors:

time="2020-06-18T02:10:42Z" level=error msg="Error backing up item" backup=velero/daily-20200617230018 error="pod volume backup failed: error running restic backup, stderr=: unable to find summary in restic backup command output" error.file="/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:179" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:448" name=prometheus-0

@VincenzoDo Can you please tell us what provider you are using to provision volumes?
Can you also please try the troubleshooting steps from https://github.com/vmware-tanzu/velero/issues/2539#issuecomment-641532094, https://github.com/vmware-tanzu/velero/issues/2539#issuecomment-655176937, and https://github.com/vmware-tanzu/velero/issues/2539#issuecomment-655659903.
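One more way to rule out repository corruption is to run restic against the repository directly. This is a sketch under assumptions: the bucket name comes from the thread, Velero keeps one repo per namespace under `<bucket>/restic/<namespace>`, and the repo password lives in the `velero-restic-credentials` secret (key names may differ by version):

```shell
# Use the same object-store credentials Velero uses.
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
# Repo password from the Velero-managed secret (name/key are assumptions).
export RESTIC_PASSWORD=$(kubectl -n velero get secret velero-restic-credentials \
  -o jsonpath='{.data.repository-password}' | base64 -d)

# Verify repository integrity and list snapshots for the namespace.
restic -r s3:s3.amazonaws.com/prometheus-backup/restic/<namespace> check
restic -r s3:s3.amazonaws.com/prometheus-backup/restic/<namespace> snapshots
```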

I am closing this issue for inactivity. But please re-open or reach out if you have more questions or need help troubleshooting.
