Describe the problem/challenge you have
There are sometimes transient network issues that can prevent a scheduled backup from completing. It would be nice to have a way to automatically retry a PartiallyFailed backup.
Describe the solution you'd like
A field in the backup spec to specify the number of times a backup should be retried. A field in the status to describe how many runs have been attempted. If runs < attempts, then move a PartiallyFailed backup back into New and let the backup reattempt.
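To make the shape of this concrete, here is a minimal sketch of what such spec/status fields could look like, assuming the maxRetries/retryAttempts names proposed later in this thread; this is illustrative only and not part of Velero's actual API:

```go
// Sketch only: hypothetical additions to the Backup API types, using the
// field names proposed later in this thread. These fields do not exist in
// Velero today.
package v1

type BackupSpec struct {
	// ... existing fields elided ...

	// MaxRetries is how many times a PartiallyFailed backup should be
	// retried automatically. 0 (the default) means never retry.
	// +optional
	MaxRetries int `json:"maxRetries,omitempty"`
}

type BackupStatus struct {
	// ... existing fields elided ...

	// RetryAttempts records how many retries have already been made.
	// +optional
	RetryAttempts int `json:"retryAttempts,omitempty"`
}
```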
Environment:
velero version: 1.3.2

We see a _lot_ of flaky backups and this feature would be really helpful:
time="2020-05-05T02:04:22Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unknown desc = error putting block 00000000: Put https://redactedaccount.blob.core.windows.net/redactedcontianer/backups/hourly-k8s-backup-20200505020014/hourly-k8s-backup-20200505020014.tar.gz?blockid=00000000&comp=block: write tcp 172.16.42.216:38834->52.239.207.100:443: write: connection reset by peer" key=velero/hourly-k8s-backup-20200505020014 logSource="pkg/controller/backup_controller.go:265"
A few more examples that worked fine after creating a new Backup:
time="2020-05-11T18:18:23Z" level=error msg="Error listing items" backup=openshift-velero/hourly-object-backup-20200511181749 error="unexpected error when reading response body. Please retry. Original error: http2: server sent GOAWAY and closed the connection; LastStreamID=487, ErrCode=NO_ERROR, debug=\"\"" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/resource_backupper.go:229" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*defaultResourceBackupper).backupResource" group=v1 logSource="pkg/backup/resource_backupper.go:229" namespace= resource=secrets
time="2020-05-11T20:18:40Z" level=error msg="Error listing items" backup=openshift-velero/hourly-object-backup-20200511201749 error="Get https://10.120.0.1:443/apis/apps/v1/replicasets: unexpected EOF" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/resource_backupper.go:229" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*defaultResourceBackupper).backupResource" group=apps/v1 logSource="pkg/backup/resource_backupper.go:229" namespace= resource=replicasets
Same here; we are also seeing similar issues that are fixed simply by retrying.
Just to add to the party, another error:
time="2020-10-07T01:01:38Z" level=error msg="Error listing items" error="etcdserver: leader changed" error.file="/github.com/vmware-tanzu/velero/pkg/backup/item_collector.go:234" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemCollector).getResourceItems" group=storage.k8s.io/v1 logSource="pkg/backup/item_collector.go:234" namespace= resource=volumeattachments
Hello Velero team (@carlisia),
We saw this issue and want to have a go at implementing the request. It appears to be a pretty straightforward request:
- Add a spec.maxRetries field to the Backups CRD
- Add a status.retryAttempts field to the Backups CRD
- Move a backup back to New if it is in the state PartiallyFailed, and increment retryAttempts until it reaches maxRetries

Do these sound like reasonable changes?
Thanks,
Clay and @astrieanna
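A rough, self-contained sketch of the state transition proposed above, with simplified stand-ins for the real Backup type and phase constants (assumptions for illustration, not Velero's actual controller code):

```go
package main

import "fmt"

// Simplified stand-in for the real Velero Backup type (assumption, not the actual API).
type Backup struct {
	Spec   struct{ MaxRetries int }
	Status struct {
		Phase         string
		RetryAttempts int
	}
}

const (
	phaseNew             = "New"
	phasePartiallyFailed = "PartiallyFailed"
)

// maybeRetry moves a PartiallyFailed backup back to New while retries remain,
// incrementing the retry counter. It reports whether a retry was scheduled.
func maybeRetry(b *Backup) bool {
	if b.Status.Phase != phasePartiallyFailed {
		return false
	}
	if b.Status.RetryAttempts >= b.Spec.MaxRetries {
		return false
	}
	b.Status.RetryAttempts++
	b.Status.Phase = phaseNew
	return true
}

func main() {
	var b Backup
	b.Spec.MaxRetries = 3
	b.Status.Phase = phasePartiallyFailed
	fmt.Println(maybeRetry(&b), b.Status.Phase, b.Status.RetryAttempts) // true New 1
}
```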
Hello again Velero team,
@astrieanna and I poked around at this some more, and we realized it was not as simple as we thought.
As a small recap: we got it to the point where it makes a second attempt, but the retry then fails because the backup already exists in object storage.
Here, we realized we just aren't familiar enough with the backup process yet 😅
With partial failures, is it okay that a backup already exists? Or should we clean everything up and try again? Do you have different kinds of partial failures that would mean different things?
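For what it's worth, the "clean everything up and try again" option being asked about could look roughly like the sketch below; the ObjectStore interface and the backups/&lt;name&gt;/ key layout are simplifications made up for illustration, not Velero's actual plugin API or storage layout:

```go
// Illustration only: pre-retry cleanup of a partially uploaded backup so the
// second attempt does not fail because the backup already exists in storage.
package retrysketch

import "fmt"

// ObjectStore is a deliberately minimal stand-in, not Velero's plugin interface.
type ObjectStore interface {
	ListObjects(bucket, prefix string) ([]string, error)
	DeleteObject(bucket, key string) error
}

// cleanupBeforeRetry deletes everything under the backup's prefix before a retry.
func cleanupBeforeRetry(store ObjectStore, bucket, backupName string) error {
	prefix := "backups/" + backupName + "/"
	keys, err := store.ListObjects(bucket, prefix)
	if err != nil {
		return fmt.Errorf("listing %s: %w", prefix, err)
	}
	for _, key := range keys {
		if err := store.DeleteObject(bucket, key); err != nil {
			return fmt.Errorf("deleting %s: %w", key, err)
		}
	}
	return nil
}
```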
Closing because the workaround is to rerun the backup.
@eleanor-millman Yes, that's a workaround, but having a workaround doesn't actually negate the feature request. Manually retrying a backup requires human intervention. This feature request is to allow the user to tell Velero, "in the case that this fails, please retry up to X many times".
On some of our larger clusters, retrying a backup from scratch can take many minutes. I'd rather Velero spend a few seconds retrying the upload.
@cblecker Agreed that the workaround isn't as good as having the feature implemented. Unfortunately, the Velero team is stretched way too thin to implement most of the feature requests that come in. Our issue backlog had grown to 411 issues as of last week, and we felt that wasn't great for the community, both for requesters like you whose requests are sitting in limbo and for Velero developers who have this mountain of issues to stare at. We are currently triaging them into issues we hope to tackle in 1.7/1.8 (i.e. in 2021), issues we hope to tackle in 2.0 (sometime in 2022 and perhaps 2023), and then the rest. We are closing the rest for now, since we feel that issues that won't be tackled for two years are not helped by sitting open, in limbo. While not a perfect decision mechanism, one of the things that led us to prioritize some issues over others was whether a workaround existed.
That being said, this triage concerns work that the maintainers and active contributors will be doing. If you or anyone else who commented on this issue is able to make a PR, we are delighted to chat with you about the possible design of the feature and are trying to prioritize PR review (something we were falling behind on with too much on our plates). So please let us know if this is something that interests you, and whether you'd like a suggestion on how to implement it.