Describe the problem/challenge you have
There are sometimes transient network issues that can prevent a scheduled backup from completing. It would be nice to have a way to automatically retry a PartiallyFailed backup.
Describe the solution you'd like
A field in the backup spec to specify the number of times a backup should be retried. A field in the status to describe how many runs have been attempted. If runs < attempts, then move a PartiallyFailed backup back into New and let the backup reattempt.
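To make the shape of this concrete, here is a minimal sketch of what such spec/status fields could look like, assuming the maxRetries/retryAttempts names proposed later in this thread; this is illustrative only and not part of Velero's actual API:

```go
// Sketch only: hypothetical additions to the Backup API types, using the
// field names proposed later in this thread. These fields do not exist in
// Velero today.
package v1

type BackupSpec struct {
	// ... existing fields elided ...

	// MaxRetries is how many times a PartiallyFailed backup should be
	// retried automatically. 0 (the default) means never retry.
	// +optional
	MaxRetries int `json:"maxRetries,omitempty"`
}

type BackupStatus struct {
	// ... existing fields elided ...

	// RetryAttempts records how many retries have already been made.
	// +optional
	RetryAttempts int `json:"retryAttempts,omitempty"`
}
```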
Environment:
velero version: 1.3.2

We see a _lot_ of flaky backups and this feature would be really helpful:
time="2020-05-05T02:04:22Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unknown desc = error putting block 00000000: Put https://redactedaccount.blob.core.windows.net/redactedcontianer/backups/hourly-k8s-backup-20200505020014/hourly-k8s-backup-20200505020014.tar.gz?blockid=00000000&comp=block: write tcp 172.16.42.216:38834->52.239.207.100:443: write: connection reset by peer" key=velero/hourly-k8s-backup-20200505020014 logSource="pkg/controller/backup_controller.go:265"
A few more examples that worked fine after creating a new Backup:
time="2020-05-11T18:18:23Z" level=error msg="Error listing items" backup=openshift-velero/hourly-object-backup-20200511181749 error="unexpected error when reading response body. Please retry. Original error: http2: server sent GOAWAY and closed the connection; LastStreamID=487, ErrCode=NO_ERROR, debug=\"\"" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/resource_backupper.go:229" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*defaultResourceBackupper).backupResource" group=v1 logSource="pkg/backup/resource_backupper.go:229" namespace= resource=secrets
time="2020-05-11T20:18:40Z" level=error msg="Error listing items" backup=openshift-velero/hourly-object-backup-20200511201749 error="Get https://10.120.0.1:443/apis/apps/v1/replicasets: unexpected EOF" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/resource_backupper.go:229" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*defaultResourceBackupper).backupResource" group=apps/v1 logSource="pkg/backup/resource_backupper.go:229" namespace= resource=replicasets
Same here; we are also seeing similar issues that are fixed simply by retrying.
Just to add to the party, another error:
time="2020-10-07T01:01:38Z" level=error msg="Error listing items" error="etcdserver: leader changed" error.file="/github.com/vmware-tanzu/velero/pkg/backup/item_collector.go:234" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemCollector).getResourceItems" group=storage.k8s.io/v1 logSource="pkg/backup/item_collector.go:234" namespace= resource=volumeattachments
Hello Velero team (@carlisia),
We saw this issue and want to have a go at implementing the request. It appears to be a pretty straightforward request:
- Add a spec.maxRetries field to the Backups CRD
- Add a status.retryAttempts field to the Backups CRD
- Move a backup back to New if it is in the state PartiallyFailed, and increment retryAttempts until it reaches maxRetries

Do these sound like reasonable changes?
Thanks,
Clay and @astrieanna
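A rough, self-contained sketch of the state transition proposed above, with simplified stand-ins for the real Backup type and phase constants (assumptions for illustration, not Velero's actual controller code):

```go
package main

import "fmt"

// Simplified stand-in for the real Velero Backup type (assumption, not the actual API).
type Backup struct {
	Spec   struct{ MaxRetries int }
	Status struct {
		Phase         string
		RetryAttempts int
	}
}

const (
	phaseNew             = "New"
	phasePartiallyFailed = "PartiallyFailed"
)

// maybeRetry moves a PartiallyFailed backup back to New while retries remain,
// incrementing the retry counter. It reports whether a retry was scheduled.
func maybeRetry(b *Backup) bool {
	if b.Status.Phase != phasePartiallyFailed {
		return false
	}
	if b.Status.RetryAttempts >= b.Spec.MaxRetries {
		return false
	}
	b.Status.RetryAttempts++
	b.Status.Phase = phaseNew
	return true
}

func main() {
	var b Backup
	b.Spec.MaxRetries = 3
	b.Status.Phase = phasePartiallyFailed
	fmt.Println(maybeRetry(&b), b.Status.Phase, b.Status.RetryAttempts) // true New 1
}
```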
Hello again Velero team,
@astrieanna and I poked around at this some more, and we realized it was not as simple as we thought.
As a small recap: we got it to the point where it makes a second attempt, but the retry then fails because the backup already exists in object storage.
Here, we realized we just aren't familiar enough with the backup process yet 😅
With partial failures, is it okay that a backup already exists? Or should we clean everything up and try again? Do you have different kinds of partial failures that would mean different things?
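For what it's worth, the "clean everything up and try again" option being asked about could look roughly like the sketch below; the ObjectStore interface and the backups/&lt;name&gt;/ key layout are simplifications made up for illustration, not Velero's actual plugin API or storage layout:

```go
// Illustration only: pre-retry cleanup of a partially uploaded backup so the
// second attempt does not fail because the backup already exists in storage.
package retrysketch

import "fmt"

// ObjectStore is a deliberately minimal stand-in, not Velero's plugin interface.
type ObjectStore interface {
	ListObjects(bucket, prefix string) ([]string, error)
	DeleteObject(bucket, key string) error
}

// cleanupBeforeRetry deletes everything under the backup's prefix before a retry.
func cleanupBeforeRetry(store ObjectStore, bucket, backupName string) error {
	prefix := "backups/" + backupName + "/"
	keys, err := store.ListObjects(bucket, prefix)
	if err != nil {
		return fmt.Errorf("listing %s: %w", prefix, err)
	}
	for _, key := range keys {
		if err := store.DeleteObject(bucket, key); err != nil {
			return fmt.Errorf("deleting %s: %w", key, err)
		}
	}
	return nil
}
```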
Closing because the workaround is to rerun the backup.
@eleanor-millman Yes, that's a workaround, but having a workaround doesn't actually negate the feature request. Manually retrying a backup requires human intervention. This feature request is to allow the user to tell Velero, "in the case that this fails, please retry up to X many times".
On some of our larger clusters, retrying a backup from scratch can take many minutes. I'd rather Velero spend a few seconds retrying the upload.
@cblecker Agreed that the workaround isn't as good as having the feature implemented. Unfortunately, the Velero team is stretched way too thin to implement most of the feature requests that come in. Our issue backlog had grown to 411 issues as of last week, and we felt that wasn't great for the community, both for requesters like you whose requests are sitting in limbo and for Velero developers who have this mountain of issues to stare at. We are currently triaging them into issues we hope to tackle in 1.7/1.8 (i.e. in 2021), issues we hope to tackle in 2.0 (sometime in 2022 and perhaps 2023), and then the rest. We are closing the rest for now, since we feel that issues that won't be tackled for two years are not helped by sitting open, in limbo. While not a perfect decision mechanism, one of the things that led us to prioritize some issues over others was whether a workaround existed.
That being said, this triage concerns work that the maintainers and active contributors will be doing. If you or anyone else who commented on this issue is able to make a PR, we are delighted to chat with you about the possible design of the feature and are trying to prioritize PR review (something we were falling behind on with too much on our plates). So please let us know if this is something that interests you, and whether you'd like a suggestion on how to implement it.