Velero: Unable to delete backup if cloud resources have already been deleted

Created on 9 Feb 2018 · 8 comments · Source: vmware-tanzu/velero

If, for whatever reason, the cloud snapshots associated with a backup have already been deleted, ark backup delete and the GC controller will never be able to delete the backup.

The GC controller currently requires that the following deletions succeed before it will delete the backup (by removing our finalizer):

  • volume snapshots
  • backup directory & files in object storage
  • associated restores

An initial improvement could be to treat "not found" errors as non-errors when trying to delete the items in the above list.

Bug P1 - Important

Most helpful comment

I'm facing the same problem with velero v1.5.1
The backup is in "Deleting" state since days and in the logs:

    level=error msg="Error in syncHandler, re-adding item to queue" controller=backup-deletion error="error downloading backup: error copying Backup to temp file: rpc error: code = Unknown desc = storage: service returned error: StatusCode=404, ErrorCode=BlobNotFound, ErrorMessage=The specified blob does not exist.\nRequestId:75a99b8d-f01e-004e-175c-963ab1000000\nTime:2020-09-29T12:30:26.7728994Z, RequestInitiated=Tue, 29 Sep 2020 12:30:26 GMT, RequestId=75a99b8d-f01e-004e-175c-963ab1000000, API Version=2018-03-28, QueryParameterName=, QueryParameterValue=" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/controller/restore_controller.go:558" error.function=github.com/vmware-tanzu/velero/pkg/controller.downloadToTempFile key=velero/velero-s-24h-20200826022813-766pb logSource="pkg/controller/generic_controller.go:140"

All 8 comments

Another thought: we could add ark backup delete --force that would either just issue a delete, or if the backup already has a finalizer, just patch it to remove the finalizer, thus avoiding GC.

--force is a very good idea: skip the checks for cloud resources (since they're gone) and just delete the backup reference from ark.

Would an 'if force=true then ignore' condition here work? https://github.com/heptio/ark/blob/master/pkg/controller/gc_controller.go#L130

@dogopupper the code you linked to is in a controller in the ark server, so it can't easily react to command line flags from ark backup delete. But we could put the --force check inside the client code that deletes the backup.
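For illustration, here's roughly the JSON merge patch such a client-side --force could send to strip the finalizers, letting the API server delete the Backup object without waiting for GC to succeed (forcePatch is a hypothetical helper; the real client would submit these bytes to the Kubernetes API as a merge patch):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// forcePatch builds the JSON merge patch a hypothetical
// `ark backup delete --force` could send to clear the finalizers.
// Setting finalizers to null removes them, which unblocks deletion.
func forcePatch() ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"finalizers": nil,
		},
	}
	return json.Marshal(patch)
}

func main() {
	p, err := forcePatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p)) // {"metadata":{"finalizers":null}}
}
```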

I did some research on the gRPC plugin boundary and error handling (since the GC Controller runs in the Ark server and the calls to delete disk snapshots and backup tarballs run in plugins). If we want to do something like this:

func (bs *BlockStore) DeleteSnapshot(...) error {
  return myCloudProvider.DeleteSnapshot(...)
}

if the error is non-nil, when it crosses the gRPC boundary, all that's seen by the Snapshot Service and GC Controller running in the Ark server is something like this:

rpc error: code = Unknown desc = googleapi: Error 404: The resource 'projects/andy-heptio/global/snapshots/gke-cluster-1-97c08e21-pvc-e83b13c3-be30-4722-9806-15c7ee4a48ea' was not found, notFound

As you can see, it's a generic Unknown error code with a text description. That doesn't readily lend itself to executing different logic for not-found errors vs. others, since we'd have to parse the error text, which is fragile.

gRPC does support multiple error codes, each with a unique meaning. We could, for example, say that any code running in a plugin must do something like this:

func (bs *BlockStore) DeleteSnapshot(...) error {
  err := myCloudProvider.DeleteSnapshot(...)
  if myCloudProvider.IsNotFound(err) {
    // status.Errorf is the gRPC way to return a specific error code and description
    return status.Errorf(codes.NotFound, err.Error())
  }
  return err
}

I'm not super crazy about this, as it limits us to the list of predefined gRPC error codes and it makes it difficult to include a stack trace.

An alternative solution would be to investigate something like https://hackernoon.com/handling-errors-in-golang-grpc-and-go-kit-services-d0fa0a112449.

In the short term, we'll add --force, which will print a warning requiring yes/no confirmation stating that this will immediately remove the backup and that cloud resources may be orphaned, along with a --confirm flag to skip the prompt.

@skriss will test this

We've already addressed this for volume snapshots, so that not-found snapshots are not a blocking error for backup deletion.

This should also be true for object storage, so missing objects don't block the deletion from going through. Testing status:

  • [x] GKE
  • [x] AWS
  • [x] Azure

OK, confirmed on AWS/Azure/GKE that backup deletions go through even if the files in object storage are missing. Closing this issue as resolved.
