Velero: Removal of expired backups does not work

Created on 18 Nov 2020 · 7 comments · Source: vmware-tanzu/velero

What steps did you take and what happened:

In our AWS-based setup, when scheduled backups reach their TTL, the deletion process starts but gets stuck in the Deleting phase. The backup contents in the S3 bucket are deleted properly, but the volume snapshots remain (causing significant extra cost).

What did you expect to happen:

I expect backups to be cleanly removed when their TTL expires, including all backed-up data, such as volume snapshots.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
time="2020-11-18T08:23:47Z" level=info msg="Removing existing deletion requests for backup" backup=velero-nightly-backup-20201103034355 controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:469" name=velero-nightly-backup-20201103034355-gt6b9 namespace=velero
time="2020-11-18T08:23:50Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=backup-deletion error="error downloading backup: error copying Backup to temp file: rpc error: code = Unknown desc = error getting object backups/velero-nightly-backup-20201103034355/velero-nightly-backup-20201103034355.tar.gz: NoSuchKey: The specified key does not exist.\n\tstatus code: 404, request id: 01DBEB5FABBF40BD, host id: HKe3B0heM0NpUhxXbLEZp7THCXtsfDKJkYdR6Sg0bS3+j0ywshitElmEnG7mPdDNmq6ASEtKT6w=" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/controller/restore_controller.go:558" error.function=github.com/vmware-tanzu/velero/pkg/controller.downloadToTempFile key=velero/velero-nightly-backup-20201103034355-gt6b9 logSource="pkg/controller/generic_controller.go:140"
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
Name:         velero-nightly-backup-20201103034355
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Tiller
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-2.0.3
              velero.io/schedule-name=velero-nightly-backup
              velero.io/storage-location=aws
Annotations:  <none>

Phase:  Deleting

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  aws

Velero-Native Snapshot PVs:  auto

TTL:  360h0m0s

Hooks:  <none>

Backup Format Version:

Started:    2020-11-03 04:43:55 +0100 CET
Completed:  2020-11-03 04:52:17 +0100 CET

Expiration:  2020-11-18 04:43:55 +0100 CET

Velero-Native Snapshots:  2 of 2 snapshots completed successfully (specify --details for more information)

Deletion Attempts:
  2020-11-18 06:30:25 +0100 CET: InProgress
  • velero backup logs <backupname>
Logs for backup "velero-nightly-backup-20201103034355" are not available until it's finished processing. Please wait until the backup has a phase of Completed or Failed and try again.

Anything else you would like to add:

My guess is that this is at least loosely related to https://github.com/vmware-tanzu/velero/pull/2993.

Environment:

Velero version:

Client:
    Version: v1.5.2
    Git commit: -
Server:
    Version: v1.5.2

Velero features:

features: <NOT SET>

Kubernetes version:
1.18.9

Kubernetes installer & version:
kops 1.18.1

Cloud provider or hardware configuration:
AWS (with aws plugin)

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • :+1: for "I would like to see this bug fixed as soon as possible"
  • :-1: for "There are more important bugs to focus on right now"
Labels: Bug, Duplicate, Reviewed Q2 2021

All 7 comments

This is definitely a bug.
The correct fix is to not delete the Backup from the object store when there are errors.
In this case, there might have been errors deleting volume snapshots at https://github.com/vmware-tanzu/velero/blob/main/pkg/controller/backup_deletion_controller.go#L349
And these errors are ignored to eventually delete the backup from the object store at https://github.com/vmware-tanzu/velero/blob/main/pkg/controller/backup_deletion_controller.go#L362

The correct fix should be to move the

    if backupStore != nil {
        log.Info("Removing backup from backup storage")
        if err := backupStore.DeleteBackup(backup.Name); err != nil {
            errs = append(errs, err.Error())
        }
    }

into the

    if len(errs) == 0 {
        // Only try to delete the backup object from kube if everything preceding went smoothly
        err = c.backupClient.Backups(backup.Namespace).Delete(context.TODO(), backup.Name, metav1.DeleteOptions{})
        if err != nil {
            errs = append(errs, errors.Wrapf(err, "error deleting backup %s", kube.NamespaceAndName(backup)).Error())
        }
    }

at https://github.com/vmware-tanzu/velero/blob/main/pkg/controller/backup_deletion_controller.go#L410

cc @zubron because you are authoring PR #2993

I am closing this as a duplicate of #2980
I've copied a link to my above comment into my code review of #2993

https://github.com/vmware-tanzu/velero/pull/2993#issuecomment-730440597
Based on this comment, reopening this to work on a fix.

I think I'm experiencing this as well. Is it "safe" to manually delete the backups.velero.io objects that seem to be stuck deleting (e.g. k delete backups.velero.io -n velero velero-daily-backup-20201212060042)?


@billimek For me it was the only way to clean this up, and I have not encountered any problems afterwards so far. But still, it's just guessing 😉
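For reference, the manual workaround discussed above amounts to deleting the stuck objects directly. The commands below are illustrative only (the backup name is an example from the comment above, and the label selector assumes the `velero.io/backup-name` label that Velero sets on its DeleteBackupRequest objects):

    # Inspect stuck backups (phase Deleting)
    kubectl get backups.velero.io -n velero

    # Remove the stuck Backup custom resource directly.
    # NOTE: this bypasses Velero's deletion controller, so any cloud
    # volume snapshots that were not cleaned up must be deleted by hand
    # (e.g. in the AWS EC2 console) to avoid ongoing snapshot costs.
    kubectl delete backups.velero.io -n velero velero-daily-backup-20201212060042

    # Clean up any leftover deletion requests for that backup.
    kubectl delete deletebackuprequests.velero.io -n velero \
      -l velero.io/backup-name=velero-daily-backup-20201212060042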

Please add a velero backup delete --force parameter.
I do not think kubectl delete backup is a good idea.

@ashish-amarnath What's the status on this?
