Velero: Avoid deleting backup files from backup storage

Created on 13 Mar 2018 · 13 comments · Source: vmware-tanzu/velero

I'm playing around with ark to protect our cluster in case of a disaster. I followed the examples for setting up on Google Cloud and created a Schedule to back up hourly. With that in place, I wanted to verify what happens when an operator accidentally deletes the cluster or necessary resources. I tested such a scenario by deleting the heptio-ark namespace containing the CRD objects. It took me quite by surprise that deleting the namespace also emptied the Google Cloud Storage buckets containing the actual backup files.

I guess this would also happen if someone deleted the cluster? Is there a supported way in ark to avoid deleting any files from backup storage, or do you suggest another approach to protect against operator errors?

I'm using ark 0.7.1.

Bug P0 - Hair on fire

All 13 comments

If you do anything that results in the deletion of a backup from Kubernetes, such as:

  1. ark backup delete my-backup
  2. kubectl -n heptio-ark delete backup/my-backup
  3. kubectl delete ns/heptio-ark

this will trigger the full backup deletion process, including removing backup tarballs from object storage, persistent disk snapshots created by ark, and any associated restores.

We have an RFE to mark backups so they aren't deleted when they're past their expiration date (#251), but that would only affect expiration, not deletion.

We added the backup deletion functionality in v0.7.0. Prior to that, you had to manually delete everything from backup storage and use kubectl to delete the backup from Kubernetes. We had several people ask for a way to delete backups, so we added this to v0.7.0.

We are working on designing a new feature called "backup targets" that is going to allow for backup data to be replicated to multiple backup storage destinations. Perhaps we could evolve the deletion aspect as part of this.

Do you have any suggestions for how to make this better while still allowing people to delete backups?

cc @jbeda @skriss

I think it would be good to have an option to disallow any deletion from the bucket storage. Cleaning up the backup files can be configured via TTLs on the bucket itself (at least in GCP, but I guess this is true for AWS/Azure as well), and it protects against manual operator errors when using ark or kubectl. AFAIU, ark should be able to recover from just that.
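
For context, a bucket-side TTL on GCS looks roughly like this; a minimal sketch, assuming a placeholder bucket name `my-ark-backups` and an arbitrary 30-day retention:

```shell
# Sketch only: bucket name and retention age are placeholders, not values
# from this thread. Write a lifecycle policy that deletes objects older
# than 30 days.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 30}}
  ]
}
EOF
# Apply it if the gcloud SDK is installed (requires GCP credentials).
if command -v gsutil >/dev/null; then
  gsutil lifecycle set lifecycle.json gs://my-ark-backups
fi
```

With a rule like this in place, expiration is handled entirely on the storage side, so ark would never need delete permissions on the bucket.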

Making this option opt-in would still allow people to delete their backups manually as usual. WDYT, does that make sense in the broader view of ark as well?

I think we can address #376 as well as this if we change the logic for how we delete backups.

Right now we use a finalizer so that when ark backup delete and kubectl delete backup issue a DELETE request, it sets the deletionTimestamp and waits for the finalizer to be removed. The Ark server sees the deleted backup, deletes everything associated with the backup (files in object storage, PV snapshots, restores in kube/etcd), and then removes the finalizer, which allows the kube apiserver to finally delete it.
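
The finalizer flow above can be sketched as follows (illustrative only; the finalizer string and timestamp are assumptions, not ark's exact values):

```yaml
# Illustrative sketch of the current mechanism, not ark's exact schema.
apiVersion: ark.heptio.com/v1
kind: Backup
metadata:
  name: my-backup
  namespace: heptio-ark
  finalizers:
    - gc.ark.heptio.com        # assumed finalizer name
  # After a DELETE request, the apiserver only marks the object:
  #   deletionTimestamp: "2018-03-13T12:00:00Z"
  # The Ark server sees the timestamp, deletes the associated data, and
  # removes the finalizer; only then does the apiserver remove the object.
```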

As I pointed out above, this happens for ark backup delete, kubectl delete backup, and kubectl delete namespace/heptio-ark.

Instead of this, we can have ark backup delete POST a new DeleteBackupRequest resource. The Ark server code that currently watches for backups with a deletionTimestamp will instead watch for DeleteBackupRequest objects, then perform the deletion logic.
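
A hypothetical shape for such a resource, assuming a `spec.backupName` field (the actual schema would be decided during design):

```yaml
# Hypothetical sketch of the proposed resource; field names are assumptions.
apiVersion: ark.heptio.com/v1
kind: DeleteBackupRequest
metadata:
  name: delete-my-backup
  namespace: heptio-ark
spec:
  backupName: my-backup   # the backup whose data should be removed
```

Because deletion is now an explicit POST of a new object rather than a DELETE of the Backup itself, removing the namespace (or the Backup resource) no longer implies removing data from object storage.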

If you kubectl delete -n heptio-ark backup/foo, it will only remove the backup from kube/etcd. The next time the backup sync controller runs its sync loop, assuming foo is still in object storage, the backup resource will be re-added to kube/etcd.

@skriss @rdodev @rbankston @jpweber @jbeda @nrb wdyt?

Seeing as this affects PV snapshots as well, I'm all for erring on the side of having more data mysteriously show up than unexpectedly deleted. 👍

We have a couple of options for how this is implemented and what kind of output we show to the user. This is what kubectl does when deleting multiple items:

$ kubectl delete configmap -l x=y
configmap "a1" deleted
configmap "a2" deleted
configmap "a3" deleted
configmap "a4" deleted
configmap "a5" deleted

kubectl issues a GET request to retrieve all items matching the label selector x=y. For each item, it issues an individual DELETE request and prints that the item was deleted.

For Ark, we'll be POSTing a DeleteBackupRequest. Here is where we have a choice:

  1. GET all matching backups and then send 1 request per backup, just like kubectl does above
  2. Send 1 request containing either a list of backup names, or the label selector, and let the Ark server figure out what to delete. Instead of printing out each backup to be deleted, it would print out a more generic Request to delete backup(s) sent to the Ark server. Check to see if they've been deleted by running 'ark backup get'.
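
Option 1's client-side flow can be sketched as below (hypothetical; the backup names are stubbed in place of a real apiserver query such as `kubectl -n heptio-ark get backups -l x=y -o name`):

```shell
# Hypothetical sketch of option 1: resolve the selector into names first,
# then submit one deletion request per backup, printing each like kubectl.
backups="nightly-01 nightly-02"   # stand-in for the apiserver query result
requested=""
for b in $backups; do
  # A real CLI would POST a DeleteBackupRequest naming "$b" here.
  echo "request to delete backup \"$b\" submitted"
  requested="$requested $b"
done
```

This mirrors kubectl's behavior: the user sees one line per backup, and each request can carry its own status for later inspection.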

I think I am leaning toward option 1. We can record the deletion status on each DeleteBackupRequest and the user could inspect that if desired.

Any strong feelings one way or another?

One way to think about this is perhaps as separate controllers. Have one controller responsible for creating backups and another for deleting/GCing backups. Having clear semantics about how these hand off would help make ark more flexible and this stuff more obvious.

Separate question -- if you delete a backup through kubectl (and don't POST the DeleteBackupRequest) won't ark just scan the storage and recreate the backup?

It seems to me that there are really two modes here:

  • Backup bucket is plan of record -- the resources in the cluster are a shadow of what is going on in the bucket. Interactions with the shadow don't impact the real stuff unless explicit. The "right" way to model deleting a backup would be to delete the underlying bucket storage and let the shadow reflect it.
  • Kube resources are plan of record -- in this scheme if it doesn't exist in k8s then it shouldn't exist in the bucket. Deleting the resource naturally deletes the bucket.

I think the problem is that we aren't clear about the modal nature and are trying to have it both ways.

Brainstorming a model here:

  • A backup-read-sync controller -- this reads from a bucket and puts stuff in a k8s cluster.
  • A backup-finalizer controller -- this looks for finalizers and deletes stuff from a bucket
  • A set of tools to perhaps run the read-sync in a one-shot mode or to manage buckets directly. Not sure about creds for those.

You'd run these controllers in different configurations to get the modes above. Would also provide a way to fix things up or bootstrap stuff with one shot tools.

Let me know if you want to get together and walk through this. Or if I'm just crazy :)

We do have separate controllers:

  1. A backup controller that creates backups
  2. A GC controller to handle deletions, either because the user explicitly asked for a deletion, or because a backup has expired based on its TTL
  3. A backup sync controller that reads from the bucket and creates the backup object in kube/etcd if it doesn't exist

Our operating mode is the bucket is the plan of record.

We have had a request for the ability to run the backup sync in a one shot mode (#104) but we haven't implemented it yet.

I'm suggesting we change _how_ we implement a user-requested delete. Instead of using a finalizer and sending a DELETE backup request, we get rid of the finalizer and send a POST delete-backup request.

The alternative is we remove backup deletion entirely from the Ark server & cli and instead have some other external tool or helper to deal with this. I still think I prefer the delete-backup request if that works for you.

Having data in the bucket only deleted when you do an explicit ark backup delete or kubectl delete backups.ark.heptio.com ... sounds reasonable to me.

The problem with kubectl delete backup/foo is that it is effectively what is called when you delete a namespace. The only way to fix this is to make it so DELETE /apis/ark.heptio.com/v1/namespaces/heptio-ark/backups/foo does not delete from object storage. Instead, we'll do POST /apis/ark.heptio.com/v1/namespaces/heptio-ark/deletebackuprequests (and this is what ark backup delete will do).

I like the DeleteBackupRequest model, and displaying each backup to be deleted individually. IMO, it's worth being explicit here so as to reduce user surprises.

We do have the bucket as a source of truth documented at https://github.com/heptio/ark/blob/master/docs/about.md#object-storage-sync. Perhaps that could be surfaced better.

Huge proponent of explicit over implicit in general. Implicit deletes are almost never a good idea -- but especially when we're talking about backups and data volumes, IMO.

Hey @ncdc and fellows, big thanks from my side for taking my concerns seriously, and providing great support for a great tool!! :cake:

Thanks @marco-jantke!!!
