Velero: GCP: Backup marked completed, even though snapshots are not ready yet

Created on 21 Feb 2020 · 11 comments · Source: vmware-tanzu/velero

What steps did you take and what happened:
In our automation, I create backups on one cluster and then restore them on another within the same GCP project.
To back up, I use:

velero backup create mybackup --storage-location gcp-s3 --include-namespaces mynamespace,myothernamespace --include-resources persistentvolumes --include-cluster-resources --snapshot-volumes --ttl 24h --wait

After some time the backup completes. However, when I check GCP -> Compute -> Snapshots, I see that not all snapshots are ready yet.
When restoring this 'completed' backup, I get error messages,
so my automation now includes a step that checks all snapshots from the backup and continues only when every snapshot is marked as ready.
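That gating step can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual automation: the pure readiness check is separated from the data source, and in practice the statuses would come from the GCP Compute API (`snapshots.get`, `status` field) or from `gcloud compute snapshots describe NAME --format='value(status)'`:

```go
package main

import "fmt"

// allReady reports whether every snapshot status is "READY".
// GCP disk snapshots report CREATING, UPLOADING, READY, FAILED, or DELETING.
func allReady(statuses []string) bool {
	for _, s := range statuses {
		if s != "READY" {
			return false
		}
	}
	return true
}

func main() {
	// In real automation these would be fetched per snapshot from GCP;
	// here we use sample values.
	statuses := []string{"READY", "UPLOADING"}
	fmt.Println(allReady(statuses)) // false: keep polling before restoring
}
```

A restore is only attempted once this check returns true for all snapshots referenced by the backup.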

What did you expect to happen:
When I pass --wait to the backup command, I expect the backup to be usable and every step involved to be finished before a 'completed' status is reported.

The output of the following commands will help us better understand what's going on:
velero restore describe myrestore --details

Name:         myrestore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs myrestore' for more information)

Errors:
  Velero:     <none>
  Cluster:  error executing PVAction for persistentvolumes/pvc-442c6854-971d-4c0b-acff-ca4e840ccf0e: rpc error: code = Unknown desc = googleapi: Error 400: The resource 'projects/my-gcp-project/global/snapshots/cluster-pvc-ab9b674c-601f-411e-b565-4b244f1c9ce1' is not ready, resourceNotReady
  Namespaces: <none>

Backup:  mybackup

Namespaces:
  Included:  mynamespace, myothernamespace
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Environment:

  • Velero version (use velero version):
Client:
    Version: v1.2.0
    Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
    Version: v1.2.0 
  • Velero features (use velero client config get features): None
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.7-gke.23", GitCommit:"06e05fd0390a51ea009245a90363f9161b6f2389", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:45Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: GKE
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): Ubuntu 18.04
Labels: Bug · Needs Product · P2 - Long-term important · Reviewed Q2 2021

All 11 comments

@boxcee yeah we're aware of this issue - see https://github.com/vmware-tanzu/velero/issues/1799 for another report (on AWS, but same issue).

ideally we'd model this as an additional phase on the backup, to indicate that the snapshots have been created but are not yet ready.

Why not flag it as 'BackupInProgress'?

I think it'd be useful to differentiate between "we're actively scraping the API to create this backup" vs. "we're waiting for the storage system to finish moving the snapshot data to durable storage" - more clear for users as to what's going on, and also likely makes some things easier on the back end (e.g. not blocking the backup controller queue if we're just waiting for snapshots to be ready).

But, all of this probably needs some more thought and design. If you're interested in working on this, we're happy to provide feedback.
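The phase distinction proposed above could be sketched like this. Note this is purely hypothetical: Velero v1.2 has no `WaitingForSnapshots` phase, and the name is an assumption, not an upstream API:

```go
package main

import "fmt"

// Hypothetical backup phases sketching the suggestion above. Velero v1.2
// only defines phases such as New, InProgress, Completed, and Failed;
// "WaitingForSnapshots" does not exist upstream.
const (
	BackupPhaseInProgress          = "InProgress"          // actively scraping the API server
	BackupPhaseWaitingForSnapshots = "WaitingForSnapshots" // hypothetical: snapshots created, not yet durable
	BackupPhaseCompleted           = "Completed"           // only once the provider reports all snapshots ready
)

func main() {
	// A --wait client would treat both in-flight phases as "not done yet"
	// and return only on Completed (or a terminal failure phase).
	for _, p := range []string{BackupPhaseInProgress, BackupPhaseWaitingForSnapshots, BackupPhaseCompleted} {
		fmt.Printf("%s done=%v\n", p, p == BackupPhaseCompleted)
	}
}
```

This would also let the controller stop blocking its work queue while waiting on the storage provider, as noted above.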

Hm, my concern is specifically with the --wait flag. From a user's point of view, I expect the backup to be finished when I explicitly pass --wait, but it isn't. The backup is not really 'Completed'; only parts of it are.

Anyway, I am interested in finding a solution for this. Will spend some time looking into this.

Ah, another thing I experienced, which is indirectly related.

When storage quotas are reached while a backup and its snapshots are being created, the snapshots silently fail.

Am trying to create a setup and test this further.

> When the storage quotas are reached and a backup and snapshots are being created the snapshots will silently fail.

Possibly related to #2212 or #2255?

@skriss btw regarding https://github.com/vmware-tanzu/velero/issues/1799 -- while we initially found it on AWS, the same problem showed up on Azure and GCP.

@skriss I know we'd discussed options around doing this before, but we ended up fixing it in our fork (temporarily). I'm happy to remove our local commits if an upstream solution is found. Here's what we put in place: https://github.com/konveyor/velero-plugin-for-aws/pull/2
https://github.com/konveyor/velero-plugin-for-gcp/pull/2
https://github.com/konveyor/velero-plugin-for-microsoft-azure/pull/2

Nice @sseago. Unfortunately we don't want to run our own fork, so we want to keep the code outside of Velero until Velero supports this.

@skriss do you know how we could get the Volume Snapshot ID to check in GKE without hooking into Velero directly?

If we run:

  backup, err := b.vc.VeleroV1().Backups(namespace).Get(name, metav1.GetOptions{})

I think we only get the BackupSpec, which I don't think includes the volume snapshot IDs on GKE. Can you point me in the right direction with the API to get the IDs so we can monitor them ourselves before running a restore?
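One possible approach (a sketch, not an officially documented answer): Velero v1.2 writes the snapshot metadata to object storage as `<backup>-volumesnapshots.json.gz`, which can be fetched via a `DownloadRequest` of kind `BackupVolumeSnapshots` and gunzipped. The struct below mirrors only the fields of Velero's `volume.Snapshot` type needed here; treat the exact field names as an assumption to verify against your Velero version:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// volumeSnapshot mirrors the relevant fields of the entries in the
// <backup>-volumesnapshots.json.gz object (assumed shape, per Velero v1.2).
type volumeSnapshot struct {
	Spec struct {
		PersistentVolumeName string `json:"persistentVolumeName"`
		ProviderVolumeID     string `json:"providerVolumeID"`
	} `json:"spec"`
	Status struct {
		ProviderSnapshotID string `json:"providerSnapshotID"`
		Phase              string `json:"phase"`
	} `json:"status"`
}

// snapshotIDs extracts the cloud-provider snapshot IDs from the decoded,
// gunzipped JSON payload.
func snapshotIDs(data []byte) ([]string, error) {
	var snaps []volumeSnapshot
	if err := json.Unmarshal(data, &snaps); err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(snaps))
	for _, s := range snaps {
		ids = append(ids, s.Status.ProviderSnapshotID)
	}
	return ids, nil
}

func main() {
	// Sample payload; in practice this comes from the DownloadRequest URL.
	sample := []byte(`[{"spec":{"persistentVolumeName":"pvc-442c","providerVolumeID":"vol-1"},` +
		`"status":{"providerSnapshotID":"cluster-pvc-ab9b","phase":"Completed"}}]`)
	ids, err := snapshotIDs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // the snapshot names to poll in GCP Compute -> Snapshots
}
```

Those IDs are the names to poll for READY status in GCP before restoring.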

Should be covered by #3533.
