Velero: GCP: Backup marked completed, even though snapshots are not ready yet

Created on 21 Feb 2020 · 11 comments · Source: vmware-tanzu/velero

What steps did you take and what happened:
In our automation, I create backups on one cluster and then restore them on another within the same GCP project.
To back up, I use:

velero backup create mybackup --storage-location gcp-s3 --include-namespaces mynamespace,myothernamespace --include-resources persistentvolumes --include-cluster-resources --snapshot-volumes --ttl 24h --wait

After some time the backup completes. However, when I check GCP -> Compute -> Snapshots, I see that not all snapshots are ready yet.
When restoring this 'completed' backup, I get error messages,
so my automation now includes a step that checks all snapshots from the backup and continues only when every snapshot is marked as ready.
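That gating step can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual automation: the pure readiness check is separated from the data source, and in practice the statuses would come from the GCP Compute API (`snapshots.get`, `status` field) or from `gcloud compute snapshots describe NAME --format='value(status)'`:

```go
package main

import "fmt"

// allReady reports whether every snapshot status is "READY".
// GCP disk snapshots report CREATING, UPLOADING, READY, FAILED, or DELETING.
func allReady(statuses []string) bool {
	for _, s := range statuses {
		if s != "READY" {
			return false
		}
	}
	return true
}

func main() {
	// In real automation these would be fetched per snapshot from GCP;
	// here we use sample values.
	statuses := []string{"READY", "UPLOADING"}
	fmt.Println(allReady(statuses)) // false: keep polling before restoring
}
```

A restore is only attempted once this check returns true for all snapshots referenced by the backup.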

What did you expect to happen:
When I pass --wait to the backup command, I expect the backup to be usable and every step involved to be finished before a 'completed' status is reported.

The output of the following commands will help us better understand what's going on:
velero restore describe myrestore --details

Name:         myrestore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs myrestore' for more information)

Errors:
  Velero:     <none>
  Cluster:  error executing PVAction for persistentvolumes/pvc-442c6854-971d-4c0b-acff-ca4e840ccf0e: rpc error: code = Unknown desc = googleapi: Error 400: The resource 'projects/my-gcp-project/global/snapshots/cluster-pvc-ab9b674c-601f-411e-b565-4b244f1c9ce1' is not ready, resourceNotReady
  Namespaces: <none>

Backup:  mybackup

Namespaces:
  Included:  mynamespace, myothernamespace
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Environment:

  • Velero version (use velero version):
Client:
    Version: v1.2.0
    Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
    Version: v1.2.0 
  • Velero features (use velero client config get features): None
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.7-gke.23", GitCommit:"06e05fd0390a51ea009245a90363f9161b6f2389", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:45Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: GKE
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): Ubuntu 18.04
Labels: Bug · Needs Product · P2 - Long-term important · Reviewed Q2 2021

All 11 comments

@boxcee yeah we're aware of this issue - see https://github.com/vmware-tanzu/velero/issues/1799 for another report (on AWS, but same issue).

ideally we'd model this as an additional phase on the backup, to indicate that the snapshots have been created but are not yet ready.

Why not flag it as 'BackupInProgress'?

I think it'd be useful to differentiate between "we're actively scraping the API to create this backup" vs. "we're waiting for the storage system to finish moving the snapshot data to durable storage" - more clear for users as to what's going on, and also likely makes some things easier on the back end (e.g. not blocking the backup controller queue if we're just waiting for snapshots to be ready).

But, all of this probably needs some more thought and design. If you're interested in working on this, we're happy to provide feedback.
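The phase distinction proposed above could be sketched like this. Note this is purely hypothetical: Velero v1.2 has no `WaitingForSnapshots` phase, and the name is an assumption, not an upstream API:

```go
package main

import "fmt"

// Hypothetical backup phases sketching the suggestion above. Velero v1.2
// only defines phases such as New, InProgress, Completed, and Failed;
// "WaitingForSnapshots" does not exist upstream.
const (
	BackupPhaseInProgress          = "InProgress"          // actively scraping the API server
	BackupPhaseWaitingForSnapshots = "WaitingForSnapshots" // hypothetical: snapshots created, not yet durable
	BackupPhaseCompleted           = "Completed"           // only once the provider reports all snapshots ready
)

func main() {
	// A --wait client would treat both in-flight phases as "not done yet"
	// and return only on Completed (or a terminal failure phase).
	for _, p := range []string{BackupPhaseInProgress, BackupPhaseWaitingForSnapshots, BackupPhaseCompleted} {
		fmt.Printf("%s done=%v\n", p, p == BackupPhaseCompleted)
	}
}
```

This would also let the controller stop blocking its work queue while waiting on the storage provider, as noted above.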

Hm, my concern is specifically with the --wait flag. From a user's point of view, I expect the backup to be finished when I explicitly pass --wait, but it isn't. The backup is not really 'Completed'; only parts of it are.

Anyway, I am interested in finding a solution for this. Will spend some time looking into this.

Ah, another thing I experienced, which is indirectly related.

When storage quotas are reached while a backup and its snapshots are being created, the snapshots silently fail.

Am trying to create a setup and test this further.

> When the storage quotas are reached and a backup and snapshots are being created the snapshots will silently fail.

Possibly related to #2212 or #2255?

@skriss btw regarding https://github.com/vmware-tanzu/velero/issues/1799 -- while we initially found it on AWS, the same problem showed up on Azure and GCP.

@skriss I know we'd discussed options around doing this before, but we ended up fixing it in our fork (temporarily). I'm happy to remove our local commits if an upstream solution is found. Here's what we put in place: https://github.com/konveyor/velero-plugin-for-aws/pull/2
https://github.com/konveyor/velero-plugin-for-gcp/pull/2
https://github.com/konveyor/velero-plugin-for-microsoft-azure/pull/2

Nice @sseago. Unfortunately we don't want to run our own fork, so we want to keep the code outside of Velero until Velero supports this.

@skriss do you know how we could get the Volume Snapshot ID to check in GKE without hooking into Velero directly?

If we run:

  backup, err := b.vc.VeleroV1().Backups(namespace).Get(name, metav1.GetOptions{})

I think we only get the BackupSpec, which I don't think includes the volume snapshot IDs on GKE. Can you point me in the right direction with the API to get the IDs so we can monitor them ourselves before running a restore?
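One possible approach (a sketch, not an officially documented answer): Velero v1.2 writes the snapshot metadata to object storage as `<backup>-volumesnapshots.json.gz`, which can be fetched via a `DownloadRequest` of kind `BackupVolumeSnapshots` and gunzipped. The struct below mirrors only the fields of Velero's `volume.Snapshot` type needed here; treat the exact field names as an assumption to verify against your Velero version:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// volumeSnapshot mirrors the relevant fields of the entries in the
// <backup>-volumesnapshots.json.gz object (assumed shape, per Velero v1.2).
type volumeSnapshot struct {
	Spec struct {
		PersistentVolumeName string `json:"persistentVolumeName"`
		ProviderVolumeID     string `json:"providerVolumeID"`
	} `json:"spec"`
	Status struct {
		ProviderSnapshotID string `json:"providerSnapshotID"`
		Phase              string `json:"phase"`
	} `json:"status"`
}

// snapshotIDs extracts the cloud-provider snapshot IDs from the decoded,
// gunzipped JSON payload.
func snapshotIDs(data []byte) ([]string, error) {
	var snaps []volumeSnapshot
	if err := json.Unmarshal(data, &snaps); err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(snaps))
	for _, s := range snaps {
		ids = append(ids, s.Status.ProviderSnapshotID)
	}
	return ids, nil
}

func main() {
	// Sample payload; in practice this comes from the DownloadRequest URL.
	sample := []byte(`[{"spec":{"persistentVolumeName":"pvc-442c","providerVolumeID":"vol-1"},` +
		`"status":{"providerSnapshotID":"cluster-pvc-ab9b","phase":"Completed"}}]`)
	ids, err := snapshotIDs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // the snapshot names to poll in GCP Compute -> Snapshots
}
```

Those IDs are the names to poll for READY status in GCP before restoring.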

Should be covered by #3533.
