Velero: Restic restore stuck InProgress while restoring PV with "volumeBindingMode: WaitForFirstConsumer" storage class

Created on 24 Sep 2020 · 2 Comments · Source: vmware-tanzu/velero

What steps did you take and what happened:

  1. Create a storage class with volumeBindingMode: WaitForFirstConsumer.
  2. Create sample workload with PV using this storage class.
  3. Write some data to it.
  4. Take a backup.
  5. Remove namespace with sample workload.
  6. Restore a backup.
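
For reference, the storage class from step 1 could look roughly like the manifest built below. This is a minimal sketch, not the reporter's actual configuration: the name `hcloud-volumes-wffc` is hypothetical, and the provisioner value `csi.hetzner.cloud` is an assumption based on the Hetzner CSI driver mentioned later in this report.

```python
import json


def storage_class(name, provisioner, binding_mode):
    """Return a StorageClass manifest as a plain dict."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        # WaitForFirstConsumer delays binding and provisioning of the volume
        # until a Pod that uses the PVC is scheduled.
        "volumeBindingMode": binding_mode,
    }


# Hypothetical name; the provisioner value is an assumption (Hetzner CSI driver).
manifest = storage_class(
    "hcloud-volumes-wffc", "csi.hetzner.cloud", "WaitForFirstConsumer"
)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests, so the printed output can be piped into `kubectl apply -f -`.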

What's happening then:

  • The Restore object is stuck forever (or for a very long time) in the InProgress state.
  • The workload starts, but no init container gets injected, so no data is restored.
  • Velero logs say the restic restore action has run.
  • A PodVolumeRestore object gets created, but its state is never updated.
  • Restic logs show nothing at the info log level.
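
One way to confirm the PodVolumeRestore symptom is to list the objects as JSON and look for those that never reached a terminal phase. The helper below is a sketch under assumptions: the function name and the sample data are made up for illustration, and the input is assumed to be the output of `kubectl get podvolumerestores -n velero -o json`.

```python
import json

# Phases after which a PodVolumeRestore is no longer expected to change.
TERMINAL_PHASES = {"Completed", "Failed"}


def pending_restores(list_json):
    """Return (name, phase) for every PodVolumeRestore not in a terminal phase."""
    items = json.loads(list_json).get("items", [])
    pending = []
    for item in items:
        # An object whose state was never updated has no status or no phase.
        phase = (item.get("status") or {}).get("phase", "")
        if phase not in TERMINAL_PHASES:
            pending.append((item["metadata"]["name"], phase or "<none>"))
    return pending


# Made-up sample data standing in for real kubectl output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "restore-a"}},
        {"metadata": {"name": "restore-b"}, "status": {"phase": "InProgress"}},
        {"metadata": {"name": "restore-c"}, "status": {"phase": "Completed"}},
    ]
})

for name, phase in pending_restores(sample):
    print(f"{name}: {phase}")
```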

What did you expect to happen:
Data in PV should be restored from backup.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

As a workaround, one can create a copy of the storage class in use, set volumeBindingMode: Immediate on the copy, and use the https://velero.io/docs/v1.5/restore-reference/#changing-pvpvc-storage-classes feature to map restored volumes to that copy.
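
The workaround can be sketched as two manifests: a copy of the storage class with the binding mode changed, and the ConfigMap that the linked Velero feature reads to map the old class to the new one. The helper names and the storage class names below are hypothetical; the ConfigMap labels follow the linked Velero documentation.

```python
import copy
import json


def immediate_copy(storage_class, new_name):
    """Copy a StorageClass manifest, switching it to Immediate binding."""
    sc = copy.deepcopy(storage_class)
    # Replace metadata wholesale: this drops server-populated fields
    # (uid, resourceVersion) and also any labels or annotations.
    sc["metadata"] = {"name": new_name}
    sc["volumeBindingMode"] = "Immediate"
    return sc


def change_storage_class_config(old_name, new_name, namespace="velero"):
    """Build the ConfigMap used by Velero's change-storage-class restore action."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "change-storage-class-config",
            "namespace": namespace,
            "labels": {
                "velero.io/plugin-config": "",
                "velero.io/change-storage-class": "RestoreItemAction",
            },
        },
        # Key: storage class recorded in the backup; value: class to restore into.
        "data": {old_name: new_name},
    }


# Hypothetical original class; the provisioner value is an assumption.
original = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "hcloud-volumes-wffc", "uid": "example-uid"},
    "provisioner": "csi.hetzner.cloud",
    "volumeBindingMode": "WaitForFirstConsumer",
}

clone = immediate_copy(original, "hcloud-volumes-immediate")
config = change_storage_class_config("hcloud-volumes-wffc", "hcloud-volumes-immediate")
print(json.dumps([clone, config], indent=2))
```

Both manifests need to be applied before the restore is started, so that the mapping is in place when Velero recreates the PVCs.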

Volumes are being provisioned using https://github.com/hetznercloud/csi-driver.

Restoring volumes with volumeBindingMode: Immediate works well.

Environment:

  • Velero version (use velero version): Tried 1.4.2 and 1.5.1 with the same result
  • Velero features (use velero client config get features): features: <NOT SET>
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"archive", BuildDate:"2020-09-18T18:46:38Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:32:58Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: Flexkube v0.4.3
  • Cloud provider or hardware configuration: hcloud
  • OS (e.g. from /etc/os-release): Flatcar stable

Labels: Bug, Restic, Restic - GA, Reviewed Q2 2021

All 2 comments

Thanks for this report. I think I see the issue, and it's an order-of-operations problem: Velero tries to recreate the PV, PVC, and Pod (in that order), but with a WaitForFirstConsumer binding mode this isn't sufficient.

I'm going to log this as a high priority bug, because it's not a unique use case, but I don't have an answer for it at the moment.

We've been looking at this as well for other use cases. A long-term solution would be to use the proposed Data Populators (https://github.com/kubernetes/enhancements/issues/1495), but this will require changes in how Restic is handled.
