Velero: Cross Cluster Backup and Restore problem with Rook PVs

Created on 15 Jun 2018 · 14 comments · Source: vmware-tanzu/velero

Following on from the conversation on Slack (here, here, here and here):

The 10,000-foot view: I have a cluster with Rook volumes in a couple of namespaces. I needed to update the cluster, so I created a new cluster beside it with the appropriate changes (AWS-deployed, self-managed, not EKS), and now I need to migrate the Rook volumes and state in two namespaces to the new cluster.

Ark works wonderfully for everything except the Rook PVs. And while I'm at it, let me thank you for creating Ark. I spent a few days researching ways to backup and restore cluster state, and for me at least, Ark is a clear winner.

What I tried:

  • The Ark Rook plugin.
    This appeared to work (there were no errors), but PV data wasn't restored. On closer inspection, the data was never persisted to S3; snapshots were only created on the rook-api pod. I don't currently have a way of getting the snapshots into and out of S3.
  • @ncdc then pointed me to the new restic daemonset in 0.9.0-alpha.2.
    I gave this a shot, and it works fine (I've got a backup; will check restore today) for PVs attached to active pods. The trip-up came when I discovered that one of our deployments maintains state in PVs for transient pods. The pods come and go as people log in to and out of an application (JupyterHub), so at any given time none of the PVs may be attached to a pod, making backup with restic impossible.
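For context, the restic integration in the 0.9 alphas is opted into per pod via the backup.ark.heptio.com/backup-volumes annotation, which lists the volume names to back up. A minimal sketch (pod name, image, and claim name are all hypothetical, purely for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod                 # hypothetical pod name
  annotations:
    # comma-separated names of volumes (as declared in spec.volumes)
    # that Ark's restic daemonset should back up
    backup.ark.heptio.com/backup-volumes: data
spec:
  containers:
    - name: app
      image: nginx                 # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-rook-pvc     # hypothetical Rook PVC
```

This is also why unattached PVs are a problem: the annotation lives on the pod, so a volume with no running pod has nothing to hang the annotation on.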

And that's where I'm at.

Looking around, I discovered backy2, so perhaps that, in conjunction with the Ark Rook plugin, may be successful. I'll let you know if I have any luck. Any other suggestions are gratefully welcomed.

Bug P1 - Important

Most helpful comment

@pms1969 I was able to reproduce this issue and I see what the problem is. During restore, Ark is not properly waiting for the PV/PVC to be created and mounted before attempting to restore the contents of the volume using restic. I'll start working on a fix for this and hope to get it out to you ASAP. Thanks for testing and reporting!!

All 14 comments

Following up. The restore with restic failed:

$ ark restore describe prod-tools4-20180615165209
Name:         prod-tools4-20180615165209
Namespace:    heptio-ark
Labels:       <none>
Annotations:  <none>
Backup:  prod-tools4
Namespaces:
  Included:  *
  Excluded:  <none>
Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io
  Cluster-scoped:  auto
Namespace mappings:  <none>
Label selector:  <none>
Restore PVs:  auto
Phase:  Completed
Validation errors:  <none>
Warnings:
  Ark:        <none>
  Cluster:  not restored: clusterinformations.crd.projectcalico.org "default" already exists and is different from backed up version.
            not restored: clusterrolebindings.rbac.authorization.k8s.io "cert-manager-certs" already exists and is different from backed up version.
            not restored: clusterroles.rbac.authorization.k8s.io "cert-manager-certs" already exists and is different from backed up version.
            not restored: clusterroles.rbac.authorization.k8s.io "prometheus" already exists and is different from backed up version.
            not restored: customresourcedefinitions.apiextensions.k8s.io "certificates.certmanager.k8s.io" already exists and is different from backed up version.
            not restored: customresourcedefinitions.apiextensions.k8s.io "clusterissuers.certmanager.k8s.io" already exists and is different from backed up version.
            not restored: customresourcedefinitions.apiextensions.k8s.io "issuers.certmanager.k8s.io" already exists and is different from backed up version.
            not restored: felixconfigurations.crd.projectcalico.org "default" already exists and is different from backed up version.
            not restored: ippools.crd.projectcalico.org "default-ipv4-ippool" already exists and is different from backed up version.
  Namespaces:
    prod-tools:  not restored: serviceaccounts "default" already exists and is different from backed up version.
Errors:
  Ark:      pod volume restore failed: error restoring volume: error identifying path of volume: expected one matching path, got 0
  Cluster:    <none>
  Namespaces: <none>

The error I get from the dashboard is:

MountVolume.SetUp failed for volume "pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume replicapool/pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2: failed to map image replicapool/pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2 cluster rook-system. failed to map image replicapool/pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2: Failed to complete 'rbd': exit status 2. . output: rbd: sysfs write failed In some cases useful info is found in syslog - try "dmesg | tail". rbd: map failed: (2) No such file or directory
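That "rbd: map failed: (2) No such file or directory" suggests the kernel client can't find the image in the pool, i.e. the restore may never have recreated the RBD image. A way to sanity-check this, assuming access to the Rook toolbox pod (pod and namespace names are illustrative and may differ per install), would be:

```
# From the Rook toolbox, list the images in the pool and inspect the one
# the PV points at -- if it's missing, the image was never recreated:
kubectl -n rook exec -it rook-tools -- rbd ls replicapool
kubectl -n rook exec -it rook-tools -- \
  rbd info replicapool/pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2

# On the node that failed the mount, the kernel log often has more detail:
dmesg | tail
```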

The restic log looks like this:

time="2018-06-15T15:29:42Z" level=info msg="Setting log-level to INFO"
time="2018-06-15T15:29:42Z" level=info msg="Starting Ark restic server v0.9.0-alpha.2" logSource="pkg/cmd/cli/restic/server.go:42"
time="2018-06-15T15:29:42Z" level=info msg="Starting controllers" logSource="pkg/cmd/cli/restic/server.go:112"
time="2018-06-15T15:29:42Z" level=info msg="Controllers started successfully" logSource="pkg/cmd/cli/restic/server.go:150"
time="2018-06-15T15:29:42Z" level=info msg="Starting controller" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:77"
time="2018-06-15T15:29:42Z" level=info msg="Waiting for caches to sync" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:80"
time="2018-06-15T15:29:42Z" level=info msg="Starting controller" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:77"
time="2018-06-15T15:29:42Z" level=info msg="Waiting for caches to sync" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:80"
time="2018-06-15T15:29:42Z" level=info msg="Caches are synced" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:84"
time="2018-06-15T15:29:42Z" level=info msg="Caches are synced" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:84"
time="2018-06-15T15:52:17Z" level=error msg="Unable to get item's pod prod-tools/mongo-mongodb-68bfb98d5c-wbl58, not enqueueing." controller=pod-volume-restore error="pod \"mongo-mongodb-68bfb98d5c-wbl58\" not found" key=heptio-ark/prod-tools4-20180615165209-7t9g5 logSource="pkg/controller/pod_volume_restore_controller.go:194"

Not quite what I expected. :slightly_frowning_face:

Nonetheless, I'm going to persist in trying to get something working. :/

@skriss could you please assist?

@pms1969 could you provide the output of ark restore logs prod-tools4-20180615165209? Thanks!

@skriss Will do. In the middle of a production problem right now, but will get it for you as soon as I'm done with it.

ark-restore.log
@skriss log attached. Not that it reveals much.

Are any of the volumes you're trying to back up with restic hostPath?


@ncdc not sure what that means :(

There are a few, but the one in question is created via the MongoDB Helm chart.

My relevant values for the PVC are:

persistence:
  enabled: true
  storageClass: "rook-block"
  accessMode: ReadWriteOnce
  size: 8Gi

Ok, so the volumes you'd specified in the backup.ark.heptio.com/backup-volumes pod annotation are all Rook PVCs?

Yes

@pms1969 I was able to reproduce this issue and I see what the problem is. During restore, Ark is not properly waiting for the PV/PVC to be created and mounted before attempting to restore the contents of the volume using restic. I'll start working on a fix for this and hope to get it out to you ASAP. Thanks for testing and reporting!!

@skriss no, no.. Thank you.

@skriss adding to the v0.9.0 milestone, please let me know if that doesn't seem right to you.

@rosskukulinski Yes. There were actually a few issues at play here, but all should be resolved with the next alpha/beta.

The issues that were at play here should all be resolved now in master, so I'm going to close this issue out, but feel free to reopen or open a new one as needed. We should be putting out a new tagged alpha/beta shortly for testing!
