Velero: Cross Cluster Backup and Restore problem with Rook PVs

Created on 15 Jun 2018 · 14 comments · Source: vmware-tanzu/velero

Following on from the conversation on Slack (here, here, here and here):

The 10,000-foot view: I have a cluster with Rook volumes in a couple of namespaces. I needed to update the cluster, so I created a new cluster beside it with the appropriate changes (AWS-deployed, self-managed, not EKS), and now I need to migrate the Rook volumes and state in two namespaces to the new cluster.

Ark works wonderfully for everything except the Rook PVs. And while I'm at it, let me thank you for creating Ark. I spent a few days researching ways to backup and restore cluster state, and for me at least, Ark is a clear winner.

What I tried:

  • The Ark Rook plugin.
    This appeared to work (there were no errors), but PV data wasn't restored. On closer inspection, the data was never persisted to S3; snapshots were only created on the rook-api pod. I don't currently have a way of getting the snapshots into and out of S3.
  • @ncdc then pointed me to the new restic daemonset in 0.9.0-alpha.2.
    I gave this a shot, and it works fine (I've got a backup; will check restore today) for PVs attached to active pods. The trip-up came when I discovered that one of our deployments maintains state in PVs for transient pods. The pods come and go as people log in to and out of an application (JupyterHub), so at any given time none of the PVs may be attached to a pod, making backup with restic impossible.
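For context, the restic integration in the 0.9 alphas is opted into per pod via the backup.ark.heptio.com/backup-volumes annotation, which lists the volume names to back up. A minimal sketch (pod name, image, and claim name are all hypothetical, purely for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod                 # hypothetical pod name
  annotations:
    # comma-separated names of volumes (as declared in spec.volumes)
    # that Ark's restic daemonset should back up
    backup.ark.heptio.com/backup-volumes: data
spec:
  containers:
    - name: app
      image: nginx                 # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-rook-pvc     # hypothetical Rook PVC
```

This is also why unattached PVs are a problem: the annotation lives on the pod, so a volume with no running pod has nothing to hang the annotation on.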

And that's where I'm at.

Looking around, I discovered backy2, so perhaps that, in conjunction with the Ark Rook plugin, may be successful. I'll let you know if I have any luck. Any other suggestions are gratefully welcomed.

Bug P1 - Important

Most helpful comment

@pms1969 I was able to reproduce this issue and I see what the problem is. During restore, Ark is not properly waiting for the PV/PVC to be created and mounted before attempting to restore the contents of the volume using restic. I'll start working on a fix for this and hope to get it out to you ASAP. Thanks for testing and reporting!!

All 14 comments

Following up. The restore with restic failed:

$ ark restore describe prod-tools4-20180615165209
Name:         prod-tools4-20180615165209
Namespace:    heptio-ark
Labels:       <none>
Annotations:  <none>
Backup:  prod-tools4
Namespaces:
  Included:  *
  Excluded:  <none>
Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io
  Cluster-scoped:  auto
Namespace mappings:  <none>
Label selector:  <none>
Restore PVs:  auto
Phase:  Completed
Validation errors:  <none>
Warnings:
  Ark:        <none>
  Cluster:  not restored: clusterinformations.crd.projectcalico.org "default" already exists and is different from backed up version.
            not restored: clusterrolebindings.rbac.authorization.k8s.io "cert-manager-certs" already exists and is different from backed up version.
            not restored: clusterroles.rbac.authorization.k8s.io "cert-manager-certs" already exists and is different from backed up version.
            not restored: clusterroles.rbac.authorization.k8s.io "prometheus" already exists and is different from backed up version.
            not restored: customresourcedefinitions.apiextensions.k8s.io "certificates.certmanager.k8s.io" already exists and is different from backed up version.
            not restored: customresourcedefinitions.apiextensions.k8s.io "clusterissuers.certmanager.k8s.io" already exists and is different from backed up version.
            not restored: customresourcedefinitions.apiextensions.k8s.io "issuers.certmanager.k8s.io" already exists and is different from backed up version.
            not restored: felixconfigurations.crd.projectcalico.org "default" already exists and is different from backed up version.
            not restored: ippools.crd.projectcalico.org "default-ipv4-ippool" already exists and is different from backed up version.
  Namespaces:
    prod-tools:  not restored: serviceaccounts "default" already exists and is different from backed up version.
Errors:
  Ark:      pod volume restore failed: error restoring volume: error identifying path of volume: expected one matching path, got 0
  Cluster:    <none>
  Namespaces: <none>

The error I get from the dashboard is:

MountVolume.SetUp failed for volume "pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume replicapool/pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2: failed to map image replicapool/pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2 cluster rook-system. failed to map image replicapool/pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2: Failed to complete 'rbd': exit status 2. . output: rbd: sysfs write failed In some cases useful info is found in syslog - try "dmesg | tail". rbd: map failed: (2) No such file or directory
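That "rbd: map failed: (2) No such file or directory" suggests the kernel client can't find the image in the pool, i.e. the restore may never have recreated the RBD image. A way to sanity-check this, assuming access to the Rook toolbox pod (pod and namespace names are illustrative and may differ per install), would be:

```
# From the Rook toolbox, list the images in the pool and inspect the one
# the PV points at -- if it's missing, the image was never recreated:
kubectl -n rook exec -it rook-tools -- rbd ls replicapool
kubectl -n rook exec -it rook-tools -- \
  rbd info replicapool/pvc-5377f68a-4fb0-11e8-aa2e-0690e07debf2

# On the node that failed the mount, the kernel log often has more detail:
dmesg | tail
```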

The restic log looks like this:

time="2018-06-15T15:29:42Z" level=info msg="Setting log-level to INFO"
time="2018-06-15T15:29:42Z" level=info msg="Starting Ark restic server v0.9.0-alpha.2" logSource="pkg/cmd/cli/restic/server.go:42"
time="2018-06-15T15:29:42Z" level=info msg="Starting controllers" logSource="pkg/cmd/cli/restic/server.go:112"
time="2018-06-15T15:29:42Z" level=info msg="Controllers started successfully" logSource="pkg/cmd/cli/restic/server.go:150"
time="2018-06-15T15:29:42Z" level=info msg="Starting controller" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:77"
time="2018-06-15T15:29:42Z" level=info msg="Waiting for caches to sync" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:80"
time="2018-06-15T15:29:42Z" level=info msg="Starting controller" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:77"
time="2018-06-15T15:29:42Z" level=info msg="Waiting for caches to sync" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:80"
time="2018-06-15T15:29:42Z" level=info msg="Caches are synced" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:84"
time="2018-06-15T15:29:42Z" level=info msg="Caches are synced" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:84"
time="2018-06-15T15:52:17Z" level=error msg="Unable to get item's pod prod-tools/mongo-mongodb-68bfb98d5c-wbl58, not enqueueing." controller=pod-volume-restore error="pod \"mongo-mongodb-68bfb98d5c-wbl58\" not found" key=heptio-ark/prod-tools4-20180615165209-7t9g5 logSource="pkg/controller/pod_volume_restore_controller.go:194"

Not quite what I expected. :slightly_frowning_face:

Nonetheless, I'm going to persist in trying to get something working. :/

@skriss could you please assist?

@pms1969 could you provide the output of ark restore logs prod-tools4-20180615165209? Thanks!

@skriss Will do. In the middle of a production problem right now, but will get it for you as soon as I'm done with it.

ark-restore.log
@skriss log attached. Not that it reveals much.

Are any of the volumes you're trying to back up with restic hostPath?


@ncdc not sure what that means :(

There are a few, but the one in question is created via the MongoDB Helm chart.

My relevant values for the PVC are:

persistence:
  enabled: true
  storageClass: "rook-block"
  accessMode: ReadWriteOnce
  size: 8Gi

Ok, so the volumes you'd specified in the backup.ark.heptio.com/backup-volumes pod annotation are all Rook PVCs?

Yes

@pms1969 I was able to reproduce this issue and I see what the problem is. During restore, Ark is not properly waiting for the PV/PVC to be created and mounted before attempting to restore the contents of the volume using restic. I'll start working on a fix for this and hope to get it out to you ASAP. Thanks for testing and reporting!!

@skriss no, no.. Thank you.

@skriss adding to the v0.9.0 milestone, please let me know if that doesn't seem right to you.

@rosskukulinski Yes. There were actually a few issues at play here, but all should be resolved with the next alpha/beta.

The issues that were at play here should all be resolved now in master, so I'm going to close this issue out, but feel free to reopen or open a new one as needed. We should be putting out a new tagged alpha/beta shortly for testing!
