What steps did you take and what happened:
I'm trying to restore a restic volume.
My backup contains 2 volumes belonging to 2 deployments.
Backup
tools-bitbucket-backup-prv2 Completed 2019-10-18 08:45:44 +0200 CEST 29d cluster-tools <none>
Persistent Volumes: <none included>
Restic Backups:
Completed:
ok101-bitbucket-pr/bitbucket-postgresql-5-vvwsm: bitbucket-postgresql-data
ok101-bitbucket-pr/bitbucket-server-14-ddrpp: bitbucket-server-data
When I restore from this backup, the Postgres pod is restored properly, but the Bitbucket server is not:
tools-bitbucket-backup-prv2-20191018090027 tools-bitbucket-backup-prv2 InProgress 0 0 2019-10-18 09:00:27 +0200 CEST <none>
velero restore describe tools-bitbucket-backup-prv2-20191018090027
Restic Restores:
Completed:
ok101-bitbucket-pr/bitbucket-postgresql-5-vvwsm: bitbucket-postgresql-data
New:
ok101-bitbucket-pr/bitbucket-server-14-ddrpp: bitbucket-server-data
The init container is not created on the bitbucket-server pod, so the restic restore stays stuck in the "New" phase, but the pod itself is created and running. It shouldn't be.
kubectl get po
NAME READY STATUS RESTARTS AGE
bitbucket-server-15-glk7c 1/1 Running 0 5m
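A quick way to confirm that no restic-wait init container was injected (a sketch, using the pod name from the output above):

```shell
# List the init containers on the restored pod; Velero's restore helper
# should appear as "restic-wait" when the injection worked.
kubectl -n ok101-bitbucket-pr get pod bitbucket-server-15-glk7c \
  -o jsonpath='{.spec.initContainers[*].name}'
# Empty output means no restic-wait init container was injected.
```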
**Restic log**
time="2019-10-18T07:00:48Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok24-velero/velero-7f5d784896-l66m7 logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:00:48Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok24-velero/velero-7f5d784896-l66m7 logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:00:59Z" level=debug msg="Restore's pod ok101-bitbucket-pr/bitbucket-postgresql-5-vvwsm not found, not enqueueing." controller=pod-volume-restore error="pod \"bitbucket-postgresql-5-vvwsm\" not found" logSource="pkg/controller/pod_volume_restore_controller.go:137" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:00:59Z" level=debug msg="Restore's pod ok101-bitbucket-pr/bitbucket-server-14-ddrpp not found, not enqueueing." controller=pod-volume-restore error="pod \"bitbucket-server-14-ddrpp\" not found" logSource="pkg/controller/pod_volume_restore_controller.go:137" name=tools-bitbucket-backup-prv2-20191018090027-gghc8 namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:01:01Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:01Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:03Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:09Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-glk7c logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:09Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-glk7c logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:11Z" level=debug msg="Restore is not new, not enqueuing" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:131" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:01:12Z" level=debug msg="Restore is not new, not enqueuing" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:131" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:01:13Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-glk7c logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:14Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:14Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:19Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok24-velero/velero-7f5d784896-l66m7 logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:20Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok24-velero/velero-7f5d784896-l66m7 logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:22Z" level=debug msg="Restore is not new, not enqueuing" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:131" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:01:31Z" level=debug msg="Restore is not new, not enqueuing" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:131" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
W1018 07:05:36.261218 1 reflector.go:302] github.com/vmware-tanzu/velero/pkg/cmd/cli/restic/server.go:197: watch of *v1.Secret ended with: The resourceVersion for the provided watch is too old.
W1018 07:05:52.312131 1 reflector.go:302] github.com/vmware-tanzu/velero/pkg/generated/informers/externalversions/factory.go:117: watch of *v1.PodVolumeBackup ended with: The resourceVersion for the provided watch is too old.
**Velero Log**
https://gist.github.com/Stolr/02ee7e4ee7d662b94df52de93f953ab3
**PodVolumeRestore**
kubectl -n ok24-velero get podvolumerestores -l velero.io/restore-name=tools-bitbucket-backup-prv2-20191018090027 -o yaml
apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: PodVolumeRestore
  metadata:
    creationTimestamp: 2019-10-18T07:00:59Z
    generateName: tools-bitbucket-backup-prv2-20191018090027-
    generation: 1
    labels:
      velero.io/pod-uid: 0a33afb5-f175-11e9-967b-005056b9b6b7
      velero.io/restore-name: tools-bitbucket-backup-prv2-20191018090027
      velero.io/restore-uid: f7012bc8-f174-11e9-bf99-005056b9c7f4
    name: tools-bitbucket-backup-prv2-20191018090027-6dkgf
    namespace: ok24-velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: tools-bitbucket-backup-prv2-20191018090027
      uid: f7012bc8-f174-11e9-bf99-005056b9c7f4
    resourceVersion: "853596"
    selfLink: /apis/velero.io/v1/namespaces/ok24-velero/podvolumerestores/tools-bitbucket-backup-prv2-20191018090027-6dkgf
    uid: 0a35ffec-f175-11e9-967b-005056b9b6b7
  spec:
    backupStorageLocation: cluster-tools
    pod:
      kind: Pod
      name: bitbucket-postgresql-5-vvwsm
      namespace: ok101-bitbucket-pr
      uid: 0a33afb5-f175-11e9-967b-005056b9b6b7
    repoIdentifier: s3:http://oca-miniolb.oca.local/velero/tools/restic/ok101-bitbucket-pr
    snapshotID: 4bd49d6e
    volume: bitbucket-postgresql-data
  status:
    completionTimestamp: 2019-10-18T07:01:31Z
    message: ""
    phase: Completed
    progress:
      bytesDone: 83536468
      totalBytes: 83536468
    startTimestamp: 2019-10-18T07:01:10Z
- apiVersion: velero.io/v1
  kind: PodVolumeRestore
  metadata:
    creationTimestamp: 2019-10-18T07:00:59Z
    generateName: tools-bitbucket-backup-prv2-20191018090027-
    generation: 1
    labels:
      velero.io/pod-uid: 0a3b5638-f175-11e9-967b-005056b9b6b7
      velero.io/restore-name: tools-bitbucket-backup-prv2-20191018090027
      velero.io/restore-uid: f7012bc8-f174-11e9-bf99-005056b9c7f4
    name: tools-bitbucket-backup-prv2-20191018090027-gghc8
    namespace: ok24-velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: tools-bitbucket-backup-prv2-20191018090027
      uid: f7012bc8-f174-11e9-bf99-005056b9c7f4
    resourceVersion: "853188"
    selfLink: /apis/velero.io/v1/namespaces/ok24-velero/podvolumerestores/tools-bitbucket-backup-prv2-20191018090027-gghc8
    uid: 0a3d1615-f175-11e9-967b-005056b9b6b7
  spec:
    backupStorageLocation: cluster-tools
    pod:
      kind: Pod
      name: bitbucket-server-14-ddrpp
      namespace: ok101-bitbucket-pr
      uid: 0a3b5638-f175-11e9-967b-005056b9b6b7
    repoIdentifier: s3:http://oca-miniolb.oca.local/velero/tools/restic/ok101-bitbucket-pr
    snapshotID: e5a5986e
    volume: bitbucket-server-data
  status:
    completionTimestamp: null
    message: ""
    phase: ""
    startTimestamp: null
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
**Environment**
velero version
Client:
Version: v1.1.0
Git commit: a357f21
Server:
Version: v1.1.0
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
openshift v3.11.0+bdd37ad-314
kubernetes v1.11.0+d4cacc0
The namespace does not exist before the restore, so every resource is new on the cluster.
Any ideas?
Thanks a lot.
hmm, based on the following lines:
W1018 07:05:36.261218 1 reflector.go:302] github.com/vmware-tanzu/velero/pkg/cmd/cli/restic/server.go:197: watch of *v1.Secret ended with: The resourceVersion for the provided watch is too old.
W1018 07:05:52.312131 1 reflector.go:302] github.com/vmware-tanzu/velero/pkg/generated/informers/externalversions/factory.go:117: watch of *v1.PodVolumeBackup ended with: The resourceVersion for the provided watch is too old.
it looks like there might be an issue with the informer caches.
Could you try deleting all of the restic daemonset pods, letting them get re-created, and then trying another restore? (you'll want to delete the target namespace as well before kicking off the new restore)
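Something along these lines should do it (a sketch; it assumes the `name=restic` label the default daemonset ships with, and the namespace should be wherever your restic daemonset runs, e.g. `ok24-velero` in your case):

```shell
# Delete the restic daemonset pods so they get re-created with fresh
# informer caches, then wait for the replacements to be ready.
kubectl -n velero delete pods -l name=restic
kubectl -n velero rollout status daemonset/restic
```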
Hi @skriss
Thanks for the answer.
I already tried this.
I'm restoring onto another cluster. It's a fresh one, so there should not be any cache, right?
Could restoring to another cluster be the issue?
I'm not able to try it until Monday, but I'm not sure it will fix the issue since I already tried on a fresh instance.
Any other ideas? :)
Hey,
So I made a fresh new install to test this and make sure it's not a cache issue.
Here is the whole procedure to help you debug (I install the restic DaemonSet before Velero because I adapted it for OKD; maybe that is the issue).
Installation on Cluster Tools && Cluster Tools-B:
kubectl create ns velero
namespace/velero created
oc annotate namespace velero openshift.io/node-selector=""
namespace/velero annotated
oc adm policy add-scc-to-user privileged system:serviceaccount:velero:velero
scc "privileged" added to: ["system:serviceaccount:velero:velero"]
oc apply -f serviceAccount.yaml
serviceaccount/velero created
kubectl apply -f daemonSetrestic.yaml
daemonset.extensions/restic created
velero install \
--provider aws \
--bucket velero \
--use-restic \
--secret-file ./credentials-velero \
--use-volume-snapshots=false \
--backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://oca-miniolb.oca.local/
CustomResourceDefinition/schedules.velero.io: attempting to create resource
CustomResourceDefinition/schedules.velero.io: created
CustomResourceDefinition/deletebackuprequests.velero.io: attempting to create resource
CustomResourceDefinition/deletebackuprequests.velero.io: created
CustomResourceDefinition/podvolumerestores.velero.io: attempting to create resource
CustomResourceDefinition/podvolumerestores.velero.io: created
CustomResourceDefinition/volumesnapshotlocations.velero.io: attempting to create resource
CustomResourceDefinition/volumesnapshotlocations.velero.io: created
CustomResourceDefinition/backups.velero.io: attempting to create resource
CustomResourceDefinition/backups.velero.io: created
CustomResourceDefinition/downloadrequests.velero.io: attempting to create resource
CustomResourceDefinition/downloadrequests.velero.io: created
CustomResourceDefinition/podvolumebackups.velero.io: attempting to create resource
CustomResourceDefinition/podvolumebackups.velero.io: created
CustomResourceDefinition/resticrepositories.velero.io: attempting to create resource
CustomResourceDefinition/resticrepositories.velero.io: created
CustomResourceDefinition/backupstoragelocations.velero.io: attempting to create resource
CustomResourceDefinition/backupstoragelocations.velero.io: created
CustomResourceDefinition/serverstatusrequests.velero.io: attempting to create resource
CustomResourceDefinition/serverstatusrequests.velero.io: created
CustomResourceDefinition/restores.velero.io: attempting to create resource
CustomResourceDefinition/restores.velero.io: created
Waiting for resources to be ready in cluster...
Namespace/velero: attempting to create resource
Namespace/velero: already exists, proceeding
Namespace/velero: created
ClusterRoleBinding/velero: attempting to create resource
ClusterRoleBinding/velero: created
ServiceAccount/velero: attempting to create resource
ServiceAccount/velero: already exists, proceeding
ServiceAccount/velero: created
Secret/cloud-credentials: attempting to create resource
Secret/cloud-credentials: created
BackupStorageLocation/default: attempting to create resource
BackupStorageLocation/default: created
Deployment/velero: attempting to create resource
Deployment/velero: created
DaemonSet/restic: attempting to create resource
DaemonSet/restic: already exists, proceeding
DaemonSet/restic: created
Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.
Cluster Tools Backup location
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: 2019-10-21T06:14:03Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "48443311"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: fb1ae7ac-f3c9-11e9-843a-005056b9cf2b
spec:
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://oca-miniolb.oca.local/
  objectStorage:
    bucket: velero
    prefix: "tools"
  provider: aws
Cluster Tools-B Backup location
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: 2019-10-21T06:16:31Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "1743001"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 53050648-f3ca-11e9-8991-005056b92845
spec:
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://oca-miniolb.oca.local/
  objectStorage:
    bucket: velero
    prefix: "tools-b"
  provider: aws
velero backup-location create cluster-tools \
--provider aws \
--bucket velero \
--access-mode ReadOnly \
--config region=minio,s3ForcePathStyle="true",s3Url=http://oca-miniolb.oca.local/
Then I edited the BackupStorageLocation cluster-tools to add the "tools" prefix.
Now everything is running fine:
kubectl get po
NAME READY STATUS RESTARTS AGE
restic-2s7n8 1/1 Running 0 6m
restic-2zcr7 1/1 Running 0 6m
restic-7wbrt 1/1 Running 0 6m
restic-8zn8n 1/1 Running 0 6m
restic-cfq77 1/1 Running 0 6m
restic-djvrj 1/1 Running 0 6m
restic-kpnvt 1/1 Running 0 6m
restic-n58w2 1/1 Running 0 6m
restic-n5c6x 1/1 Running 0 6m
restic-ssvpp 1/1 Running 0 6m
restic-wjxj7 1/1 Running 0 6m
restic-wsj94 1/1 Running 0 6m
restic-xgxjt 1/1 Running 0 6m
restic-zvfvj 1/1 Running 0 6m
velero-df87fbb89-m2tbh 1/1 Running 2 6m
**Cluster Tools:**
Creating the backup
kubectl -n ok101-bitbucket-pr annotate pod/bitbucket-postgresql-5-vvwsm backup.velero.io/backup-volumes=bitbucket-postgresql-data
kubectl -n ok101-bitbucket-pr annotate pod/bitbucket-server-14-ddrpp backup.velero.io/backup-volumes=bitbucket-server-data
velero backup create tools-bitbucket-backup --include-namespaces=ok101-bitbucket-pr
Backup get
velero backup get
NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
tools-bitbucket-backup PartiallyFailed (2 errors) 2019-10-21 08:23:26 +0200 CEST 29d default <none>
Velero Logs:
https://gist.github.com/Stolr/23d0dd11b301150ccb336a12b77107a1
**Backup description**
velero backup describe tools-bitbucket-backup --details
https://gist.github.com/Stolr/9b862178df8f951cbd9b50357bd502c8
**Backup logs**
velero backup logs tools-bitbucket-backup
https://gist.github.com/Stolr/293051c52536541fec55f924f76386be
I can see there are 2 errors, but it says my restic backups are completed, so they should not be relevant. They are probably caused by some incorrect pods in that namespace. The first time, I didn't have those errors, but the restic issue was still there.
**Now, on Cluster Tools-B**
velero backup get
NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
tools-bitbucket-backup PartiallyFailed (2 errors) 2019-10-21 08:23:26 +0200 CEST 29d cluster-tools
velero restore create --include-namespaces=ok101-bitbucket-pr --from-backup tools-bitbucket-backup
Restore request "tools-bitbucket-backup-20191021083745" submitted successfully.
Run velero restore describe tools-bitbucket-backup-20191021083745 or velero restore logs tools-bitbucket-backup-20191021083745 for more details.
Same issue:
The restore stays InProgress because the restic volume is not restored.
velero restore get
NAME BACKUP STATUS WARNINGS ERRORS CREATED SELECTOR
tools-bitbucket-backup-20191021083745 tools-bitbucket-backup InProgress 0 0 2019-10-21 08:37:45 +0200 CEST
kubectl get po -n ok101-bitbucket-pr
NAME READY STATUS RESTARTS AGE
bitbucket-postgresql-5-vvwsm 0/1 PodInitializing 0 39s
bitbucket-pr-data-backup-1571351700-gfhf4 0/1 Pending 0 39s
bitbucket-pr-data-backup-1571438100-jb2dv 0/1 Pending 0 39s
bitbucket-pr-postgresql-backup-1571436000-rtsf7 0/1 Pending 0 39s
bitbucket-server-15-6528q 1/1 Running 0 34s
**Restic Logs**
https://gist.github.com/Stolr/dabac536a5235b87ecd184045ab2e7b5
**Velero Logs**
https://gist.github.com/Stolr/5dfc2f7fea9c63f0ddbd61d9276ac984
**Restore Logs**
Not available:
Logs for restore "tools-bitbucket-backup-20191021083745" are not available until it's finished processing. Please wait until the restore has a phase of Completed or Failed and try again.
**PodVolumeRestore**
kubectl -n velero get podvolumerestores -l velero.io/restore-name=tools-bitbucket-backup-20191021083745 -o yaml
apiVersion: v1
items:
My Bitbucket data is not restored, and no init container is created. But the Postgres one is working as expected.
Do you see anything in all these logs that could explain this?
Thanks for your help!
@Stolr i'm not exactly sure what's going on, but I do see that during the backup, the pod that's being backed up is bitbucket-server-14-ddrpp, and then during/after restore, you end up with pod bitbucket-server-15-6528q. I do see in the Velero server log that during the restore, pod bitbucket-server-14-ddrpp is restored, but it seems like it's probably being deleted and replaced with bitbucket-server-15-6528q.
I'm not super-familiar with OpenShift's deploymentconfigs and (apparently) their use of replication controllers, but in plain vanilla Kubernetes, the way this would work is we'd restore pod "14", then restore the replicaset controlling it, and that replicaset would see pod "14" and "adopt" it. It seems like something about the deploymentconfig/replicationcontroller is possibly preventing this "adoption" from happening, and triggering the creation of a new pod "15".
Does this ring any bells for you? Maybe we can figure it out together :)
@Stolr @skriss sorry to bump into conversation, just a thought. Instead of annotating the pod itself, can you try annotating the pod template spec of the parent controller, i.e.: Deployment or ReplicationController?
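For example, something like this (a sketch; the deployment and volume names are taken from this thread and would need adjusting):

```shell
# Put the restic annotation on the pod template spec, so any pod the
# controller (re-)creates carries it, instead of annotating only the
# currently running pod.
kubectl -n ok101-bitbucket-pr patch deployment bitbucket-server \
  --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes":"bitbucket-server-data"}}}}}'
```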
@skriss Wow thanks !!
You are right, the deployment number is not the same.
For some reason, OpenShift triggers a new deploy, probably because of all the resources being restored. There is no way to get back to revision 14, even with a rollback.
I'm also not super familiar with OpenShift.
I got rid of that DeploymentConfig (no point in using it) and adapted everything to use a normal Deployment.
Everything is working as expected using a Deployment.
@yashbhutwala: I might try this when I'm able to. Thanks for your answer.
Thanks to you both for your help. Since this issue is related to OpenShift, feel free to close or rename it.
Best Regards
Thanks again for helping me getting through this.
@sseago @dymurray do you guys have any thoughts on what's going on here? (https://github.com/vmware-tanzu/velero/issues/1981#issuecomment-544709044)
Off the top of my head, I'm not sure what's going on, although I haven't looked at the logs in detail yet. The redeployment of a new pod may well be affecting things here, since the new pod probably won't have the restic annotation.

For the work my group has been doing, we actually do a two-phase backup/restore, in part to eliminate as much complexity as possible from the environment restic is working in. We create a full backup without any restic annotations, and then a limited backup with just the PVs/PVCs and the pods which mount them, with the restic annotations. On restore, we first restore the restic backup (pods only; no deployments, deploymentconfigs, etc.) -- this is when the restic copies happen. Then those restored pods are deleted and we do the full restore (without restic annotations).

I don't know that all of this is necessary for a basic backup/restore -- in our case we're using it for app migration from one cluster to another, with the possibility of running the restic/PV migration more than once before the final migration. In any case, if you're restoring deploymentconfigs which then roll out new pods post-restore, that could definitely interfere with restic.

I don't know what the appropriate general-purpose answer is here -- our approach has been for a very specific migration use case. I wonder whether the same issue comes up with non-OpenShift resources: DaemonSets, Deployments, etc. Annotating the pod template spec, as suggested above (in addition to annotating the pod), may be the way to go here. I'm not sure whether it will resolve this issue completely or not, though.
To add to what Scott said: yes, we hit this same problem very early on. This problem extends beyond OCP-specific restores; my understanding is that any pod managed by another resource faces this risk.
If a pod is managed by another resource, the restic restore will generally fail, since both the pod and the managing resource are restored, which causes the initial pod (with the restic annotation) to be overwritten. I could have sworn there was an open issue on this, but I can't seem to find it right now.
@dymurray not sure if this covers all of what you say, but I opened an issue a month ago about a similar problem. See: #1919
> If a pod is managed by another resource the restic restore will generally fail since both the pod and the managing resource is restored which causes the initial pod (with the restic annotation) to be overwritten.
We haven't seen this, at least not with pods managed by replicasets/deployments. Per my comment (https://github.com/vmware-tanzu/velero/issues/1981#issuecomment-544709044), during a restic restore, we first restore the pod & trigger a restic restore, then restore the owning replicaset and deployment. The pod is successfully "adopted" by the replicaset, since the pod's spec matches the pod template spec from the replicaset.
If that behavior were different, then I agree it would likely cause problems with restic restores, which seems to be what we're seeing here. Can you shed any more light onto why the DeploymentConfig restore is triggering the creation of a new pod, rather than adopting the existing one?
From what I've seen with DeploymentConfigs they don't always trigger new pods, but sometimes they do. I believe they actually do (initially) adopt the restored pod, as expected, but if there's a ConfigChange trigger registered, then the restore event on the DeploymentConfig will sometimes trigger that if the restore process looks like a configuration change. Most of my experience here is in restoring resources to a different cluster than the backup came from, with some spec params modified by a plugin on restore ("image" references, for example, if the image is located in an in-cluster registry). The pod as restored will run for a short amount of time, but will terminate as soon as the ConfigChange triggered replacement is ready. Most recently, this week I've restored a couple DeploymentConfigs to the same cluster as the backup was run in, and in that case I did not see a replacement being created post-restore.
So I spent some time digging into this, and based on what I've learned I can say that yes, the method Velero currently takes with restic restores has its shortcomings. Currently, we are lucky that a deployment doesn't trigger a new generation of the pod in 99% of restore use cases. If you specifically trigger a redeploy during the restic restore, then things will break, as shown in #1919.
With deploymentconfigs, there are a number of triggers you can set which will cause a pod to be redeployed, but the bigger issue is that currently with DCs the pod is restored first with the restic annotation, and then later adopted by the DC controller and redeployed, wiping the annotation out. If a plugin is used to skip restoring a pod when it's managed by a DC, in conjunction with placing the annotation on the DC pod template spec, then the restic restore has a good chance of succeeding. But the broader concern that Kubernetes could trigger a new rollout for deployments and deploymentconfigs during restore is a larger problem that still needs to be solved.
Open to ideas on how to improve this. The data populator KEP that's making the rounds upstream may be relevant/useful, though AFAIK it's only for PVs, not arbitrary pod volumes.
Well, I had just the same problem! The restore completed with no errors in the logs, but the PV is completely empty! Sucks.
I wanted to restore only the PVC along with the PV itself, and did:
velero restore create --from-backup daily-20200528020046 --include-namespaces test-project --include-resources persistentvolumeclaims,persistentvolumes --restore-volumes=true
Completed, no errors. But there is no data at all.
I did not suspect that. Is there any way to make it work with restic?
I have a DeploymentConfig, but "replicas" is set to 0 and I removed ConfigChange from its triggers.
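For reference, removing the ConfigChange trigger can be done with something like this (the DeploymentConfig name here is a placeholder):

```shell
# Remove the ConfigChange trigger so a restore does not kick off a new
# rollout that would replace the restic-annotated pod.
oc -n test-project set triggers dc/my-app --from-config --remove
```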
Interestingly, I had tested this before, but only after removing the whole project, and then it was OK and the data was even there. So it only works when restoring whole projects? Is it not possible to restore just a volume?
I can confirm: I can only restore volumes by restoring a whole project, i.e. a whole namespace, and it must be empty.
You cannot restore volumes if objects like deployments already exist. You cannot restore a PVC together with its PV separately using restic.
So in my case, I needed to restore to a mapped temporary namespace, go there and scale everything down, then spin up a new pod just to attach the PV and rsync the data out of the volume to my host. Then I deleted the temporary namespace, ran the helper pod again in my original project, attached the PV there, and rsynced all the data back. Later I did a chown with the user ID of the container, removed the helper pod, and finally scaled the deployment back up. It worked and the data from the backup snapshot was there, but the whole process is very inconvenient and clumsy.
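A sketch of the helper-pod step from that workaround (the namespace, PVC name, and image are placeholders for whatever your setup uses):

```shell
# Run a throwaway pod that mounts the restored PVC, then copy the data
# out to the host once it's ready.
kubectl -n temp-restore run pv-helper --image=busybox --restart=Never \
  --overrides='
{
  "spec": {
    "containers": [{
      "name": "pv-helper",
      "image": "busybox",
      "command": ["sleep", "3600"],
      "volumeMounts": [{"name": "data", "mountPath": "/data"}]
    }],
    "volumes": [{"name": "data",
                 "persistentVolumeClaim": {"claimName": "restored-pvc"}}]
  }
}'
kubectl -n temp-restore wait --for=condition=Ready pod/pv-helper
kubectl -n temp-restore cp pv-helper:/data ./pv-data
```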
I'm facing this issue when restoring a backup of prometheus-operator. My restore tests were done in the same cluster where the backup lives, but in another namespace. The production application was still live in its own namespace.
My cluster is running in EKS, version 1.16.
There are three PVs that should be backed up: grafana, prometheus, and alertmanager. The Prometheus and Grafana PVs could be restored without problems, but the Alertmanager PV could not, because the alertmanager StatefulSet is dynamically created by an Alertmanager object (from the monitoring.coreos.com/v1 API). I can see in the Velero logs that it successfully restored the alertmanager pod and injected the restic-wait container into it. But when the Alertmanager object is restored, it creates the StatefulSet, which replaces the pod.
These are the Velero logs proving the restic-wait container was created on the alertmanager pod:
time="2020-08-18T11:23:39Z" level=info msg="Restoring resource 'pods' into namespace 'monitoring-restored'" logSource="pkg/restore/restore.go:702" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Getting client for /v1, Kind=Pod" logSource="pkg/restore/restore.go:746" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:964" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing AddPVCFromPodAction" cmd=/velero logSource="pkg/restore/add_pvc_from_pod_action.go:44" pluginName=velero restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Adding PVC monitoring/alertmanager-prometheus-operator-alertmanager-db-alertmanager-prometheus-operator-alertmanager-0 as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pvc_from_pod_action.go:58" pluginName=velero restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Skipping persistentvolumeclaims/monitoring-restored/alertmanager-prometheus-operator-alertmanager-db-alertmanager-prometheus-operator-alertmanager-0 because it's already been restored." logSource="pkg/restore/restore.go:844" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:964" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:964" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing ResticRestoreAction" cmd=/velero logSource="pkg/restore/restic_restore_action.go:69" pluginName=velero restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Restic backups for pod found" cmd=/velero logSource="pkg/restore/restic_restore_action.go:95" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Getting plugin config" cmd=/velero logSource="pkg/restore/restic_restore_action.go:99" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="No config found for plugin" cmd=/velero logSource="pkg/restore/restic_restore_action.go:160" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Using image \"velero/velero-restic-restore-helper:v1.4.2\"" cmd=/velero logSource="pkg/restore/restic_restore_action.go:106" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="No config found for plugin" cmd=/velero logSource="pkg/restore/restic_restore_action.go:195" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="No config found for plugin" cmd=/velero logSource="pkg/restore/restic_restore_action.go:206" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Done executing ResticRestoreAction" cmd=/velero logSource="pkg/restore/restic_restore_action.go:155" pluginName=velero restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Attempting to restore Pod: alertmanager-prometheus-operator-alertmanager-0" logSource="pkg/restore/restore.go:1070" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Acquiring lock" backupLocation=default logSource="pkg/restic/repository_ensurer.go:122" volumeNamespace=monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Acquired lock" backupLocation=default logSource="pkg/restic/repository_ensurer.go:131" volumeNamespace=monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Ready repository found" backupLocation=default logSource="pkg/restic/repository_ensurer.go:147" volumeNamespace=monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Released lock" backupLocation=default logSource="pkg/restic/repository_ensurer.go:128" volumeNamespace=monitoring
One second later, the Alertmanager object itself is restored:
time="2020-08-18T11:23:40Z" level=info msg="Restoring resource 'alertmanagers.monitoring.coreos.com' into namespace 'monitoring-restored'" logSource="pkg/restore/restore.go:702" restore=velero/monitoring
time="2020-08-18T11:23:40Z" level=info msg="Getting client for monitoring.coreos.com/v1, Kind=Alertmanager" logSource="pkg/restore/restore.go:746" restore=velero/monitoring
time="2020-08-18T11:23:40Z" level=info msg="Attempting to restore Alertmanager: prometheus-operator-alertmanager" logSource="pkg/restore/restore.go:1070" restore=velero/monitoring
This is the backup's content:
velero backup describe monitoring --details
Name: monitoring
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/source-cluster-k8s-gitversion=v1.16.8-eks-e16311
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=16+
Phase: Completed
Errors: 0
Warnings: 0
Namespaces:
Included: monitoring
Excluded: <none>
Resources:
Included: *
Excluded: certificates.cert-manager.io, certificaterequests.cert-manager.io, orders.acme.cert-manager.io
Cluster-scoped: auto
Label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: auto
TTL: 720h0m0s
Hooks: <none>
Backup Format Version: 1
Started: 2020-08-18 10:17:16 +0200 CEST
Completed: 2020-08-18 10:17:57 +0200 CEST
Expiration: 2020-09-17 10:17:16 +0200 CEST
Total items to be backed up: 234
Items backed up: 234
Resource List:
apiextensions.k8s.io/v1/CustomResourceDefinition:
- alertmanagers.monitoring.coreos.com
- prometheuses.monitoring.coreos.com
- prometheusrules.monitoring.coreos.com
- servicemonitors.monitoring.coreos.com
apps/v1/ControllerRevision:
- monitoring/alertmanager-prometheus-operator-alertmanager-54df75fb5b
- monitoring/prometheus-operator-prometheus-node-exporter-599f4fbbfd
- monitoring/prometheus-prometheus-operator-prometheus-6cbd9d8d8b
apps/v1/DaemonSet:
- monitoring/prometheus-operator-prometheus-node-exporter
apps/v1/Deployment:
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-kube-state-metrics
- monitoring/prometheus-operator-operator
apps/v1/ReplicaSet:
- monitoring/prometheus-operator-grafana-5986dbf74f
- monitoring/prometheus-operator-grafana-7ff4f8b97b
- monitoring/prometheus-operator-kube-state-metrics-6f8cc5ffd5
- monitoring/prometheus-operator-operator-fd978d8d7
apps/v1/StatefulSet:
- monitoring/alertmanager-prometheus-operator-alertmanager
- monitoring/prometheus-prometheus-operator-prometheus
extensions/v1beta1/Ingress:
- monitoring/prometheus-operator-alertmanager
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-prometheus
monitoring.coreos.com/v1/Alertmanager:
- monitoring/prometheus-operator-alertmanager
monitoring.coreos.com/v1/Prometheus:
- monitoring/prometheus-operator-prometheus
monitoring.coreos.com/v1/PrometheusRule:
- monitoring/prometheus-operator-alertmanager.rules
- monitoring/prometheus-operator-etcd
- monitoring/prometheus-operator-general.rules
- monitoring/prometheus-operator-k8s.rules
- monitoring/prometheus-operator-kube-apiserver-slos
- monitoring/prometheus-operator-kube-apiserver.rules
- monitoring/prometheus-operator-kube-prometheus-general.rules
- monitoring/prometheus-operator-kube-prometheus-node-recording.rules
- monitoring/prometheus-operator-kube-scheduler.rules
- monitoring/prometheus-operator-kube-state-metrics
- monitoring/prometheus-operator-kubelet.rules
- monitoring/prometheus-operator-kubernetes-apps
- monitoring/prometheus-operator-kubernetes-resources
- monitoring/prometheus-operator-kubernetes-storage
- monitoring/prometheus-operator-kubernetes-system
- monitoring/prometheus-operator-kubernetes-system-apiserver
- monitoring/prometheus-operator-kubernetes-system-controller-manager
- monitoring/prometheus-operator-kubernetes-system-kubelet
- monitoring/prometheus-operator-kubernetes-system-scheduler
- monitoring/prometheus-operator-node-exporter
- monitoring/prometheus-operator-node-exporter.rules
- monitoring/prometheus-operator-node-network
- monitoring/prometheus-operator-node.rules
- monitoring/prometheus-operator-prometheus
- monitoring/prometheus-operator-prometheus-operator
monitoring.coreos.com/v1/ServiceMonitor:
- monitoring/prometheus-operator-alertmanager
- monitoring/prometheus-operator-apiserver
- monitoring/prometheus-operator-coredns
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-kube-controller-manager
- monitoring/prometheus-operator-kube-etcd
- monitoring/prometheus-operator-kube-proxy
- monitoring/prometheus-operator-kube-scheduler
- monitoring/prometheus-operator-kube-state-metrics
- monitoring/prometheus-operator-kubelet
- monitoring/prometheus-operator-node-exporter
- monitoring/prometheus-operator-operator
- monitoring/prometheus-operator-prometheus
networking.k8s.io/v1beta1/Ingress:
- monitoring/prometheus-operator-alertmanager
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-prometheus
rbac.authorization.k8s.io/v1/ClusterRole:
- prometheus-operator-grafana-clusterrole
- prometheus-operator-kube-state-metrics
- prometheus-operator-operator
- prometheus-operator-operator-psp
- prometheus-operator-prometheus
- prometheus-operator-prometheus-psp
- psp-prometheus-operator-kube-state-metrics
- psp-prometheus-operator-prometheus-node-exporter
rbac.authorization.k8s.io/v1/ClusterRoleBinding:
- prometheus-operator-grafana-clusterrolebinding
- prometheus-operator-kube-state-metrics
- prometheus-operator-operator
- prometheus-operator-operator-psp
- prometheus-operator-prometheus
- prometheus-operator-prometheus-psp
- psp-prometheus-operator-kube-state-metrics
- psp-prometheus-operator-prometheus-node-exporter
rbac.authorization.k8s.io/v1/Role:
- monitoring/prometheus-operator-alertmanager
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-grafana-test
rbac.authorization.k8s.io/v1/RoleBinding:
- monitoring/prometheus-operator-alertmanager
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-grafana-test
v1/ConfigMap:
- monitoring/prometheus-operator-apiserver
- monitoring/prometheus-operator-cluster-total
- monitoring/prometheus-operator-controller-manager
- monitoring/prometheus-operator-etcd
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-grafana-config-dashboards
- monitoring/prometheus-operator-grafana-datasource
- monitoring/prometheus-operator-grafana-test
- monitoring/prometheus-operator-k8s-coredns
- monitoring/prometheus-operator-k8s-resources-cluster
- monitoring/prometheus-operator-k8s-resources-namespace
- monitoring/prometheus-operator-k8s-resources-node
- monitoring/prometheus-operator-k8s-resources-pod
- monitoring/prometheus-operator-k8s-resources-workload
- monitoring/prometheus-operator-k8s-resources-workloads-namespace
- monitoring/prometheus-operator-kubelet
- monitoring/prometheus-operator-namespace-by-pod
- monitoring/prometheus-operator-namespace-by-workload
- monitoring/prometheus-operator-node-cluster-rsrc-use
- monitoring/prometheus-operator-node-rsrc-use
- monitoring/prometheus-operator-nodes
- monitoring/prometheus-operator-persistentvolumesusage
- monitoring/prometheus-operator-pod-total
- monitoring/prometheus-operator-prometheus
- monitoring/prometheus-operator-proxy
- monitoring/prometheus-operator-scheduler
- monitoring/prometheus-operator-statefulset
- monitoring/prometheus-operator-workload-total
- monitoring/prometheus-prometheus-operator-prometheus-rulefiles-0
v1/Endpoints:
- monitoring/alertmanager-operated
- monitoring/prometheus-operated
- monitoring/prometheus-operator-alertmanager
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-kube-state-metrics
- monitoring/prometheus-operator-operator
- monitoring/prometheus-operator-prometheus
- monitoring/prometheus-operator-prometheus-node-exporter
v1/Event:
- monitoring/prometheus-operator-admission-create-ngxh5.162c4eac037b378f
- monitoring/prometheus-operator-admission-create-ngxh5.162c4eac3d7a4c20
- monitoring/prometheus-operator-admission-create-ngxh5.162c4eacfd856868
- monitoring/prometheus-operator-admission-create-ngxh5.162c4ead0a39ac70
- monitoring/prometheus-operator-admission-create-ngxh5.162c4ead13445eeb
- monitoring/prometheus-operator-admission-create-ngxh5.162c4ead713ac0dc
- monitoring/prometheus-operator-admission-create-ngxh5.162c4ead8cff268e
- monitoring/prometheus-operator-admission-create.162c4eac0309e0cb
- monitoring/prometheus-operator-admission-patch-4pt6r.162c4eb4cb068bca
- monitoring/prometheus-operator-admission-patch-4pt6r.162c4eb517441275
- monitoring/prometheus-operator-admission-patch-4pt6r.162c4eb51d3ac352
- monitoring/prometheus-operator-admission-patch-4pt6r.162c4eb52b6739be
- monitoring/prometheus-operator-admission-patch.162c4eb4ca533c92
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e619ce31070
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e637284176b
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e71b870b6b4
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7b2a4186a1
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7ba8d7beae
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7c23594737
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7c2b84195d
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7c36882b14
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7c67022081
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7e1d1be052
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7e2fa25cfc
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7e40871cd9
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7ec738b8ab
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7eca575457
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7ed9a8b480
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7ed9d7309c
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e826db5397b
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e82883b0412
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e82a018a2d3
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4eb25495f5c9
- monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4eb25496faf5
- monitoring/prometheus-operator-grafana-5986dbf74f-q7q88.162c4e6199f96154
- monitoring/prometheus-operator-grafana-5986dbf74f-q7q88.162c4e619a02fad3
- monitoring/prometheus-operator-grafana-5986dbf74f-q7q88.162c4e619a049a56
- monitoring/prometheus-operator-grafana-5986dbf74f.162c4e619c9d5316
- monitoring/prometheus-operator-grafana-5986dbf74f.162c4eb254704b48
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf32ce5cd2
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf7c0b856d
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf7f2b718e
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf874cd7a1
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf9133924c
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf9468b3d9
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf9decda56
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafce8b1887
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafd2252390
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafdbc8ec47
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafdc3af0c0
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafe8f543b5
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaff4fd17b2
- monitoring/prometheus-operator-grafana-7ff4f8b97b.162c4eaf31f3b2ba
- monitoring/prometheus-operator-grafana.162c4eaf3087e1e1
- monitoring/prometheus-operator-grafana.162c4eb253680d39
- monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e71bff6bc71
- monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e7210a514b8
- monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e735b1ee1b9
- monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e7405ed1b22
- monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e74199ddb10
- monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e7b19464cfd
- monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e7b1a45f166
- monitoring/prometheus-operator-prometheus-node-exporter.162c4e71bdefdbaf
- monitoring/prometheus-operator-prometheus-node-exporter.162c4e7b1a499523
v1/Namespace:
- monitoring
v1/PersistentVolume:
- pvc-502cf99f-99fb-4a83-abd9-2a15bcf2a30d
- pvc-7107894a-2ede-473e-9c24-2cb5a3f9d7f1
- pvc-e6d638c0-b4a8-4bcf-a9d1-1f66c387c7e9
v1/PersistentVolumeClaim:
- monitoring/alertmanager-prometheus-operator-alertmanager-db-alertmanager-prometheus-operator-alertmanager-0
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-prometheus-operator-prometheus-db-prometheus-prometheus-operator-prometheus-0
v1/Pod:
- monitoring/alertmanager-prometheus-operator-alertmanager-0
- monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk
- monitoring/prometheus-operator-kube-state-metrics-6f8cc5ffd5-47jbw
- monitoring/prometheus-operator-operator-fd978d8d7-cf956
- monitoring/prometheus-operator-prometheus-node-exporter-fxl7s
- monitoring/prometheus-prometheus-operator-prometheus-0
v1/Secret:
- monitoring/alertmanager-prometheus-operator-alertmanager
- monitoring/alertmanager.ict.navinfo.cloud-tls
- monitoring/default-token-vf8dm
- monitoring/grafana.ict.navinfo.cloud-tls
- monitoring/ict-admission
- monitoring/prometheus-operator-admission
- monitoring/prometheus-operator-alertmanager-token-jxljb
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-grafana-test-token-q5lsl
- monitoring/prometheus-operator-grafana-token-949ch
- monitoring/prometheus-operator-kube-state-metrics-token-9gsz5
- monitoring/prometheus-operator-operator-token-556vs
- monitoring/prometheus-operator-prometheus-node-exporter-token-9f545
- monitoring/prometheus-operator-prometheus-token-bxb9w
- monitoring/prometheus-prometheus-operator-prometheus
- monitoring/prometheus-prometheus-operator-prometheus-tls-assets
- monitoring/prometheus.ict.navinfo.cloud-tls
- monitoring/sh.helm.release.v1.prometheus-operator.v1
- monitoring/sh.helm.release.v1.prometheus-operator.v2
v1/Service:
- monitoring/alertmanager-operated
- monitoring/prometheus-operated
- monitoring/prometheus-operator-alertmanager
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-kube-state-metrics
- monitoring/prometheus-operator-operator
- monitoring/prometheus-operator-prometheus
- monitoring/prometheus-operator-prometheus-node-exporter
v1/ServiceAccount:
- monitoring/default
- monitoring/prometheus-operator-alertmanager
- monitoring/prometheus-operator-grafana
- monitoring/prometheus-operator-grafana-test
- monitoring/prometheus-operator-kube-state-metrics
- monitoring/prometheus-operator-operator
- monitoring/prometheus-operator-prometheus
- monitoring/prometheus-operator-prometheus-node-exporter
Velero-Native Snapshots: <none included>
Restic Backups:
Completed:
monitoring/alertmanager-prometheus-operator-alertmanager-0: alertmanager-prometheus-operator-alertmanager-db
monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk: storage
monitoring/prometheus-prometheus-operator-prometheus-0: prometheus-prometheus-operator-prometheus-db
These are the restore details. Note that Velero could not restore the alertmanager-prometheus-operator-alertmanager StatefulSet because it had already been created by the Alertmanager object, and likewise could not restore the prometheus-prometheus-operator-prometheus StatefulSet because it is created by the Prometheus object (another prometheus-operator CRD). The Prometheus PV could still be restored, however, because the operator-created StatefulSet was able to "adopt" the restored Pod. I have no clue why the Alertmanager StatefulSet could not "adopt" the restored Alertmanager Pod; perhaps a race condition or something else.
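One way to check whether adoption happened is to inspect the restored Pod's ownerReferences; this verification step is a hypothetical suggestion, not something run in the original report, and it requires kubectl access to the target cluster:

```shell
# If the StatefulSet controller adopted the restored Pod, an
# ownerReference of kind "StatefulSet" should appear in the output.
kubectl -n monitoring-restored get pod alertmanager-prometheus-operator-alertmanager-0 \
  -o jsonpath='{.metadata.ownerReferences[*].kind}'
```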
Name: monitoring
Namespace: velero
Labels: <none>
Annotations: <none>
Phase: PartiallyFailed (run 'velero restore logs monitoring' for more information)
Warnings:
Velero: <none>
Cluster: could not restore, customresourcedefinitions.apiextensions.k8s.io "alertmanagers.monitoring.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, customresourcedefinitions.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, customresourcedefinitions.apiextensions.k8s.io "prometheusrules.monitoring.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-grafana-clusterrolebinding" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-kube-state-metrics" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-operator-psp" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-operator" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-prometheus-psp" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-prometheus" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, clusterrolebindings.rbac.authorization.k8s.io "psp-prometheus-operator-kube-state-metrics" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, clusterrolebindings.rbac.authorization.k8s.io "psp-prometheus-operator-prometheus-node-exporter" already exists. Warning: the in-cluster version is different than the backed-up version.
Namespaces:
monitoring-restored: could not restore, endpoints "alertmanager-operated" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, services "alertmanager-operated" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, services "prometheus-operated" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, statefulsets.apps "alertmanager-prometheus-operator-alertmanager" already exists. Warning: the in-cluster version is different than the backed-up version.
could not restore, statefulsets.apps "prometheus-prometheus-operator-prometheus" already exists. Warning: the in-cluster version is different than the backed-up version.
Errors:
Velero: timed out waiting for all PodVolumeRestores to complete
Cluster: <none>
Namespaces: <none>
Backup: monitoring
Namespaces:
Included: all namespaces found in the backup
Excluded: <none>
Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
Cluster-scoped: auto
Namespace mappings: monitoring=monitoring-restored
Label selector: <none>
Restore PVs: auto
Restic Restores:
Completed:
monitoring-restored/prometheus-operator-grafana-7ff4f8b97b-jxwzk: storage
monitoring-restored/prometheus-prometheus-operator-prometheus-0: prometheus-prometheus-operator-prometheus-db
New:
monitoring-restored/alertmanager-prometheus-operator-alertmanager-0: alertmanager-prometheus-operator-alertmanager-db
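A restic restore stuck in the New phase usually means the restic-wait (restore helper) init container was never injected into the recreated Pod, as the pod_volume_restore controller logs above suggest. A quick way to check (a hypothetical command assuming kubectl access, not from the thread):

```shell
# List init containers on the restored Pod; a restic restore that is
# progressing should show a "restic-wait" init container here.
kubectl -n monitoring-restored get pod alertmanager-prometheus-operator-alertmanager-0 \
  -o jsonpath='{.spec.initContainers[*].name}'
```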
I'll try restoring the Pods and PVs first, then the rest.
The PV restore using the command below completed successfully:
velero restore create monitoring-1 --from-backup monitoring --namespace-mappings monitoring:monitoring-restored \
--exclude-resources=alertmanager.monitoring.coreos.com,prometheuses.monitoring.coreos.com
After that, I could restore the Alertmanager and Prometheus objects without issues:
velero restore create monitoring-cdrs --from-backup monitoring --namespace-mappings monitoring:monitoring-restored \
--include-resources=alertmanager.monitoring.coreos.com,prometheuses.monitoring.coreos.com
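After the two restores, a generic way to confirm that both completed and that the restic pod volume restores are no longer stuck in "New" (a verification step suggested here, not part of the original report):

```shell
# Inspect both restores; the Restic Restores section should list the
# alertmanager volume under Completed rather than New.
velero restore describe monitoring-1 --details
velero restore describe monitoring-cdrs --details
```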
Closing this because this issue was (mostly) resolved for the reporter, but @sseago or @dymurray feel free to reopen if you want to work on this.
Most helpful comment
Off the top of my head, I'm not sure what's going on, although I haven't looked at the logs in detail yet. The redeployment of a new pod may well be affecting things here, since the new pod probably won't have the restic annotation.
For the work my group has been doing, we actually do a two-phase backup/restore, in part to eliminate as much complexity as possible from the environment restic is working in. We create a full backup without any restic annotations, and then a limited backup with just the PVs/PVCs and the pods that mount them, with the restic annotations. Then, on restore, we first restore the restic backup (pods only; no deployments, deploymentconfigs, etc.) -- this is when the restic copies happen. Then those restored pods are deleted and we do the full restore (without restic annotations). I don't know that all of this is necessary for a basic backup/restore -- in our case we're using it for app migration from one cluster to another, with the possibility of running the restic/PV migration more than once before the final migration.
In any case, if you're restoring deploymentconfigs which then roll out new pods post-restore, that could definitely interfere with restic. I don't know what the appropriate general-purpose answer is here -- our approach has been for a very specific migration use case. I wonder whether the same issue comes up with non-OpenShift resources: DaemonSets, Deployments, etc. Annotating the pod template spec, as suggested above (in addition to annotating the pod), may be the way to go here. I'm not sure whether it will resolve this issue completely or not, though.
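The two-phase approach described above could be sketched roughly as follows. All names here (backup names, namespace `myapp`, pod `mypod`, volume `data`) are placeholders, and the exact resource selection is an assumption; only the `backup.velero.io/backup-volumes` annotation is the standard Velero restic opt-in mechanism:

```shell
# Phase 1: full backup of the app, with no restic volume annotations set.
velero backup create myapp-full --include-namespaces myapp

# Annotate the pods that mount the volumes so restic picks them up.
# "data" is a placeholder volume name from the pod spec.
kubectl -n myapp annotate pod mypod backup.velero.io/backup-volumes=data

# Phase 2: limited backup of just the PVs/PVCs and the annotated pods.
velero backup create myapp-pv --include-namespaces myapp \
  --include-resources=pods,persistentvolumeclaims,persistentvolumes

# On restore: first restore the restic backup (the restic copies happen now),
# then delete those restored pods and run the full restore.
velero restore create --from-backup myapp-pv
kubectl -n myapp delete pod mypod
velero restore create --from-backup myapp-full
```

The point of the ordering is that the restic data lands on the PVs while no controller is fighting over the pods; the final full restore then recreates the workloads on top of the already-populated volumes.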