What steps did you take and what happened:
When attempting to access the detailed status of a backup, or to review its logs, timeouts are returned.
What did you expect to happen:
Both the detailed status and the logs to be displayed.
The output of the following commands will help us better understand what's going on:
velero backup describe daily-backup-c006-20191112133437 --details
Name: daily-backup-c006-20191112133437
Namespace: velero
Labels:       velero.io/backup=daily-backup-c006-20191112133437
              velero.io/pv=pvc-5ce737ae-fd4c-11e9-8c44-42010a3ce057
              velero.io/schedule-name=daily-backup-c006
              velero.io/storage-location=default
Annotations: <none>
Phase: PartiallyFailed (run `velero backup logs daily-backup-c006-20191112133437` for more information)
Errors: 1
Warnings: 0
Namespaces:
  Included:  *
  Excluded:  <none>
Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto
Label selector: <none>
Storage Location: default
Snapshot PVs: auto
TTL: 120h0m0s
Hooks: <none>
Backup Format Version: 1
Started: 2019-11-12 06:34:37 -0700 MST
Completed: 2019-11-12 06:35:07 -0700 MST
Expiration: 2019-11-17 06:34:37 -0700 MST
Resource List: <error getting backup resource list: timed out waiting for download URL>
Persistent Volumes: <error getting volume snapshot info: timed out waiting for download URL>
Restic Backups:
  Completed:
    database/postgresql-postgresql-0: data
  Failed:
    database/mariadb-0: data
And
velero backup logs daily-backup-c006-20191112133437
An error occurred: timed out waiting for download URL
Anything else you would like to add:
This is a private GKE cluster.
Environment:
velero version
Client:
Version: v1.2.0
Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
Version: v1.2.0
kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-gke.9", GitCommit:"6c1e92d07f5717440f751666d4aad6943015d3cb", GitTreeState:"clean", BuildDate:"2019-10-11T23:14:17Z", GoVersion:"go1.11.13b4", Compiler:"gc", Platform:"linux/amd64"}
This also seems to cause lots of errors like the following, and failures to back up PVs:
time="2019-11-13T10:54:12Z" level=error msg="Error backing up item" backup=velero/daily-backup-c007-20191113105351 error="pod volume backup failed: error running restic backup, stderr=Fatal: invalid id \"8860b8f7\": no matching ID found\n: unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:182" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:288" name=cseven-zookeeper-2 namespace= resource=pods
The `An error occurred: timed out waiting for download URL` error usually indicates the Velero server isn't running properly. Can you check if the pod is running (`kubectl -n velero get pods`), and also get the logs for it (`kubectl -n velero logs deploy/velero`)?
I did check those and they all seem fine; backups are taken and data shows up in the bucket.
kc get pods -n velero
NAME READY STATUS RESTARTS AGE
restic-4cdr9 1/1 Running 0 26h
restic-67tzb 1/1 Running 0 26h
restic-9bp4w 1/1 Running 0 26h
restic-x7ljs 1/1 Running 0 26h
velero-75bd7659b-g2kx4 1/1 Running 0 26h
As mentioned above, there are normal logs but also a lot of errors which seem to be related:
time="2019-11-13T15:20:51Z" level=info msg="Checking for expired DeleteBackupRequests" controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:456"
time="2019-11-13T15:20:51Z" level=info msg="Done checking for expired DeleteBackupRequests" controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:484"
time="2019-11-13T15:33:42Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021123 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:44Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021145 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:44Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021204 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:44Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021216 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:45Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191111085328 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:45Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191111091931 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:45Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021003 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:46Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021028 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:46Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021333 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:47Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021058 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:47Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021107 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:47Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021114 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:47Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191111085352 logSource="pkg/controller/generic_controller.go:137"
Thanks.
Hmm, seems like maybe there are some orphaned invalid requests in there. Can you `kubectl -n velero delete downloadrequests.velero.io --all`, then try running a logs or describe --details command again?
There were some:
kubectl -n velero delete downloadrequests.velero.io --all
downloadrequest.velero.io "test-backup-20191111085328" deleted
downloadrequest.velero.io "test-backup-20191111085352" deleted
downloadrequest.velero.io "test-backup-20191111091931" deleted
downloadrequest.velero.io "test-backup-20191112021003" deleted
downloadrequest.velero.io "test-backup-20191112021028" deleted
downloadrequest.velero.io "test-backup-20191112021058" deleted
downloadrequest.velero.io "test-backup-20191112021107" deleted
downloadrequest.velero.io "test-backup-20191112021114" deleted
downloadrequest.velero.io "test-backup-20191112021123" deleted
downloadrequest.velero.io "test-backup-20191112021145" deleted
downloadrequest.velero.io "test-backup-20191112021204" deleted
downloadrequest.velero.io "test-backup-20191112021216" deleted
downloadrequest.velero.io "test-backup-20191112021333" deleted
But the issues with --details still remain.
Resource List: <error getting backup resource list: timed out waiting for download URL>
Persistent Volumes: <error getting volume snapshot info: timed out waiting for download URL>
But the errors in the logs have ceased :)
Is there anything in the logs from when you just ran the --details cmd?
Ah, apologies, yes those errors appear again during that command:-
time="2019-11-13T16:03:10Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/daily-backup-c007-20191113105351-20191113090303 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T16:03:16Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/daily-backup-c007-20191113105351-20191113090303 logSource="pkg/controller/generic_controller.go:137"
Can you look in the actual GCS bucket (I assume you're using GCS?) and see what files are in there under backups/<backup-name>?
All looks OK to me
gsutil ls gs://<project>-velero/backups/daily-backup-c007-20191113105351/
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351-logs.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351-podvolumebackups.json.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351-resource-list.json.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351-volumesnapshots.json.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351.tar.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/velero-backup.json
OK... looks like the files are there; I'm not seeing any indication of why you'd be getting 404s.
How about `kubectl -n velero get downloadrequests.velero.io -o yaml`?
This shows:-
kubectl -n velero get downloadrequests.velero.io -o yaml
apiVersion: v1
items: []
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
OK, how about running another logs/describe --details command, seeing if there's an error, and if so, getting the pod logs and running `kubectl -n velero get downloadrequests.velero.io -o yaml`?
So the logs show the 404s as before when running a describe --details, and the k8s object shows:
apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: DownloadRequest
  metadata:
    creationTimestamp: "2019-11-13T16:16:46Z"
    generation: 1
    name: daily-backup-c007-20191113105351-20191113091646
    namespace: velero
    resourceVersion: "9162134"
    selfLink: /apis/velero.io/v1/namespaces/velero/downloadrequests/daily-backup-c007-20191113105351-20191113091646
    uid: fcfea37d-0630-11ea-b190-42010a3ce069
  spec:
    target:
      kind: BackupVolumeSnapshots
      name: daily-backup-c007-20191113105351
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Which is curious as I ran velero backup describe daily-backup-c007-20191113105351 --details
I'm not seeing anything wrong here. Velero is supposed to be able to get a pre-signed URL for these objects in object storage, which enables them to be downloaded directly to your client.
Are you using the example IAM policy we provide here (https://github.com/vmware-tanzu/velero-plugin-for-gcp#Set-permissions-for-Velero)? Or are you doing something different?
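(Side note: if you have a key file for that service account handy, signed-URL generation can be sanity-checked outside of Velero with something along these lines; the key file name and object path here are just examples.)
gsutil signurl -d 10m credentials-velero.json gs://<project>-velero/backups/daily-backup-c006-20191112133437/velero-backup.json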
Hi - sorry, we are using GKE workload_identity here, but the GCP service account has the correct permissions.
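(As a general sanity check for a workload identity setup like this, the KSA annotation and the workloadIdentityUser binding can be inspected with something like the following; the `velero` service account name assumes the default install, and the GSA email is taken from the error output further down.)
kubectl -n velero get serviceaccount velero -o yaml   # look for the iam.gke.io/gcp-service-account annotation
gcloud iam service-accounts get-iam-policy svc-velero@<project>.iam.gserviceaccount.com   # look for a roles/iam.workloadIdentityUser binding for that KSA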
Hi,
OK, a little more information: if I use the "old" method with the service account, I can get the log/details information with a combination of https_proxy/no_proxy. Without these I get:-
Resource List: <error getting backup resource list: Get https://storage.googleapis.com/<project>-velero-sa/backups/test-backup/test-backup-resource-list.json.gz?Expires=1573719004&GoogleAccessId=svc-velero%40<project>.iam.gserviceaccount.com&Signature=<signature>: dial tcp 108.177.122.128:443: i/o timeout>
But setting the proxy values gets the information just fine.
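(Roughly speaking, that means exporting the proxy variables for the velero CLI invocation, e.g. something like the following; the proxy address and no_proxy entry are just placeholders.)
https_proxy=http://<proxy-host>:3128 no_proxy=<k8s-api-server-ip> velero backup describe test-backup --details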
Using the same values with a workload_identity based install, I still get the timeouts as above (I did try this previously though).
Thanks.
Perhaps @victortrac @salrashid123 might have some insight here?
I'm not familiar w/ Velero (the product), but I could help w/ the workload identity-GCS bit..
Can you provide any additional traces that specifically involve Velero's internal calls out to the IAM API and/or GCS?
If it's somehow using an http/https_proxy to make the connection, the GCS and IAM libraries should honor the env vars.
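(Hypothetically, if the server side also needed to go through the proxy, that could be set on the deployment along these lines; the proxy address and exclusions are placeholders, with 169.254.169.254 excluded so metadata-server/workload-identity calls aren't proxied.)
kubectl -n velero set env deploy/velero HTTPS_PROXY=http://<proxy-host>:3128 NO_PROXY=10.0.0.0/8,169.254.169.254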
These nodes being private seems like it has something to do with it, but it's strange that the nodes are able to write to GCS but not read (based on the timeout error).
@PsychoSid: Do you have NAT enabled or are you purely relying on VPC private access? The IP that you returned (108.177.122.128) is a public IP, so it seems like you're not using private access when workload identity is enabled. Can you try adding a NAT gateway and seeing if the problem goes away?
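(For reference, if you're able to try it, a Cloud NAT is typically created along these lines; the router/NAT names and region are placeholders, and for a shared VPC this would have to be done in the host project.)
gcloud compute routers create velero-nat-router --network <shared-vpc> --region <region>
gcloud compute routers nats create velero-nat --router velero-nat-router --region <region> --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges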
Hi,
It's purely private access with no NAT; we use a shared VPC from a host project.
Sorry, I don't have the chance to add a NAT as I don't have the access to the host project required to make that change.
The nodes can read from GCS just fine; we use that GSA with workload_identity in another workspace to access GCS and it works just fine.
Apologies that I'm unable to debug further. Now that I have it working with a service account, that's a valid workaround for now, so feel free to close if we're unable to debug/workaround further.
Many thanks for looking into it. If I get the chance to look at in the future I will.
alright, glad you were able to get it working. Feel free to reopen in the future as needed.
Hi. Got the exact same error with velero on GKE and workload identity enabled:
Client:
Version: v1.2.0
Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
Version: v1.2.0
... updated to:
Client:
Version: v1.4.2
Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5
Server:
Version: v1.4.0
... but no dice (same issue using both versions), logs output fwiw:
time="2020-08-19T17:50:24Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 403: The caller does not have permission, forbidden" key=velero/default-backup-20200819161457-20200819194942 logSource="pkg/controller/generic_controller.go:137"
Can't `velero backup logs test`:
An error occurred: timed out waiting for download URL
... or, e.g., `velero backup describe test --details`:
...
Resource List: <error getting backup resource list: timed out waiting for download URL>
Velero-Native Snapshots: <error getting snapshot info: timed out waiting for download URL>
For future searchers: I had this issue on a GKE private cluster with workload identity enabled as well. Was able to use IAM audit logging to suss out that the velero service account needed the Service Account Token Creator role, because it seems to use the service accounts signBlob API.
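(In case it helps others, granting that role so the GSA can sign URLs as itself looks roughly like this; the service-account email is a placeholder for whatever GSA Velero's KSA is bound to.)
gcloud iam service-accounts add-iam-policy-binding velero@<project>.iam.gserviceaccount.com \
  --member "serviceAccount:velero@<project>.iam.gserviceaccount.com" \
  --role "roles/iam.serviceAccountTokenCreator"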
I've got the same problem on a GKE cluster (private node, master with authorized network).
I added these permissions:
It works now.
@nlamirault That's amazing. I was looking for a reason why Velero doesn't show the logs just 45 minutes ago, and there you came with your revelation at the exact same time.
I can confirm it works with your settings.
@dex4er it's in the documentation: https://github.com/vmware-tanzu/velero-plugin-for-gcp#option-2-set-permissions-with-using-workload-identity-optional
but not really highlighted :)