What steps did you take and what happened:
When attempting to access the detailed status of a backup, or to review its logs, timeouts are returned.
What did you expect to happen:
Both the detailed status and the logs to be displayed.
The output of the following commands will help us better understand what's going on:
velero backup describe daily-backup-c006-20191112133437 --details
Name: daily-backup-c006-20191112133437
Namespace: velero
Labels:       velero.io/backup=daily-backup-c006-20191112133437
              velero.io/pv=pvc-5ce737ae-fd4c-11e9-8c44-42010a3ce057
              velero.io/schedule-name=daily-backup-c006
              velero.io/storage-location=default
Annotations: <none>
Phase: PartiallyFailed (run `velero backup logs daily-backup-c006-20191112133437` for more information)
Errors: 1
Warnings: 0
Namespaces:
  Included:  *
  Excluded:  <none>
Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto
Label selector: <none>
Storage Location: default
Snapshot PVs: auto
TTL: 120h0m0s
Hooks: <none>
Backup Format Version: 1
Started: 2019-11-12 06:34:37 -0700 MST
Completed: 2019-11-12 06:35:07 -0700 MST
Expiration: 2019-11-17 06:34:37 -0700 MST
Resource List: <error getting backup resource list: timed out waiting for download URL>
Persistent Volumes: <error getting volume snapshot info: timed out waiting for download URL>
Restic Backups:
  Completed:
    database/postgresql-postgresql-0: data
  Failed:
    database/mariadb-0: data
And
velero backup logs daily-backup-c006-20191112133437
An error occurred: timed out waiting for download URL
Anything else you would like to add:
This is a private GKE cluster.
Environment:
velero version
Client:
Version: v1.2.0
Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
Version: v1.2.0
kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-gke.9", GitCommit:"6c1e92d07f5717440f751666d4aad6943015d3cb", GitTreeState:"clean", BuildDate:"2019-10-11T23:14:17Z", GoVersion:"go1.11.13b4", Compiler:"gc", Platform:"linux/amd64"}
This also seems to cause lots of errors like the following, and failures to back up PVs:
time="2019-11-13T10:54:12Z" level=error msg="Error backing up item" backup=velero/daily-backup-c007-20191113105351 error="pod volume backup failed: error running restic backup, stderr=Fatal: invalid id \"8860b8f7\": no matching ID found\n: unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:182" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:288" name=cseven-zookeeper-2 namespace= resource=pods
The `An error occurred: timed out waiting for download URL` error usually indicates the Velero server isn't running properly. Can you check if the pod is running (`kubectl -n velero get pods`), and also get the logs for it (`kubectl -n velero logs deploy/velero`)?
I did check those and they all seem fine; backups are taken and data shows up in the bucket.
kc get pods -n velero
NAME READY STATUS RESTARTS AGE
restic-4cdr9 1/1 Running 0 26h
restic-67tzb 1/1 Running 0 26h
restic-9bp4w 1/1 Running 0 26h
restic-x7ljs 1/1 Running 0 26h
velero-75bd7659b-g2kx4 1/1 Running 0 26h
As mentioned above, there are normal logs but also a lot of errors which seem to be related:
time="2019-11-13T15:20:51Z" level=info msg="Checking for expired DeleteBackupRequests" controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:456"
time="2019-11-13T15:20:51Z" level=info msg="Done checking for expired DeleteBackupRequests" controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:484"
time="2019-11-13T15:33:42Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021123 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:44Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021145 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:44Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021204 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:44Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021216 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:45Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191111085328 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:45Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191111091931 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:45Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021003 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:46Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021028 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:46Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021333 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:47Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021058 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:47Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021107 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:47Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191112021114 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T15:33:47Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/test-
backup-20191111085352 logSource="pkg/controller/generic_controller.go:137"
Thanks.
Hmm, seems like maybe there are some orphaned invalid requests in there. Can you `kubectl -n velero delete downloadrequests.velero.io --all`, then try running a logs or describe --details command again?
There were some:
kubectl -n velero delete downloadrequests.velero.io --all
downloadrequest.velero.io "test-backup-20191111085328" deleted
downloadrequest.velero.io "test-backup-20191111085352" deleted
downloadrequest.velero.io "test-backup-20191111091931" deleted
downloadrequest.velero.io "test-backup-20191112021003" deleted
downloadrequest.velero.io "test-backup-20191112021028" deleted
downloadrequest.velero.io "test-backup-20191112021058" deleted
downloadrequest.velero.io "test-backup-20191112021107" deleted
downloadrequest.velero.io "test-backup-20191112021114" deleted
downloadrequest.velero.io "test-backup-20191112021123" deleted
downloadrequest.velero.io "test-backup-20191112021145" deleted
downloadrequest.velero.io "test-backup-20191112021204" deleted
downloadrequest.velero.io "test-backup-20191112021216" deleted
downloadrequest.velero.io "test-backup-20191112021333" deleted
But the issues with --details still remain.
Resource List: <error getting backup resource list: timed out waiting for download URL>
Persistent Volumes: <error getting volume snapshot info: timed out waiting for download URL>
But the errors in the logs have ceased :)
Is there anything in the logs from when you just ran the --details cmd?
Ah, apologies, yes those errors appear again during that command:-
time="2019-11-13T16:03:10Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/daily-backup-c007-20191113105351-20191113090303 logSource="pkg/controller/generic_controller.go:137"
time="2019-11-13T16:03:16Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 404: Requested entity was not found., notFound" key=velero/daily-backup-c007-20191113105351-20191113090303 logSource="pkg/controller/generic_controller.go:137"
Can you look in the actual GCS bucket (I assume you're using GCS?) and see what files are in there under backups/<backup-name>?
All looks OK to me
gsutil ls gs://<project>-velero/backups/daily-backup-c007-20191113105351/
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351-logs.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351-podvolumebackups.json.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351-resource-list.json.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351-volumesnapshots.json.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/daily-backup-c007-20191113105351.tar.gz
gs://<project>-velero/backups/daily-backup-c007-20191113105351/velero-backup.json
OK... looks like the files are there; I'm not seeing any indication of why you'd be getting 404s.
How about `kubectl -n velero get downloadrequests.velero.io -o yaml`?
This shows:-
kubectl -n velero get downloadrequests.velero.io -o yaml
apiVersion: v1
items: []
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
OK, how about running another logs/describe --details command, seeing if there's an error, and if so, getting the pod logs and running `kubectl -n velero get downloadrequests.velero.io -o yaml`?
So the logs show the 404s as before when running a describe --details, and the k8s object shows:
apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: DownloadRequest
  metadata:
    creationTimestamp: "2019-11-13T16:16:46Z"
    generation: 1
    name: daily-backup-c007-20191113105351-20191113091646
    namespace: velero
    resourceVersion: "9162134"
    selfLink: /apis/velero.io/v1/namespaces/velero/downloadrequests/daily-backup-c007-20191113105351-20191113091646
    uid: fcfea37d-0630-11ea-b190-42010a3ce069
  spec:
    target:
      kind: BackupVolumeSnapshots
      name: daily-backup-c007-20191113105351
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Which is curious as I ran velero backup describe daily-backup-c007-20191113105351 --details
I'm not seeing anything wrong here. Velero is supposed to be able to get a pre-signed URL for these objects in object storage, which enables them to be downloaded directly to your client.
Are you using the example IAM policy we provide here (https://github.com/vmware-tanzu/velero-plugin-for-gcp#Set-permissions-for-Velero)? Or are you doing something different?
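(Side note: if you have a key file for that service account handy, signed-URL generation can be sanity-checked outside of Velero with something along these lines; the key file name and object path here are just examples.)
gsutil signurl -d 10m credentials-velero.json gs://<project>-velero/backups/daily-backup-c006-20191112133437/velero-backup.json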
Hi - sorry, we are using GKE workload_identity here, but the GCP service account has the correct permissions.
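(As a general sanity check for a workload identity setup like this, the KSA annotation and the workloadIdentityUser binding can be inspected with something like the following; the `velero` service account name assumes the default install, and the GSA email is taken from the error output further down.)
kubectl -n velero get serviceaccount velero -o yaml   # look for the iam.gke.io/gcp-service-account annotation
gcloud iam service-accounts get-iam-policy svc-velero@<project>.iam.gserviceaccount.com   # look for a roles/iam.workloadIdentityUser binding for that KSA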
Hi,
OK, a little more information: if I use the "old" method with the service account, I can get the log/details information with a combination of https_proxy/no_proxy. Without these I get:-
Resource List: <error getting backup resource list: Get https://storage.googleapis.com/<project>-velero-sa/backups/test-backup/test-backup-resource-list.json.gz?Expires=1573719004&GoogleAccessId=svc-velero%40<project>.iam.gserviceaccount.com&Signature=<signature>: dial tcp 108.177.122.128:443: i/o timeout>
But setting the proxy values gets the information just fine.
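(Roughly speaking, that means exporting the proxy variables for the velero CLI invocation, e.g. something like the following; the proxy address and no_proxy entry are just placeholders.)
https_proxy=http://<proxy-host>:3128 no_proxy=<k8s-api-server-ip> velero backup describe test-backup --details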
Using the same values with a workload_identity based install, I still get the timeouts as above (I did try this previously though).
Thanks.
Perhaps @victortrac @salrashid123 might have some insight here?
I'm not familiar w/ Velero (the product), but I could help w/ the workload identity-GCS bit..
Can you provide any additional traces that specifically involve Velero's internal calls out to the IAM API and/or GCS?
If it's somehow using an http/https_proxy to make the connection, the GCS and IAM libraries should honor the env vars.
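(Hypothetically, if the server side also needed to go through the proxy, that could be set on the deployment along these lines; the proxy address and exclusions are placeholders, with 169.254.169.254 excluded so metadata-server/workload-identity calls aren't proxied.)
kubectl -n velero set env deploy/velero HTTPS_PROXY=http://<proxy-host>:3128 NO_PROXY=10.0.0.0/8,169.254.169.254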
These nodes being private seems like it has something to do with it, but it's strange that the nodes are able to write to GCS but not read (based on the timeout error).
@PsychoSid: Do you have NAT enabled or are you purely relying on VPC private access? The IP that you returned (108.177.122.128) is a public IP, so it seems like you're not using private access when workload identity is enabled. Can you try adding a NAT gateway and seeing if the problem goes away?
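(For reference, if you're able to try it, a Cloud NAT is typically created along these lines; the router/NAT names and region are placeholders, and for a shared VPC this would have to be done in the host project.)
gcloud compute routers create velero-nat-router --network <shared-vpc> --region <region>
gcloud compute routers nats create velero-nat --router velero-nat-router --region <region> --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges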
Hi,
It's purely private access with no NAT; we use a shared VPC from a host project.
Sorry, I don't have the chance to add a NAT as I don't have the access to the host project required to make that change.
The nodes can read from GCS just fine; we use that GSA with workload_identity in another workspace to access GCS and it works just fine.
Apologies that I'm unable to debug further. Now that I have it working with a service account, that's a valid workaround for now, so feel free to close if we're unable to debug/workaround further.
Many thanks for looking into it. If I get the chance to look at in the future I will.
alright, glad you were able to get it working. Feel free to reopen in the future as needed.
Hi. Got the exact same error with velero on GKE and workload identity enabled:
Client:
Version: v1.2.0
Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
Version: v1.2.0
... updated to:
Client:
Version: v1.4.2
Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5
Server:
Version: v1.4.0
... but no dice (same issue using both versions), logs output fwiw:
time="2020-08-19T17:50:24Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=downloadrequest error="rpc error: code = Unknown desc = googleapi: Error 403: The caller does not have permission, forbidden" key=velero/default-backup-20200819161457-20200819194942 logSource="pkg/controller/generic_controller.go:137"
Can't `velero backup logs test`:
An error occurred: timed out waiting for download URL
... or, e.g., `velero backup describe test --details`:
...
Resource List: <error getting backup resource list: timed out waiting for download URL>
Velero-Native Snapshots: <error getting snapshot info: timed out waiting for download URL>
For future searchers: I had this issue on a GKE private cluster with workload identity enabled as well. Was able to use IAM audit logging to suss out that the velero service account needed the Service Account Token Creator role, because it seems to use the service accounts signBlob API.
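(In case it helps others, granting that role so the GSA can sign URLs as itself looks roughly like this; the service-account email is a placeholder for whatever GSA Velero's KSA is bound to.)
gcloud iam service-accounts add-iam-policy-binding velero@<project>.iam.gserviceaccount.com \
  --member "serviceAccount:velero@<project>.iam.gserviceaccount.com" \
  --role "roles/iam.serviceAccountTokenCreator"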
I've got the same problem on a GKE cluster (private node, master with authorized network).
I added these permissions:
It works now.
@nlamirault That's amazing. I was looking for a reason why Velero doesn't show the logs just 45 minutes ago, and there you came with your revelation at the exact same time.
I can confirm it works with your settings.
@dex4er it's in the documentation: https://github.com/vmware-tanzu/velero-plugin-for-gcp#option-2-set-permissions-with-using-workload-identity-optional
but not really highlighted :)