What steps did you take and what happened:
Note: I'm deploying Velero in a GKE private cluster. Images for velero and velero-plugin-for-gcp have been copied over to the internal GCR repo. We're also using WorkloadIdentity.
velero install --image gcr.io/foo/velero:v1.2.0 --provider gcp --plugins gcr.io/foo/velero-plugin-for-gcp:v1.0.0 --bucket $BUCKET --no-secret --sa-annotations iam.gke.io/[email protected] --backup-location-config [email protected]
Results in the following error, and the server halts:
An error occurred: some backup storage locations are invalid: error getting backup store for location "default": unable to locate ObjectStore plugin named velero.io/gcp
More details:
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: 2020-02-24T04:44:56Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "3283695"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 68071597-56c0-11ea-91f8-4201ac107009
spec:
  config:
    serviceAccount: [email protected]
  objectStorage:
    bucket: <foo_bucket>
  provider: gcp
status: {}
What did you expect to happen:
Velero up and running ready to receive other instructions.
The output of the following commands will help us better understand what's going on:
kubectl logs deployment/velero -n velero: https://gist.github.com/guilledipa/1c071d053eb9bb57a7f4c1e6e4d56cd8
Anything else you would like to add:
Deleting the default BackupStorageLocation and replacing it with the same definition, named gcp, fixes the reported issue; however, a new error appears later on:
$ kubectl -n velero delete backupstoragelocation default
$ kubectl apply -f <(echo -n "
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  labels:
    component: velero
  name: gcp
  namespace: velero
spec:
  config:
    name: gcp
    serviceAccount: [email protected]
  objectStorage:
    bucket: <foo_bucket>
  provider: velero.io/gcp")
backupstoragelocation.velero.io/gcp unchanged
Server starts but other errors appear:
time="2020-02-27T08:40:25Z" level=error msg="Error getting backup store for this location" backupLocation=gcp controller=backup-sync error="unable to locate ObjectStore plugin named velero.io/gcp" logSource="pkg/controller/backup_sync_controller.go:167"
Environment:
velero version: Client:
Version: v1.2.0
Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
Version: v1.2.0
velero client config get features: features: <NOT SET>
kubectl version: Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.9-gke.9", GitCommit:"a9973cbb2722793e2ea08d20880633ca61d3e669", GitTreeState:"clean", BuildDate:"2020-02-07T22:35:02Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes installer & version: 1.15.9-gke.9
Cloud provider or hardware configuration: GKE
Looks like the velero-plugin-for-gcp was not registered or available for velero to call into. To troubleshoot further, can you please share:
kubectl -n velero get deploy velero -ojson | jq .spec.template.spec.initContainers
If jq is not available, then kubectl -n velero get deploy velero -oyaml
$ # exec into the running velero pod
$ kubectl -n velero exec -ti deploy/velero bash
$ # inspect contents of the /plugins directory
nobody@velero-6758d49c44-7pt5h:/$ pwd
/
nobody@velero-6758d49c44-7pt5h:/$ ls /plugins/
Here is the requested data:
$ kubectl -n velero get deploy velero -ojson | jq .spec.template.spec.initContainers
[
  {
    "image": "gcr.io/foo/velero-plugin-for-gcp:v1.0.0",
    "imagePullPolicy": "IfNotPresent",
    "name": "velero-plugin-for-gcp",
    "resources": {},
    "terminationMessagePath": "/dev/termination-log",
    "terminationMessagePolicy": "File",
    "volumeMounts": [
      {
        "mountPath": "/target",
        "name": "plugins"
      }
    ]
  }
]
Regarding the contents of /plugins: because the pod is in CrashLoopBackOff state, I applied the "workaround" suggested in the "Anything else you would like to add" section in order to have a running pod. Here are the results:
nobody@velero-7c7f5978b5-jtdrn:/$ ls plugins/
nobody@velero-7c7f5978b5-jtdrn:/$
Thanks for the quick response :)
As I suspected, the plugins directory is empty. That is the reason Velero is not able to discover these plugins, resulting in the unable to locate ObjectStore plugin named velero.io/gcp error.
Is gcr.io/foo/velero-plugin-for-gcp:v1.0.0 an image you built? And can you please confirm that you are using this Dockerfile to build the image?
This line is what is responsible for copying the plugin binaries to a place where Velero can discover them: https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/Dockerfile#L27
gcr.io/foo/velero-plugin-for-gcp:v1.0.0 is just a copy of velero/velero-plugin-for-gcp:v1.0.0 pushed into our own GCR:
$ docker pull velero/velero-plugin-for-gcp:v1.0.0
$ docker tag <ID> gcr.io/foo/velero-plugin-for-gcp:v1.0.0
$ docker push gcr.io/foo/velero-plugin-for-gcp:v1.0.0
I'll verify the health of these images and report back
Thanks very much!
The plugin image looks correct to me:
$ docker run -it --entrypoint /bin/bash --user nobody gcr.io/foo/velero-plugin-for-gcp:v1.0.0
nobody@17f0811fc883:/$ ls
bin boot dev etc home lib lib64 media mnt opt plugins proc root run sbin srv sys tmp usr var
nobody@17f0811fc883:/$ ls plugins/
velero-plugin-for-gcp
Now, looking at the velero-plugin-for-gcp Dockerfile, I see that the last layer does:
ENTRYPOINT ["/bin/bash", "-c", "cp /plugins/* /target/."]
So the initContainer is basically copying everything under /plugins/ into /target/, which is a mount point declared in the initContainer spec:
initContainers:
- image: gcr.io/jirahw-gcr/velero-plugin-for-gcp:v1.0.0
  imagePullPolicy: IfNotPresent
  name: velero-plugin-for-gcp
  resources: {}
  volumeMounts:
  - mountPath: /target
    name: plugins
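To sanity-check my understanding of the hand-off, here's a rough local simulation (the temp dirs below are stand-ins I made up for this sketch, not the real pod filesystem):

```shell
# Stand-ins for the real pod paths (illustrative only):
#   plugin_image_dir ~ /plugins inside the plugin image
#   shared_volume    ~ the "plugins" emptyDir volume, mounted at /target in
#                      the init container and at /plugins in the velero container
plugin_image_dir=$(mktemp -d)
shared_volume=$(mktemp -d)
touch "$plugin_image_dir/velero-plugin-for-gcp"

# This is effectively what the init container's ENTRYPOINT does:
cp "$plugin_image_dir"/* "$shared_volume"/.

# The velero container would then see the binary in its /plugins directory:
ls "$shared_volume"
```

So if the real /plugins ends up empty, either the init container never performed the copy, or the binary wasn't in the image to begin with.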
However, aren't we expecting the plugin to be deployed in /plugins within the velero pod, as per https://github.com/vmware-tanzu/velero/issues/2303#issuecomment-592134355 ?
Sorry, I see that in the velero container we're mounting the volume named plugins at /plugins.
That said, the /plugins directory is still empty.
We had a broken image in GCR and even though we pushed new ones (healthy), the deployment wasn't pulling the latest ones. I made the following change:
initContainers:
- image: gcr.io/jirahw-gcr/velero-plugin-for-gcp:v1.0.0
  imagePullPolicy: Always <----- HERE
  name: velero-plugin-for-gcp
  resources: {}
  volumeMounts:
  - mountPath: /target
    name: plugins
At the moment the velero pod starts up correctly, however, it sits at this point indefinitely:
time="2020-02-28T04:09:32Z" level=info msg="setting log-level to INFO" logSource="pkg/cmd/server/server.go:171"
time="2020-02-28T04:09:32Z" level=info msg="Starting Velero server v1.2.0 (5d008491bbf681658d3e372da1a9d3a21ca4c03c)" logSource="pkg/cmd/server/server.go:173"
time="2020-02-28T04:09:32Z" level=info msg="No feature flags enabled" logSource="pkg/cmd/server/server.go:177"
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pv
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service-account
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/add-pv-from-pvc
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/add-pvc-from-pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/change-storage-class
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/cluster-role-bindings
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/job
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/restic
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/role-bindings
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service-account
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/plugins/velero-plugin-for-gcp kind=VolumeSnapshotter logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/gcp
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/plugins/velero-plugin-for-gcp kind=ObjectStore logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/gcp
time="2020-02-28T04:09:32Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-02-28T04:09:32Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-02-28T04:09:35Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-02-28T04:09:35Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-02-28T04:09:35Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"
Given that we're using Workload Identity, might this be related?
I am not familiar with how workload identity works.
At this point, velero is trying to connect to the backup storage locations as part of validation.
The reason for no progress could be that the backup storage location may be taking too long to respond, or it may be stuck in the process of fetching credentials, if that's what workload identity does.
It does look like the GCP plugins are registering correctly now.
Re: workload identity - we have some docs on this at https://github.com/vmware-tanzu/velero-plugin-for-gcp#option-2-set-permissions-with-using-workload-identity-optional and https://github.com/vmware-tanzu/velero-plugin-for-gcp#install-and-start-velero - maybe take another look and ensure everything's configured correctly?
Thanks folks!
I double-checked the documentation and our configs LGTM:
Here are the IAM policies:
$ gcloud iam service-accounts get-iam-policy [email protected]
bindings:
- members:
  - serviceAccount:foo.svc.id.goog[velero/velero]
  role: roles/iam.workloadIdentityUser
This is the list of permissions assigned to the velero IAM service account serviceAccount:[email protected]:
- compute.disks.get
- compute.disks.create
- compute.disks.createSnapshot
- compute.snapshots.get
- compute.snapshots.create
- compute.snapshots.useReadOnly
- compute.snapshots.delete
- compute.zones.get
Here is the ServiceAccount annotation:
$ kubectl -n velero get serviceaccount velero -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{"iam.gke.io/gcp-service-account":"[email protected]"},"creationTimestamp":null,"labels":{"component":"velero"},"name":"velero","namespace":"velero"}}
  creationTimestamp: "2020-02-28T04:09:29Z"
  labels:
    component: velero
  name: velero
  namespace: velero
  resourceVersion: "4807231"
  selfLink: /api/v1/namespaces/velero/serviceaccounts/velero
  uid: 29925587-17d1-42a7-9d7d-52b0f78b523c
secrets:
- name: velero-token-jrqdf
Here is the BSL definition:
$ kubectl -n velero get BackupStorageLocation default -o yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"velero.io/v1","kind":"BackupStorageLocation","metadata":{"annotations":{},"creationTimestamp":null,"labels":{"component":"velero"},"name":"default","namespace":"velero"},"spec":{"config":{"serviceAccount":"[email protected]"},"objectStorage":{"bucket":"velero-foo"},"provider":"gcp"}}
  creationTimestamp: "2020-02-28T04:09:29Z"
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "4807233"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 7d05d762-38e6-4aba-9092-ce02b8398a9e
spec:
  config:
    serviceAccount: [email protected]
  objectStorage:
    bucket: velero-foo
  provider: gcp
Is there any way I can bump up the logging to understand where velero is failing to fetch credentials?
Hey folks, any ideas on what might be going on?
Please let me know if I can provide more information.
Cheers,
Hi @guilledipa, sorry about the delay in responding here. Can you please share how you are installing Velero? If you are using the velero install CLI, can you please share the command you are using?
Also, yaml for your velero deployment would be useful.
Thanks @ashish-amarnath! No worries at all :)
We use Anthos to manage the objects in GKE. Therefore we use velero install ... --dry-run -o yaml to generate this file:
https://gist.github.com/guilledipa/25a64d86bedf8c2364f28db302c707da
(which is enforced by Anthos).
The command is:
velero install --image gcr.io/foo/velero:v1.2.0 --provider gcp --plugins gcr.io/foo/velero-plugin-for-gcp:v1.0.0 --bucket velero-foo --no-secret --sa-annotations iam.gke.io/[email protected] --backup-location-config [email protected] --dry-run -o yaml
This is what the velero-plugin-for-gcp may be waiting on: https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/velero-plugin-for-gcp/object_store.go#L78
Most likely waiting here https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/velero-plugin-for-gcp/object_store.go#L139
I'd suggest adding some log messages around that code to investigate. I can build a custom image with instrumentation for you to try. It is almost EOD for me today. I can do this first up tomorrow.
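On the earlier question about bumping up the logging: if I remember right (treat this as an assumption to verify against your Velero version), the server accepts a --log-level flag — your startup logs do show "setting log-level to INFO" — so raising it to debug via the velero container's args in the Deployment might surface where validation gets stuck. Something like this fragment (image name mirrors the one used in this thread):

```yaml
# Sketch of the relevant part of the velero Deployment spec;
# the extra arg is passed through to `velero server`.
containers:
- name: velero
  image: gcr.io/foo/velero:v1.2.0
  args:
  - server
  - --log-level=debug
```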
Also suggest checking steps 5,6,7 and 8 from https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity in case you've not already done that.
Thanks a lot @ashish-amarnath! Happy to test the new image when it's available.
Regarding the steps 5,6,7,8 from https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity, we have this setup correctly:
5, 6, 8. Verified in https://github.com/vmware-tanzu/velero/issues/2303#issuecomment-593683303
$ gcloud iam service-accounts get-iam-policy [email protected]
bindings:
- members:
  - serviceAccount:foo.svc.id.goog[velero/velero]
  role: roles/iam.workloadIdentityUser
etag: BwWe4OomD8Y=
version: 1
Hi @ashish-amarnath, just wanted to let you know that, for completeness, I tried running velero:v1.3.1 and velero-plugin-for-gcp:v1.0.1 but unfortunately I got to the same state.
Hey folks, I found the issue (and workaround):
DNS resolution within the pod wasn't working:
nobody@velero-599bf9ff5d-lgtpd:/$ getent hosts metadata.google.internal
I spun up the workload-identity-test pod following the GCP documentation to test Workload Identity:
$ kubectl -n velero exec workload-identity-test -it -- /bin/bash
root@workload-identity-test:/# curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
curl: (6) Could not resolve host: metadata.google.internal
So, I patched the Velero Deployment config, changing the dnsPolicy (which was ClusterFirst):
$ kubectl -n velero patch deployment velero -p '{"spec":{"template":{"spec":{"dnsPolicy": "Default"}}}}'
Which fixed the name resolution:
nobody@velero-7b87568cdc-rtsl9:/$ getent hosts metadata.google.internal
169.254.169.254 metadata.google.internal
After this, velero started working successfully using Workload Identity.