Velero: "Error getting backup store for this location"

Created on 27 Feb 2020 · 19 comments · Source: vmware-tanzu/velero

What steps did you take and what happened:

Note: I'm deploying Velero in a GKE private cluster. Images for velero and velero-plugin-for-gcp have been copied over to the internal GCR repo. We're also using WorkloadIdentity.

velero install --image gcr.io/foo/velero:v1.2.0 --provider gcp --plugins gcr.io/foo/velero-plugin-for-gcp:v1.0.0 --bucket $BUCKET --no-secret --sa-annotations iam.gke.io/[email protected] --backup-location-config [email protected]

Results in the following error, and the server halts:

An error occurred: some backup storage locations are invalid: error getting backup store for location "default": unable to locate ObjectStore plugin named velero.io/gcp
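One way to check whether the plugin ever registered (assuming the pod stays up long enough to produce logs) is to grep the server log for the registration lines Velero prints at startup:

```shell
# List the plugins the Velero server registered at startup.
# If velero.io/gcp is absent here, the plugin binary never made it
# into the /plugins directory of the velero container.
kubectl -n velero logs deploy/velero | grep 'registering plugin'
```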

More details:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: 2020-02-24T04:44:56Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "3283695"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 68071597-56c0-11ea-91f8-4201ac107009
spec:
  config:
    serviceAccount: [email protected]
  objectStorage:
    bucket: <foo_bucket>
  provider: gcp
status: {}

What did you expect to happen:
Velero up and running, ready to receive other instructions.


Anything else you would like to add:

Deleting the default BackupStorageLocation and replacing it with the same definition named gcp fixes the reported error; however, a new error appears later on:

$ kubectl -n velero delete backupstoragelocation default

$ kubectl apply -f <(echo -n "
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  labels:
    component: velero
  name: gcp
  namespace: velero
spec:
  config:
    name: gcp
    serviceAccount: [email protected]
  objectStorage:
    bucket: <foo_bucket>
  provider: velero.io/gcp")
backupstoragelocation.velero.io/gcp unchanged

Server starts but other errors appear:

time="2020-02-27T08:40:25Z" level=error msg="Error getting backup store for this location" backupLocation=gcp controller=backup-sync error="unable to locate ObjectStore plugin named velero.io/gcp" logSource="pkg/controller/backup_sync_controller.go:167"

Environment:

  • Velero version (use velero version):
Client:
    Version: v1.2.0
    Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
    Version: v1.2.0
  • Velero features (use velero client config get features):
features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.9-gke.9", GitCommit:"a9973cbb2722793e2ea08d20880633ca61d3e669", GitTreeState:"clean", BuildDate:"2020-02-07T22:35:02Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: 1.15.9-gke.9

  • Cloud provider or hardware configuration: GKE

Labels: Area/Cloud/GCP, Area/Plugins, Needs info, Needs investigation

All 19 comments

Looks like the velero-plugin-for-gcp was not registered or available for velero to call into. To troubleshoot further, can you please share:

  • Deployment of velero, specifically output of
    kubectl -n velero get deploy velero -ojson | jq .spec.template.spec.initContainers
    if jq is not available, then
    kubectl -n velero get deploy velero -oyaml
  • Contents of the plugins directory in the velero pod. You can do this by running the below commands:
$ # exec into the running velero pod
$ kubectl -n velero exec -ti deploy/velero bash
$ # inspect contents of the /plugins directory
nobody@velero-6758d49c44-7pt5h:/$ pwd
/
nobody@velero-6758d49c44-7pt5h:/$ ls /plugins/

Here is the requested data:

$ kubectl -n velero get deploy velero -ojson | jq .spec.template.spec.initContainers
[
  {
    "image": "gcr.io/foo/velero-plugin-for-gcp:v1.0.0",
    "imagePullPolicy": "IfNotPresent",
    "name": "velero-plugin-for-gcp",
    "resources": {},
    "terminationMessagePath": "/dev/termination-log",
    "terminationMessagePolicy": "File",
    "volumeMounts": [
      {
        "mountPath": "/target",
        "name": "plugins"
      }
    ]
  }
]

For the contents of /plugins: because the pod is in a CrashLoopBackOff state, I applied the "workaround" suggested in the "Anything else you would like to add" section in order to have a running Pod. Here are the results:

nobody@velero-7c7f5978b5-jtdrn:/$ ls plugins/
nobody@velero-7c7f5978b5-jtdrn:/$

Thanks for the quick response :)

As I suspected, the plugins directory is empty. That is why velero is not able to discover these plugins, resulting in the unable to locate ObjectStore plugin named velero.io/gcp error.
Is gcr.io/foo/velero-plugin-for-gcp:v1.0.0 an image you built? And can you please confirm whether you are using this Dockerfile to build the image?
This line is responsible for copying the plugin binaries to a place where velero can discover them: https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/Dockerfile#L27
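As a quick sanity check (a suggestion, using the mirrored image name from this thread), the entrypoint baked into the image can be inspected without running it:

```shell
# Print the ENTRYPOINT of the plugin image; a healthy copy of the
# upstream image should show the cp command that populates /target.
docker inspect --format '{{json .Config.Entrypoint}}' \
  gcr.io/foo/velero-plugin-for-gcp:v1.0.0
```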

gcr.io/foo/velero-plugin-for-gcp:v1.0.0 is just a copy of velero/velero-plugin-for-gcp:v1.0.0 pushed into our own GCR:

$ docker pull velero/velero-plugin-for-gcp:v1.0.0
$ docker tag velero/velero-plugin-for-gcp:v1.0.0 gcr.io/foo/velero-plugin-for-gcp:v1.0.0
$ docker push gcr.io/foo/velero-plugin-for-gcp:v1.0.0

I'll verify the health of these images and report back

Thanks very much!

The plugin image looks correct to me:

$ docker run -it --entrypoint /bin/bash --user nobody  gcr.io/foo/velero-plugin-for-gcp:v1.0.0
nobody@17f0811fc883:/$ ls
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  plugins  proc  root  run  sbin  srv  sys  tmp  usr  var
nobody@17f0811fc883:/$ ls plugins/
velero-plugin-for-gcp

Now, looking at the velero-plugin-for-gcp Dockerfile, I see that in the last layer we do:
ENTRYPOINT ["/bin/bash", "-c", "cp /plugins/* /target/."]

So, the initContainer is basically copying everything under /plugins/ to /target/, which is a mountpoint declared in the initContainer spec:

        initContainers:
        - image: gcr.io/jirahw-gcr/velero-plugin-for-gcp:v1.0.0
          imagePullPolicy: IfNotPresent
          name: velero-plugin-for-gcp
          resources: {}
          volumeMounts:
          - mountPath: /target
            name: plugins

However, aren't we expecting the plugin to get deployed in /plugins within the velero Pod, as per https://github.com/vmware-tanzu/velero/issues/2303#issuecomment-592134355 ?

Sorry, I see that in the velero container we're mounting the volume named plugins at /plugins.

That said, the /plugins directory is still empty

We had a broken image in GCR, and even though we pushed new (healthy) ones, the deployment wasn't pulling the latest. I made the following change:

        initContainers:
        - image: gcr.io/jirahw-gcr/velero-plugin-for-gcp:v1.0.0
          imagePullPolicy: Always                <----- HERE
          name: velero-plugin-for-gcp
          resources: {}
          volumeMounts:
          - mountPath: /target
            name: plugins
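For reference, the same change can be applied with a JSON patch instead of editing the manifest by hand (a sketch; index 0 assumes the GCP plugin is the only initContainer, as in the spec above):

```shell
# Flip the initContainer's imagePullPolicy to Always so a re-pushed
# tag is re-pulled on the next pod restart.
kubectl -n velero patch deployment velero --type=json -p \
  '[{"op":"replace","path":"/spec/template/spec/initContainers/0/imagePullPolicy","value":"Always"}]'
```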

Now the velero pod starts up correctly; however, it sits at this point indefinitely:

time="2020-02-28T04:09:32Z" level=info msg="setting log-level to INFO" logSource="pkg/cmd/server/server.go:171"
time="2020-02-28T04:09:32Z" level=info msg="Starting Velero server v1.2.0 (5d008491bbf681658d3e372da1a9d3a21ca4c03c)" logSource="pkg/cmd/server/server.go:173"
time="2020-02-28T04:09:32Z" level=info msg="No feature flags enabled" logSource="pkg/cmd/server/server.go:177"
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pv
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service-account
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/add-pv-from-pvc
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/add-pvc-from-pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/change-storage-class
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/cluster-role-bindings
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/job
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/restic
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/role-bindings
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service-account
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/plugins/velero-plugin-for-gcp kind=VolumeSnapshotter logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/gcp
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/plugins/velero-plugin-for-gcp kind=ObjectStore logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/gcp
time="2020-02-28T04:09:32Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-02-28T04:09:32Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-02-28T04:09:35Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-02-28T04:09:35Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-02-28T04:09:35Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"

Given that we're using Workload Identity, might this be related?

I am not familiar with how workload identity works.
At this point, velero is trying to connect to the backup storage locations as part of validation.
The reason for the lack of progress could be that the backup storage location is taking too long to respond, or that it is stuck in the process of fetching credentials, if that's what workload identity does.

It does look like the GCP plugins are registering correctly now.

Re: workload identity - we have some docs on this at https://github.com/vmware-tanzu/velero-plugin-for-gcp#option-2-set-permissions-with-using-workload-identity-optional and https://github.com/vmware-tanzu/velero-plugin-for-gcp#install-and-start-velero - maybe take another look and ensure everything's configured correctly?

Thanks folks!

I double-checked the documentation and our configs LGTM.

Here are the IAM policies:

$ gcloud iam service-accounts get-iam-policy [email protected]
bindings:
- members:
  - serviceAccount:foo.svc.id.goog[velero/velero]
  role: roles/iam.workloadIdentityUser

This is the list of permissions assigned to the velero IAM service account ([email protected]):

- compute.disks.get
- compute.disks.create
- compute.disks.createSnapshot
- compute.snapshots.get
- compute.snapshots.create
- compute.snapshots.useReadOnly
- compute.snapshots.delete
- compute.zones.get

Here is the ServiceAccount annotation:

$ kubectl -n velero get serviceaccount velero -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{"iam.gke.io/gcp-service-account":"[email protected]"},"creationTimestamp":null,"labels":{"component":"velero"},"name":"velero","namespace":"velero"}}
  creationTimestamp: "2020-02-28T04:09:29Z"
  labels:
    component: velero
  name: velero
  namespace: velero
  resourceVersion: "4807231"
  selfLink: /api/v1/namespaces/velero/serviceaccounts/velero
  uid: 29925587-17d1-42a7-9d7d-52b0f78b523c
secrets:
- name: velero-token-jrqdf

Here is the BSL definition:

$ kubectl -n velero get BackupStorageLocation default -o yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"velero.io/v1","kind":"BackupStorageLocation","metadata":{"annotations":{},"creationTimestamp":null,"labels":{"component":"velero"},"name":"default","namespace":"velero"},"spec":{"config":{"serviceAccount":"[email protected]"},"objectStorage":{"bucket":"velero-foo"},"provider":"gcp"}}
  creationTimestamp: "2020-02-28T04:09:29Z"
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "4807233"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 7d05d762-38e6-4aba-9092-ce02b8398a9e
spec:
  config:
    serviceAccount: [email protected]
  objectStorage:
    bucket: velero-foo
  provider: gcp

Is there any way I can bump up the logging to understand where velero is failing to fetch credentials?
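Regarding raising the log level: the velero server accepts a --log-level flag, so one option (a sketch; the args index assumes the default install layout, where velero is the first container) is to append it with a patch:

```shell
# Append --log-level=debug to the server container's arguments;
# the pod restarts with verbose logging.
kubectl -n velero patch deployment velero --type=json -p \
  '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--log-level=debug"}]'
```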

Hey folks, any ideas on what might be going on?

Please let me know if I can provide more information.

Cheers,

Hi @guilledipa sorry about the delay in responding here. Can you please share how you are installing Velero, if you are using the velero install cli then can you please share the command you are using?

Also, yaml for your velero deployment would be useful.

Thanks @ashish-amarnath! No worries at all :)

We use Anthos to manage the objects in GKE, so we use velero install ... --dry-run -o yaml to generate this file (which is then enforced by Anthos):
https://gist.github.com/guilledipa/25a64d86bedf8c2364f28db302c707da

The command is:

velero install --image gcr.io/foo/velero:v1.2.0 --provider gcp --plugins gcr.io/foo/velero-plugin-for-gcp:v1.0.0 --bucket velero-foo --no-secret --sa-annotations iam.gke.io/[email protected] --backup-location-config [email protected] --dry-run -o yaml

This is what the velero-plugin-for-gcp may be waiting on: https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/velero-plugin-for-gcp/object_store.go#L78
Most likely it is waiting here: https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/velero-plugin-for-gcp/object_store.go#L139

I'd suggest adding some log messages around that code to investigate. I can build a custom image with instrumentation for you to try. It is almost EOD for me today, so I can do this first thing tomorrow.

Also suggest checking steps 5,6,7 and 8 from https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity in case you've not already done that.

Thanks a lot @ashish-amarnath! Happy to test the new image when it's available.

Regarding the steps 5,6,7,8 from https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity, we have this setup correctly:

5, 6, 8. Verified in https://github.com/vmware-tanzu/velero/issues/2303#issuecomment-593683303

7. IAM bindings:
$ gcloud iam service-accounts get-iam-policy [email protected]
bindings:
- members:
  - serviceAccount:foo.svc.id.goog[velero/velero]
  role: roles/iam.workloadIdentityUser
etag: BwWe4OomD8Y=
version: 1
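For anyone reproducing this, that binding is the one produced by a command along these lines (per the GKE Workload Identity docs; the member string is from this thread, while SA_EMAIL is a placeholder because the service-account email is redacted here):

```shell
# Allow the velero Kubernetes service account (namespace velero) to
# impersonate the velero GCP service account via Workload Identity.
# SA_EMAIL is a placeholder for the (redacted) email in this thread.
SA_EMAIL="velero@PROJECT_ID.iam.gserviceaccount.com"
gcloud iam service-accounts add-iam-policy-binding "$SA_EMAIL" \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:foo.svc.id.goog[velero/velero]"
```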

Hi @ashish-amarnath, just wanted to let you know that, for completeness, I tried running velero:v1.3.1 and velero-plugin-for-gcp:v1.0.1, but unfortunately I ended up in the same state.

Hey folks, I found the issue (and workaround):

DNS resolution within the pod wasn't working:

nobody@velero-599bf9ff5d-lgtpd:/$ getent hosts metadata.google.internal

I spun up the workload-identity-test Pod following the GCP documentation to test WorkloadIdentity:

$ kubectl -n velero exec  workload-identity-test -it -- /bin/bash

root@workload-identity-test:/# curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
curl: (6) Could not resolve host: metadata.google.internal

So, I patched the Velero Deployment config, changing the dnsPolicy (which was ClusterFirst):

$ kubectl -n velero patch deployment velero -p '{"spec":{"template":{"spec":{"dnsPolicy": "Default"}}}}'

Which fixed the name resolution:

nobody@velero-7b87568cdc-rtsl9:/$ getent hosts metadata.google.internal
169.254.169.254 metadata.google.internal

After this, velero started working successfully using WorkloadIdentity.
