What steps did you take and what happened:
Note: I'm deploying Velero in a GKE private cluster. Images for velero and velero-plugin-for-gcp have been copied over to the internal GCR repo. We're also using WorkloadIdentity.
velero install --image gcr.io/foo/velero:v1.2.0 --provider gcp --plugins gcr.io/foo/velero-plugin-for-gcp:v1.0.0 --bucket $BUCKET --no-secret --sa-annotations iam.gke.io/[email protected] --backup-location-config [email protected]
Results in the following error, and the server halts:
An error occurred: some backup storage locations are invalid: error getting backup store for location "default": unable to locate ObjectStore plugin named velero.io/gcp
More details:
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: 2020-02-24T04:44:56Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "3283695"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 68071597-56c0-11ea-91f8-4201ac107009
spec:
  config:
    serviceAccount: [email protected]
  objectStorage:
    bucket: <foo_bucket>
  provider: gcp
status: {}
What did you expect to happen:
Velero up and running ready to receive other instructions.
The output of the following commands will help us better understand what's going on:
kubectl logs deployment/velero -n velero: https://gist.github.com/guilledipa/1c071d053eb9bb57a7f4c1e6e4d56cd8
Anything else you would like to add:
Deleting the default BackupStorageLocation and replacing it with the same definition, named gcp, fixes the reported issue; however, a new error appears later on:
$ kubectl -n velero delete backupstoragelocation default
$ kubectl apply -f <(echo -n "
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  labels:
    component: velero
  name: gcp
  namespace: velero
spec:
  config:
    name: gcp
    serviceAccount: [email protected]
  objectStorage:
    bucket: <foo_bucket>
  provider: velero.io/gcp")
backupstoragelocation.velero.io/gcp unchanged
Server starts but other errors appear:
time="2020-02-27T08:40:25Z" level=error msg="Error getting backup store for this location" backupLocation=gcp controller=backup-sync error="unable to locate ObjectStore plugin named velero.io/gcp" logSource="pkg/controller/backup_sync_controller.go:167"
Environment:
velero version: Client:
Version: v1.2.0
Git commit: 5d008491bbf681658d3e372da1a9d3a21ca4c03c
Server:
Version: v1.2.0
velero client config get features: features: <NOT SET>
kubectl version: Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.9-gke.9", GitCommit:"a9973cbb2722793e2ea08d20880633ca61d3e669", GitTreeState:"clean", BuildDate:"2020-02-07T22:35:02Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes installer & version: 1.15.9-gke.9
Cloud provider or hardware configuration: GKE
Looks like the velero-plugin-for-gcp was not registered or available for velero to call into. To troubleshoot further, can you please share:
kubectl -n velero get deploy velero -ojson | jq .spec.template.spec.initContainers
If jq is not available, then kubectl -n velero get deploy velero -oyaml
$ # exec into the running velero pod
$ kubectl -n velero exec -ti deploy/velero bash
$ # inspect contents of the /plugins directory
nobody@velero-6758d49c44-7pt5h:/$ pwd
/
nobody@velero-6758d49c44-7pt5h:/$ ls /plugins/
Here is the requested data:
$ kubectl -n velero get deploy velero -ojson | jq .spec.template.spec.initContainers
[
  {
    "image": "gcr.io/foo/velero-plugin-for-gcp:v1.0.0",
    "imagePullPolicy": "IfNotPresent",
    "name": "velero-plugin-for-gcp",
    "resources": {},
    "terminationMessagePath": "/dev/termination-log",
    "terminationMessagePolicy": "File",
    "volumeMounts": [
      {
        "mountPath": "/target",
        "name": "plugins"
      }
    ]
  }
]
Regarding the contents of /plugins: because the pod is in CrashLoopBackOff state, I applied the "workaround" suggested in the "Anything else you would like to add" section in order to have a running pod. Here are the results:
nobody@velero-7c7f5978b5-jtdrn:/$ ls plugins/
nobody@velero-7c7f5978b5-jtdrn:/$
Thanks for the quick response :)
As I suspected, the plugins directory is empty. That is the reason Velero is not able to discover these plugins, resulting in the unable to locate ObjectStore plugin named velero.io/gcp error.
Is gcr.io/foo/velero-plugin-for-gcp:v1.0.0 an image you built? And can you please confirm that you are using this Dockerfile to build the image?
This line is what is responsible for copying the plugin binaries to a place where Velero can discover them: https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/Dockerfile#L27
gcr.io/foo/velero-plugin-for-gcp:v1.0.0 is just a copy of velero/velero-plugin-for-gcp:v1.0.0 pushed into our own GCR:
$ docker pull velero/velero-plugin-for-gcp:v1.0.0
$ docker tag <ID> gcr.io/foo/velero-plugin-for-gcp:v1.0.0
$ docker push gcr.io/foo/velero-plugin-for-gcp:v1.0.0
I'll verify the health of these images and report back
Thanks very much!
The plugin image looks correct to me:
$ docker run -it --entrypoint /bin/bash --user nobody gcr.io/foo/velero-plugin-for-gcp:v1.0.0
nobody@17f0811fc883:/$ ls
bin boot dev etc home lib lib64 media mnt opt plugins proc root run sbin srv sys tmp usr var
nobody@17f0811fc883:/$ ls plugins/
velero-plugin-for-gcp
Now, looking at the velero-plugin-for-gcp Dockerfile, I see that the last layer does:
ENTRYPOINT ["/bin/bash", "-c", "cp /plugins/* /target/."]
So the initContainer is basically copying everything under /plugins/ into /target/, which is a mount point declared in the initContainer spec:
initContainers:
- image: gcr.io/jirahw-gcr/velero-plugin-for-gcp:v1.0.0
  imagePullPolicy: IfNotPresent
  name: velero-plugin-for-gcp
  resources: {}
  volumeMounts:
  - mountPath: /target
    name: plugins
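To sanity-check my understanding of the hand-off, here's a rough local simulation (the temp dirs below are stand-ins I made up for this sketch, not the real pod filesystem):

```shell
# Stand-ins for the real pod paths (illustrative only):
#   plugin_image_dir ~ /plugins inside the plugin image
#   shared_volume    ~ the "plugins" emptyDir volume, mounted at /target in
#                      the init container and at /plugins in the velero container
plugin_image_dir=$(mktemp -d)
shared_volume=$(mktemp -d)
touch "$plugin_image_dir/velero-plugin-for-gcp"

# This is effectively what the init container's ENTRYPOINT does:
cp "$plugin_image_dir"/* "$shared_volume"/.

# The velero container would then see the binary in its /plugins directory:
ls "$shared_volume"
```

So if the real /plugins ends up empty, either the init container never performed the copy, or the binary wasn't in the image to begin with.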
However, aren't we expecting the plugin to be deployed in /plugins within the velero pod, as per https://github.com/vmware-tanzu/velero/issues/2303#issuecomment-592134355 ?
Sorry, I see that in the velero container we're mounting the volume named plugins at /plugins.
That said, the /plugins directory is still empty.
We had a broken image in GCR and even though we pushed new ones (healthy), the deployment wasn't pulling the latest ones. I made the following change:
initContainers:
- image: gcr.io/jirahw-gcr/velero-plugin-for-gcp:v1.0.0
  imagePullPolicy: Always <----- HERE
  name: velero-plugin-for-gcp
  resources: {}
  volumeMounts:
  - mountPath: /target
    name: plugins
At the moment the velero pod starts up correctly, however, it sits at this point indefinitely:
time="2020-02-28T04:09:32Z" level=info msg="setting log-level to INFO" logSource="pkg/cmd/server/server.go:171"
time="2020-02-28T04:09:32Z" level=info msg="Starting Velero server v1.2.0 (5d008491bbf681658d3e372da1a9d3a21ca4c03c)" logSource="pkg/cmd/server/server.go:173"
time="2020-02-28T04:09:32Z" level=info msg="No feature flags enabled" logSource="pkg/cmd/server/server.go:177"
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pv
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=BackupItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service-account
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/add-pv-from-pvc
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/add-pvc-from-pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/change-storage-class
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/cluster-role-bindings
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/job
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/pod
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/restic
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/role-bindings
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/velero kind=RestoreItemAction logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/service-account
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/plugins/velero-plugin-for-gcp kind=VolumeSnapshotter logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/gcp
time="2020-02-28T04:09:32Z" level=info msg="registering plugin" command=/plugins/velero-plugin-for-gcp kind=ObjectStore logSource="pkg/plugin/clientmgmt/registry.go:100" name=velero.io/gcp
time="2020-02-28T04:09:32Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-02-28T04:09:32Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-02-28T04:09:35Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-02-28T04:09:35Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-02-28T04:09:35Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"
Given that we're using Workload Identity, might this be related?
I am not familiar with how workload identity works.
At this point, velero is trying to connect to the backup storage locations as part of validation.
The reason for no progress could be that the backup storage location may be taking too long to respond, or it may be stuck in the process of fetching credentials, if that's what workload identity does.
It does look like the GCP plugins are registering correctly now.
Re: workload identity - we have some docs on this at https://github.com/vmware-tanzu/velero-plugin-for-gcp#option-2-set-permissions-with-using-workload-identity-optional and https://github.com/vmware-tanzu/velero-plugin-for-gcp#install-and-start-velero - maybe take another look and ensure everything's configured correctly?
Thanks folks!
I double-checked the documentation and our configs LGTM:
Here are the IAM policies:
$ gcloud iam service-accounts get-iam-policy [email protected]
bindings:
- members:
  - serviceAccount:foo.svc.id.goog[velero/velero]
  role: roles/iam.workloadIdentityUser
This is the list of permissions assigned to the velero IAM service account serviceAccount:[email protected]:
- compute.disks.get
- compute.disks.create
- compute.disks.createSnapshot
- compute.snapshots.get
- compute.snapshots.create
- compute.snapshots.useReadOnly
- compute.snapshots.delete
- compute.zones.get
Here is the ServiceAccount annotation:
$ kubectl -n velero get serviceaccount velero -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: [email protected]
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{"iam.gke.io/gcp-service-account":"[email protected]"},"creationTimestamp":null,"labels":{"component":"velero"},"name":"velero","namespace":"velero"}}
  creationTimestamp: "2020-02-28T04:09:29Z"
  labels:
    component: velero
  name: velero
  namespace: velero
  resourceVersion: "4807231"
  selfLink: /api/v1/namespaces/velero/serviceaccounts/velero
  uid: 29925587-17d1-42a7-9d7d-52b0f78b523c
secrets:
- name: velero-token-jrqdf
Here is the BSL definition:
$ kubectl -n velero get BackupStorageLocation default -o yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"velero.io/v1","kind":"BackupStorageLocation","metadata":{"annotations":{},"creationTimestamp":null,"labels":{"component":"velero"},"name":"default","namespace":"velero"},"spec":{"config":{"serviceAccount":"[email protected]"},"objectStorage":{"bucket":"velero-foo"},"provider":"gcp"}}
  creationTimestamp: "2020-02-28T04:09:29Z"
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "4807233"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 7d05d762-38e6-4aba-9092-ce02b8398a9e
spec:
  config:
    serviceAccount: [email protected]
  objectStorage:
    bucket: velero-foo
  provider: gcp
Is there any way I can bump up the logging to understand where velero is failing to fetch credentials?
Hey folks, any ideas on what might be going on?
Please let me know if I can provide more information.
Cheers,
Hi @guilledipa, sorry about the delay in responding here. Can you please share how you are installing Velero? If you are using the velero install CLI, can you please share the command you are using?
Also, yaml for your velero deployment would be useful.
Thanks @ashish-amarnath! No worries at all :)
We use Anthos to manage the objects in GKE. Therefore we use velero install ... --dry-run -o yaml to generate this file:
https://gist.github.com/guilledipa/25a64d86bedf8c2364f28db302c707da
(which is enforced by Anthos).
The command is:
velero install --image gcr.io/foo/velero:v1.2.0 --provider gcp --plugins gcr.io/foo/velero-plugin-for-gcp:v1.0.0 --bucket velero-foo --no-secret --sa-annotations iam.gke.io/[email protected] --backup-location-config [email protected] --dry-run -o yaml
This is what the velero-plugin-for-gcp may be waiting on: https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/velero-plugin-for-gcp/object_store.go#L78
Most likely waiting here https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/master/velero-plugin-for-gcp/object_store.go#L139
I'd suggest adding some log messages around that code to investigate. I can build a custom image with instrumentation for you to try. It is almost EOD for me today. I can do this first up tomorrow.
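On the earlier question about bumping up the logging: if I remember right (treat this as an assumption to verify against your Velero version), the server accepts a --log-level flag — your startup logs do show "setting log-level to INFO" — so raising it to debug via the velero container's args in the Deployment might surface where validation gets stuck. Something like this fragment (image name mirrors the one used in this thread):

```yaml
# Sketch of the relevant part of the velero Deployment spec;
# the extra arg is passed through to `velero server`.
containers:
- name: velero
  image: gcr.io/foo/velero:v1.2.0
  args:
  - server
  - --log-level=debug
```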
Also suggest checking steps 5,6,7 and 8 from https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity in case you've not already done that.
Thanks a lot @ashish-amarnath! Happy to test the new image when it's available.
Regarding the steps 5,6,7,8 from https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity, we have this setup correctly:
5, 6, 8. Verified in https://github.com/vmware-tanzu/velero/issues/2303#issuecomment-593683303
$ gcloud iam service-accounts get-iam-policy [email protected]
bindings:
- members:
  - serviceAccount:foo.svc.id.goog[velero/velero]
  role: roles/iam.workloadIdentityUser
etag: BwWe4OomD8Y=
version: 1
Hi @ashish-amarnath, just wanted to let you know that, for completeness, I tried running velero:v1.3.1 and velero-plugin-for-gcp:v1.0.1 but unfortunately I got to the same state.
Hey folks, I found the issue (and workaround):
DNS resolution within the pod wasn't working:
nobody@velero-599bf9ff5d-lgtpd:/$ getent hosts metadata.google.internal
I spun up the workload-identity-test pod following the GCP documentation to test Workload Identity:
$ kubectl -n velero exec workload-identity-test -it -- /bin/bash
root@workload-identity-test:/# curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
curl: (6) Could not resolve host: metadata.google.internal
So, I patched the Velero Deployment config, changing the dnsPolicy (which was ClusterFirst):
$ kubectl -n velero patch deployment velero -p '{"spec":{"template":{"spec":{"dnsPolicy": "Default"}}}}'
Which fixed the name resolution:
nobody@velero-7b87568cdc-rtsl9:/$ getent hosts metadata.google.internal
169.254.169.254 metadata.google.internal
After this, velero started working successfully using Workload Identity.