What steps did you take:
I am attempting to POC Velero for our environment. To do that, I am running a test backup and restore.
velero backup create --include-namespaces="tgrosinger" test4
velero restore create --from-backup test4
What did you expect to happen:
I expect the deployments/pods/pvs/etc. to all be recreated, and the volumes to be recreated from the snapshots.
What _actually_ happened:
The deployments/pods/pvs/etc. were all recreated, however no volumes were recreated from the snapshots. When describing a PV, an AWS volume ID is listed, however that volume does not exist in the AWS console.
The output of the following commands will help us better understand what's going on:
See Gist: https://gist.github.com/tgrosinger/73f41a3df902558253e809397421f809
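(For anyone reproducing this, the commands whose output tends to be most useful here are roughly the following; the backup name matches the one above, and the restore name is taken from the velero.io/restore-name label on the restored objects. --details on backup describe is what makes the volume snapshot entries show up:)
velero backup describe test4 --details
velero backup logs test4
velero restore describe test4-20191029110739
velero restore logs test4-20191029110739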
Environment:
velero version:
$ velero version
Client:
Version: v1.1.0
Git commit: a357f21aec6b39a8244dd23e469cc4519f1fe608
Server:
Version: v1.1.0
kubectl version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.11", GitCommit:"25074a190ef2a07d8b0ed38734f2cb373edfb868", GitTreeState:"clean", BuildDate:"2019-09-18T14:41:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.11", GitCommit:"25074a190ef2a07d8b0ed38734f2cb373edfb868", GitTreeState:"clean", BuildDate:"2019-09-18T14:34:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Hmm, strange - from the logs it looks like everything happened as expected in the velero backup/restore. Are you getting an error on the Kubernetes side indicating that the volume can't be found?
Can you provide the YAML for the PV, both as backed up and as restored? It looks like there's only one PV in question.
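(If it helps, one way to grab the backed-up copy is to pull the backup tarball down and look at the stored PV definition; assuming the default object store layout, something like:
velero backup download test4
tar -xzf test4-data.tar.gz
cat resources/persistentvolumes/cluster/pvc-0c45845b-fa76-11e9-be69-028d8d40c278.json
The restored copy is just kubectl get pv <name> -o yaml.)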
I am not seeing any errors, but let me know if there's somewhere in particular I should check. Here is the PV as restored:
$ kubectl describe pv pvc-0c45845b-fa76-11e9-be69-028d8d40c278
Name: pvc-0c45845b-fa76-11e9-be69-028d8d40c278
Labels: failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2a
velero.io/backup-name=test4
velero.io/restore-name=test4-20191029110739
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: gp2
Status: Bound
Claim: tgrosinger/myapp-storage
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 2Gi
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/zone in [us-west-2a]
failure-domain.beta.kubernetes.io/region in [us-west-2]
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://us-west-2a/vol-0d3936dd80c79b674
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
$ kubectl get pv -o yaml pvc-0c45845b-fa76-11e9-be69-028d8d40c278
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: aws-ebs-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
  creationTimestamp: "2019-10-29T18:07:40Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: us-west-2
    failure-domain.beta.kubernetes.io/zone: us-west-2a
    velero.io/backup-name: test4
    velero.io/restore-name: test4-20191029110739
  name: pvc-0c45845b-fa76-11e9-be69-028d8d40c278
  resourceVersion: "37561235"
  selfLink: /api/v1/persistentvolumes/pvc-0c45845b-fa76-11e9-be69-028d8d40c278
  uid: ff6bdca3-fa76-11e9-bab1-0234d75aa54c
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://us-west-2a/vol-0d3936dd80c79b674
  capacity:
    storage: 2Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: myapp-storage
    namespace: tgrosinger
    resourceVersion: "37561232"
    uid: ff6e2f24-fa76-11e9-bab1-0234d75aa54c
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-west-2a
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - us-west-2
  persistentVolumeReclaimPolicy: Delete
  storageClassName: gp2
  volumeMode: Filesystem
status:
  phase: Bound
And here's some data from the s3 bucket:
$ \cat test4-podvolumebackups.json
null
$ cat test4-volumesnapshots.json
[{"spec":{"backupName":"test4","backupUID":"88748795-fa76-11e9-bab1-0234d75aa54c","location":"default","PersistentVolumeName":"pvc-0c45845b-fa76-11e9-be69-028d8d40c278","providerVolumeID":"vol-08b68ab4b92a44b65","volumeType":"gp2","volumeAZ":"us-west-2a"},"status":{"providerSnapshotID":"snap-0cf45a07c3aaed683","phase":"Completed"}}]
$ cat test4/resources/persistentvolumes/cluster/pvc-0c45845b-fa76-11e9-be69-028d8d40c278.json
{"apiVersion":"v1","kind":"PersistentVolume","metadata":{"annotations":{"kubernetes.io/createdby":"aws-ebs-dynamic-provisioner","pv.kubernetes.io/bound-by-controller":"yes","pv.kubernetes.io/provisioned-by":"kubernetes.io/aws-ebs"},"creationTimestamp":"2019-10-29T18:00:58Z","finalizers":["kubernetes.io/pv-protection"],"labels":{"failure-domain.beta.kubernetes.io/region":"us-west-2","failure-domain.beta.kubernetes.io/zone":"us-west-2a"},"name":"pvc-0c45845b-fa76-11e9-be69-028d8d40c278","resourceVersion":"37558181","selfLink":"/api/v1/persistentvolumes/pvc-0c45845b-fa76-11e9-be69-028d8d40c278","uid":"0fa4e0b5-fa76-11e9-be69-028d8d40c278"},"spec":{"accessModes":["ReadWriteOnce"],"awsElasticBlockStore":{"fsType":"ext4","volumeID":"aws://us-west-2a/vol-08b68ab4b92a44b65"},"capacity":{"storage":"2Gi"},"claimRef":{"apiVersion":"v1","kind":"PersistentVolumeClaim","name":"myapp-storage","namespace":"tgrosinger","resourceVersion":"37558131","uid":"0c45845b-fa76-11e9-be69-028d8d40c278"},"nodeAffinity":{"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["us-west-2a"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["us-west-2"]}]}]}},"persistentVolumeReclaimPolicy":"Delete","storageClassName":"gp2","volumeMode":"Filesystem"},"status":{"phase":"Bound"}}
Is the pod that uses this PV running? Is there data in the mounted volume?
There is a single file in the volume so that I could test whether the volume was actually restored.
Here's the state of the pod. It's in ContainerCreating.
$ kubectl describe pod myapp-pod
Name: myapp-pod
Namespace: tgrosinger
Priority: 0
PriorityClassName: <none>
Node: ip-10-224-62-172.us-west-2.compute.internal/10.224.62.172
Start Time: Tue, 29 Oct 2019 11:07:41 -0700
Labels: app=myapp
velero.io/backup-name=test4
velero.io/restore-name=test4-20191029110739
Annotations: cni.projectcalico.org/podIP: 100.97.121.85/32
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"myapp"},"name":"myapp-pod","namespace":"tgrosinger"},"spec":...
Status: Pending
IP:
Containers:
myapp-container:
Container ID:
Image: busybox
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
echo Hello Kubernetes! && sleep 3600
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/mystorage from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-prsnn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: myapp-storage
ReadOnly: false
default-token-prsnn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-prsnn
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedAttachVolume 54s (x792 over 26h) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-0c45845b-fa76-11e9-be69-028d8d40c278" : InvalidVolume.NotFound: The volume 'vol-0d3936dd80c79b674' does not exist.
status code: 400, request id: e7fcce3d-8816-49aa-b12c-5e2c18712287
Warning FailedMount 36s (x34 over 75m) kubelet, ip-10-224-62-172.us-west-2.compute.internal Unable to mount volumes for pod "myapp-pod_tgrosinger(ff72e1d0-fa76-11e9-bab1-0234d75aa54c)": timeout expired waiting for volumes to attach or mount for pod "tgrosinger"/"myapp-pod". list of unmounted volumes=[storage]. list of unattached volumes=[storage default-token-prsnn]
Hmm, OK - I don't see any indication of where that volume went.
Can you provide the output of velero snapshot-location get -o yaml?
Do you have any CloudTrail logs that refer to the missing volume (vol-0d3936dd80c79b674)?
It's possible there's something specific to kops going on, but I don't use it regularly, so I'm not sure off the top of my head what that might be. I believe this (https://velero.io/docs/v1.1.0/aws-config/#setting-aws_cluster_name-optional) is relevant for kops, but it shouldn't be an issue if you're not migrating across clusters.
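(If you did need to set it, it's just an environment variable on the Velero deployment, along the lines of the following; <your-cluster-name> is a placeholder for the kops cluster name, and again this should only matter when migrating across clusters:)
kubectl -n velero set env deployment/velero AWS_CLUSTER_NAME=<your-cluster-name>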
$ velero snapshot-location get -o yaml
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  creationTimestamp: 2019-10-24T19:20:56Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "34283503"
  selfLink: /apis/velero.io/v1/namespaces/velero/volumesnapshotlocations/default
  uid: 674a294f-f693-11e9-be69-028d8d40c278
spec:
  config:
    region: us-west-2
  provider: aws
status: {}
I did find this in cloudtrail:
{
"eventVersion": "1.05",
"userIdentity": {
"type": "AssumedRole",
"principalId": "**********:deb540d4-velero",
"arn": "arn:aws:sts::302270543666:assumed-role/velero/deb540d4-velero",
"accountId": "**********",
"accessKeyId": "**********",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "**********",
"arn": "arn:aws:iam::**********:role/velero",
"accountId": "**********",
"userName": "velero"
},
"webIdFederationData": {},
"attributes": {
"mfaAuthenticated": "false",
"creationDate": "2019-10-29T18:00:38Z"
}
}
},
"eventTime": "2019-10-29T18:07:40Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "CreateVolume",
"awsRegion": "us-west-2",
"sourceIPAddress": "AWS Internal",
"userAgent": "aws-sdk-go/1.13.12 (go1.12.9; linux; amd64)",
"requestParameters": {
"snapshotId": "snap-0cf45a07c3aaed683",
"zone": "us-west-2a",
"volumeType": "gp2",
"encrypted": true,
"tagSpecificationSet": {
"items": [
{
"resourceType": "volume",
"tags": [
{
"key": "Project",
"value": "i-0b66e221ccd74ecb6"
},
{
"key": "KubernetesCluster",
"value": "pdx.he.extrahop.com"
},
{
"key": "kubernetes.io/created-for/pvc/namespace",
"value": "tgrosinger"
},
{
"key": "kubernetes.io/created-for/pv/name",
"value": "pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "kubernetes.io/created-for/pvc/name",
"value": "myapp-storage"
},
{
"key": "Name",
"value": "pdx.he.extrahop.com-dynamic-pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "Username",
"value": "i-0b66e221ccd74ecb6"
},
{
"key": "kubernetes.io/cluster/pdx.he.extrahop.com",
"value": "owned"
},
{
"key": "velero.io/pv",
"value": "pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "velero.io/backup",
"value": "test4"
},
{
"key": "LastStateChange",
"value": "{\"device\": \"/dev/xvdbz\", \"user\": \"i-0b66e221ccd74ecb6\", \"time\": \"20191029T180351Z\"}"
},
{
"key": "velero.io/storage-location",
"value": "default"
}
]
}
]
}
},
"responseElements": {
"requestId": "581edd06-c265-4f42-a680-bcc19d2afd62",
"volumeId": "vol-0d3936dd80c79b674",
"size": "2",
"snapshotId": "snap-0cf45a07c3aaed683",
"zone": "us-west-2a",
"status": "creating",
"createTime": 1572372460000,
"volumeType": "gp2",
"iops": 100,
"encrypted": true,
"masterEncryptionKeyId": "arn:aws:kms:us-west-2:**********:key/93771dc4-4b8a-4773-a9d3-05eee0ec47bd",
"tagSet": {
"items": [
{
"key": "Project",
"value": "i-0b66e221ccd74ecb6"
},
{
"key": "KubernetesCluster",
"value": "pdx.he.extrahop.com"
},
{
"key": "kubernetes.io/created-for/pvc/namespace",
"value": "tgrosinger"
},
{
"key": "kubernetes.io/created-for/pv/name",
"value": "pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "kubernetes.io/created-for/pvc/name",
"value": "myapp-storage"
},
{
"key": "Name",
"value": "pdx.he.extrahop.com-dynamic-pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "Username",
"value": "i-0b66e221ccd74ecb6"
},
{
"key": "kubernetes.io/cluster/pdx.he.extrahop.com",
"value": "owned"
},
{
"key": "velero.io/pv",
"value": "pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "velero.io/backup",
"value": "test4"
},
{
"key": "LastStateChange",
"value": "{\"device\": \"/dev/xvdbz\", \"user\": \"i-0b66e221ccd74ecb6\", \"time\": \"20191029T180351Z\"}"
},
{
"key": "velero.io/storage-location",
"value": "default"
}
]
}
},
"requestID": "581edd06-c265-4f42-a680-bcc19d2afd62",
"eventID": "d783e5b4-4218-45f0-ab60-d41a102ee1d6",
"eventType": "AwsApiCall",
"recipientAccountId": "**********"
}
So the CreateVolume call certainly appears to have been triggered... no other CloudTrail events for that volume?
None that I was able to find. But let me know if you have a specific query in mind that I should run.
I think you should be able to do something like this to look for any events related to that volume (I'm not a CloudTrail expert):
aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=vol-0d3936dd80c79b674
No, that is the only CloudTrail event for that resource name.
Hmm, that's very strange. Have you tried this a few times and gotten the same result every time?
@tgrosinger just checking in to see if you were able to get any more information or if there's anything else we can do to help. This is definitely not typical behavior, but without gathering some more info on what's happening to that volume, I'm at a loss.
@skriss, thank you for following up. I really appreciate the help so far!
Let me try again today from the top and get back to you.
I started again from scratch after upgrading to Velero 1.2.0 and the AWS plugin 1.0.0, but had the same results.
Just in case it is relevant, I am using kube2iam. I am confident this is working, though, because it successfully writes to the bucket and creates the snapshot.
Please let me know if you can think of any more troubleshooting steps or information I can provide.
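(Aside, since kube2iam came up: the role kube2iam hands to the Velero pod comes from the pod annotation, so it's worth confirming the deployment's pod template carries something along these lines and that the role has the EC2/KMS permissions as well as S3; the deployment name and ARN below are the usual defaults, not taken from this cluster:)
kubectl -n velero get deployment velero -o yaml | grep iam.amazonaws.com/role
# expected something like: iam.amazonaws.com/role: arn:aws:iam::<account-id>:role/velero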
Sorry for the "me too" but I just came across this and am experiencing almost the exact same behavior. I ran through all of your steps (including cloud trail) and produced almost identical results. The pod gets stuck "container creating" with PVCs/PVs that /claim/ to be fine but the EBS volume they claim to be attached simply does not exist in any region I can find. All the values that I checked in the cloud trail correlate with my account, region, etc.
Some variables to consider; for reference, I'm running these versions:
Client:
Version: v1.2.0
Git commit: -
Server:
Version: v1.1.0
If this is too much to follow as a single ticket or irrelevant, let me know and I'll create a separate issue.
I was finally able to make my Vault restore work properly. I'm pretty sure the problem was the KMS keys on the encrypted volumes. The symptom of an inaccessible (or wrong) KMS key is that AWS claims to be creating the volume but then aborts and generates no detectable error that I can find. The status goes from "Creating..." to "That volume does not exist."
I manually tested this by setting my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the same credentials that Velero was using, then running an AWS CLI command to create a volume:
aws ec2 create-volume --availability-zone us-east-1a --encrypted --kms-key-id arn:aws:kms:us-east-1:XXX:key/XXX --snapshot-id snap-XXX --volume-type gp2 --tag-specifications ResourceType=volume,Tags="[{Key=Name,Value=rmills-volume-restore-test}]"
I noticed the same behavior that Velero was producing.
I had two compounding problems: 1) I was using a different key than the one originally used for the encrypted snapshot volume, and 2) the KMS key in AWS was not allowing the Velero IAM user to use it. Once those problems were fixed, it started working.
Moral of the story: make sure your KMS keys are working and correct first, then debug Velero.
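(A couple of quick checks along those lines, assuming the AWS CLI is using the same credentials/role as Velero; the snapshot and key IDs here are placeholders:)
# which CMK was the snapshot encrypted with?
aws ec2 describe-snapshots --snapshot-ids snap-XXX --query 'Snapshots[].KmsKeyId'
# is that key enabled, and does its key policy let the Velero principal use it?
aws kms describe-key --key-id <key-arn-or-id>
aws kms get-key-policy --key-id <key-arn-or-id> --policy-name default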
@RichardMills thanks for that information! @tgrosinger does @RichardMills' explanation sound relevant for your scenario?
I'm going to close this out as inactive, feel free to reach out again as needed.
@skriss, thank you. Sorry I have not had a chance to get back to this. I will re-open if I try out the solution from Richard and am still having trouble.
@skriss just verified it is working after taking the suggestion from @RichardMills!
I needed to add kms:ReEncryptTo and kms:ReEncryptFrom for the key that is used by the Kubernetes cluster (which was also the key used to encrypt the snapshot).
Thanks both for your help.
@tgrosinger curious what your IAM policy looked like at the end of all of this?
FWIW, this is my full policy; excuse the Terraform variables for the bucket and KMS key ARNs.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "${aws_s3_bucket.me.arn}/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:GenerateDataKey*",
        "kms:Decrypt",
        "kms:ReEncryptTo",
        "kms:ReEncryptFrom"
      ],
      "Resource": [
        "${aws_kms_key.me.arn}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "${aws_s3_bucket.me.arn}"
      ]
    }
  ]
}
Mine looks very similar. The only difference is that I did not have "kms:GenerateDataKey*" or "kms:Decrypt".
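(If I'm reading that right, the KMS statement in that variant would be roughly the following; the key ARN is a placeholder for the cluster's EBS encryption key:)
{
  "Effect": "Allow",
  "Action": [
    "kms:ReEncryptTo",
    "kms:ReEncryptFrom"
  ],
  "Resource": "arn:aws:kms:us-west-2:<account-id>:key/<key-id>"
}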