What steps did you take:
I am attempting to POC Velero for our environment. To do that, I am running a test backup and restore.
velero backup create --include-namespaces="tgrosinger" test4
velero restore create --from-backup test4
What did you expect to happen:
I expect the deployments/pods/pvs/etc. to all be recreated, and the volumes to be recreated from the snapshots.
What _actually_ happened:
The deployments/pods/pvs/etc. were all recreated, however no volumes were recreated from the snapshots. When describing a PV, an AWS volume ID is listed, however that volume does not exist in the AWS console.
The output of the following commands will help us better understand what's going on:
See Gist: https://gist.github.com/tgrosinger/73f41a3df902558253e809397421f809
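(For anyone reproducing this, the commands whose output tends to be most useful here are roughly the following; the backup name matches the one above, and the restore name is taken from the velero.io/restore-name label on the restored objects. --details on backup describe is what makes the volume snapshot entries show up:)
velero backup describe test4 --details
velero backup logs test4
velero restore describe test4-20191029110739
velero restore logs test4-20191029110739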
Environment:
velero version:
$ velero version
Client:
Version: v1.1.0
Git commit: a357f21aec6b39a8244dd23e469cc4519f1fe608
Server:
Version: v1.1.0
kubectl version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.11", GitCommit:"25074a190ef2a07d8b0ed38734f2cb373edfb868", GitTreeState:"clean", BuildDate:"2019-09-18T14:41:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.11", GitCommit:"25074a190ef2a07d8b0ed38734f2cb373edfb868", GitTreeState:"clean", BuildDate:"2019-09-18T14:34:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Hmm, strange - from the logs it looks like everything happened as expected in the velero backup/restore. Are you getting an error on the Kubernetes side indicating that the volume can't be found?
Can you provide the YAML for the PV, both as backed up and as restored? It looks like there's only one PV in question.
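(If it helps, one way to grab the backed-up copy is to pull the backup tarball down and look at the stored PV definition; assuming the default object store layout, something like:
velero backup download test4
tar -xzf test4-data.tar.gz
cat resources/persistentvolumes/cluster/pvc-0c45845b-fa76-11e9-be69-028d8d40c278.json
The restored copy is just kubectl get pv <name> -o yaml.)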
I am not seeing any errors, but let me know if there's somewhere in particular I should check. Here is the PV as restored:
$ kubectl describe pv pvc-0c45845b-fa76-11e9-be69-028d8d40c278
Name: pvc-0c45845b-fa76-11e9-be69-028d8d40c278
Labels: failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2a
velero.io/backup-name=test4
velero.io/restore-name=test4-20191029110739
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: gp2
Status: Bound
Claim: tgrosinger/myapp-storage
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 2Gi
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/zone in [us-west-2a]
failure-domain.beta.kubernetes.io/region in [us-west-2]
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://us-west-2a/vol-0d3936dd80c79b674
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
$ kubectl get pv -o yaml pvc-0c45845b-fa76-11e9-be69-028d8d40c278
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: aws-ebs-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
  creationTimestamp: "2019-10-29T18:07:40Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: us-west-2
    failure-domain.beta.kubernetes.io/zone: us-west-2a
    velero.io/backup-name: test4
    velero.io/restore-name: test4-20191029110739
  name: pvc-0c45845b-fa76-11e9-be69-028d8d40c278
  resourceVersion: "37561235"
  selfLink: /api/v1/persistentvolumes/pvc-0c45845b-fa76-11e9-be69-028d8d40c278
  uid: ff6bdca3-fa76-11e9-bab1-0234d75aa54c
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://us-west-2a/vol-0d3936dd80c79b674
  capacity:
    storage: 2Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: myapp-storage
    namespace: tgrosinger
    resourceVersion: "37561232"
    uid: ff6e2f24-fa76-11e9-bab1-0234d75aa54c
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-west-2a
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - us-west-2
  persistentVolumeReclaimPolicy: Delete
  storageClassName: gp2
  volumeMode: Filesystem
status:
  phase: Bound
And here's some data from the s3 bucket:
$ \cat test4-podvolumebackups.json
null
$ cat test4-volumesnapshots.json
[{"spec":{"backupName":"test4","backupUID":"88748795-fa76-11e9-bab1-0234d75aa54c","location":"default","PersistentVolumeName":"pvc-0c45845b-fa76-11e9-be69-028d8d40c278","providerVolumeID":"vol-08b68ab4b92a44b65","volumeType":"gp2","volumeAZ":"us-west-2a"},"status":{"providerSnapshotID":"snap-0cf45a07c3aaed683","phase":"Completed"}}]
$ cat test4/resources/persistentvolumes/cluster/pvc-0c45845b-fa76-11e9-be69-028d8d40c278.json
{"apiVersion":"v1","kind":"PersistentVolume","metadata":{"annotations":{"kubernetes.io/createdby":"aws-ebs-dynamic-provisioner","pv.kubernetes.io/bound-by-controller":"yes","pv.kubernetes.io/provisioned-by":"kubernetes.io/aws-ebs"},"creationTimestamp":"2019-10-29T18:00:58Z","finalizers":["kubernetes.io/pv-protection"],"labels":{"failure-domain.beta.kubernetes.io/region":"us-west-2","failure-domain.beta.kubernetes.io/zone":"us-west-2a"},"name":"pvc-0c45845b-fa76-11e9-be69-028d8d40c278","resourceVersion":"37558181","selfLink":"/api/v1/persistentvolumes/pvc-0c45845b-fa76-11e9-be69-028d8d40c278","uid":"0fa4e0b5-fa76-11e9-be69-028d8d40c278"},"spec":{"accessModes":["ReadWriteOnce"],"awsElasticBlockStore":{"fsType":"ext4","volumeID":"aws://us-west-2a/vol-08b68ab4b92a44b65"},"capacity":{"storage":"2Gi"},"claimRef":{"apiVersion":"v1","kind":"PersistentVolumeClaim","name":"myapp-storage","namespace":"tgrosinger","resourceVersion":"37558131","uid":"0c45845b-fa76-11e9-be69-028d8d40c278"},"nodeAffinity":{"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["us-west-2a"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["us-west-2"]}]}]}},"persistentVolumeReclaimPolicy":"Delete","storageClassName":"gp2","volumeMode":"Filesystem"},"status":{"phase":"Bound"}}
Is the pod that uses this PV running? Is there data in the mounted volume?
There is a single file in the volume so that I could test whether the volume was actually restored.
Here's the state of the pod. It's in ContainerCreating.
$ kubectl describe pod myapp-pod
Name: myapp-pod
Namespace: tgrosinger
Priority: 0
PriorityClassName: <none>
Node: ip-10-224-62-172.us-west-2.compute.internal/10.224.62.172
Start Time: Tue, 29 Oct 2019 11:07:41 -0700
Labels: app=myapp
velero.io/backup-name=test4
velero.io/restore-name=test4-20191029110739
Annotations: cni.projectcalico.org/podIP: 100.97.121.85/32
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"myapp"},"name":"myapp-pod","namespace":"tgrosinger"},"spec":...
Status: Pending
IP:
Containers:
myapp-container:
Container ID:
Image: busybox
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
echo Hello Kubernetes! && sleep 3600
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/mystorage from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-prsnn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: myapp-storage
ReadOnly: false
default-token-prsnn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-prsnn
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedAttachVolume 54s (x792 over 26h) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-0c45845b-fa76-11e9-be69-028d8d40c278" : InvalidVolume.NotFound: The volume 'vol-0d3936dd80c79b674' does not exist.
status code: 400, request id: e7fcce3d-8816-49aa-b12c-5e2c18712287
Warning FailedMount 36s (x34 over 75m) kubelet, ip-10-224-62-172.us-west-2.compute.internal Unable to mount volumes for pod "myapp-pod_tgrosinger(ff72e1d0-fa76-11e9-bab1-0234d75aa54c)": timeout expired waiting for volumes to attach or mount for pod "tgrosinger"/"myapp-pod". list of unmounted volumes=[storage]. list of unattached volumes=[storage default-token-prsnn]
Hmm, OK - I don't see any indication of where that volume went.
Can you provide the output of velero snapshot-location get -o yaml?
Do you have any CloudTrail logs that refer to the missing volume (vol-0d3936dd80c79b674)?
It's possible there's something specific to kops going on, but I don't use it regularly, so I'm not sure off the top of my head what that might be. I believe this (https://velero.io/docs/v1.1.0/aws-config/#setting-aws_cluster_name-optional) is relevant for kops, but it shouldn't be an issue if you're not migrating across clusters.
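(If you did need to set it, it's just an environment variable on the Velero deployment, along the lines of the following; <your-cluster-name> is a placeholder for the kops cluster name, and again this should only matter when migrating across clusters:)
kubectl -n velero set env deployment/velero AWS_CLUSTER_NAME=<your-cluster-name>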
$ velero snapshot-location get -o yaml
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  creationTimestamp: 2019-10-24T19:20:56Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "34283503"
  selfLink: /apis/velero.io/v1/namespaces/velero/volumesnapshotlocations/default
  uid: 674a294f-f693-11e9-be69-028d8d40c278
spec:
  config:
    region: us-west-2
  provider: aws
status: {}
I did find this in cloudtrail:
{
"eventVersion": "1.05",
"userIdentity": {
"type": "AssumedRole",
"principalId": "**********:deb540d4-velero",
"arn": "arn:aws:sts::302270543666:assumed-role/velero/deb540d4-velero",
"accountId": "**********",
"accessKeyId": "**********",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "**********",
"arn": "arn:aws:iam::**********:role/velero",
"accountId": "**********",
"userName": "velero"
},
"webIdFederationData": {},
"attributes": {
"mfaAuthenticated": "false",
"creationDate": "2019-10-29T18:00:38Z"
}
}
},
"eventTime": "2019-10-29T18:07:40Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "CreateVolume",
"awsRegion": "us-west-2",
"sourceIPAddress": "AWS Internal",
"userAgent": "aws-sdk-go/1.13.12 (go1.12.9; linux; amd64)",
"requestParameters": {
"snapshotId": "snap-0cf45a07c3aaed683",
"zone": "us-west-2a",
"volumeType": "gp2",
"encrypted": true,
"tagSpecificationSet": {
"items": [
{
"resourceType": "volume",
"tags": [
{
"key": "Project",
"value": "i-0b66e221ccd74ecb6"
},
{
"key": "KubernetesCluster",
"value": "pdx.he.extrahop.com"
},
{
"key": "kubernetes.io/created-for/pvc/namespace",
"value": "tgrosinger"
},
{
"key": "kubernetes.io/created-for/pv/name",
"value": "pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "kubernetes.io/created-for/pvc/name",
"value": "myapp-storage"
},
{
"key": "Name",
"value": "pdx.he.extrahop.com-dynamic-pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "Username",
"value": "i-0b66e221ccd74ecb6"
},
{
"key": "kubernetes.io/cluster/pdx.he.extrahop.com",
"value": "owned"
},
{
"key": "velero.io/pv",
"value": "pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "velero.io/backup",
"value": "test4"
},
{
"key": "LastStateChange",
"value": "{\"device\": \"/dev/xvdbz\", \"user\": \"i-0b66e221ccd74ecb6\", \"time\": \"20191029T180351Z\"}"
},
{
"key": "velero.io/storage-location",
"value": "default"
}
]
}
]
}
},
"responseElements": {
"requestId": "581edd06-c265-4f42-a680-bcc19d2afd62",
"volumeId": "vol-0d3936dd80c79b674",
"size": "2",
"snapshotId": "snap-0cf45a07c3aaed683",
"zone": "us-west-2a",
"status": "creating",
"createTime": 1572372460000,
"volumeType": "gp2",
"iops": 100,
"encrypted": true,
"masterEncryptionKeyId": "arn:aws:kms:us-west-2:**********:key/93771dc4-4b8a-4773-a9d3-05eee0ec47bd",
"tagSet": {
"items": [
{
"key": "Project",
"value": "i-0b66e221ccd74ecb6"
},
{
"key": "KubernetesCluster",
"value": "pdx.he.extrahop.com"
},
{
"key": "kubernetes.io/created-for/pvc/namespace",
"value": "tgrosinger"
},
{
"key": "kubernetes.io/created-for/pv/name",
"value": "pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "kubernetes.io/created-for/pvc/name",
"value": "myapp-storage"
},
{
"key": "Name",
"value": "pdx.he.extrahop.com-dynamic-pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "Username",
"value": "i-0b66e221ccd74ecb6"
},
{
"key": "kubernetes.io/cluster/pdx.he.extrahop.com",
"value": "owned"
},
{
"key": "velero.io/pv",
"value": "pvc-0c45845b-fa76-11e9-be69-028d8d40c278"
},
{
"key": "velero.io/backup",
"value": "test4"
},
{
"key": "LastStateChange",
"value": "{\"device\": \"/dev/xvdbz\", \"user\": \"i-0b66e221ccd74ecb6\", \"time\": \"20191029T180351Z\"}"
},
{
"key": "velero.io/storage-location",
"value": "default"
}
]
}
},
"requestID": "581edd06-c265-4f42-a680-bcc19d2afd62",
"eventID": "d783e5b4-4218-45f0-ab60-d41a102ee1d6",
"eventType": "AwsApiCall",
"recipientAccountId": "**********"
}
So the CreateVolume call certainly appears to have been triggered... no other CloudTrail events for that volume?
None that I was able to find. But let me know if you have a specific query in mind that I should run.
I think you should be able to do something like this to look for any events related to that volume (I'm not a CloudTrail expert):
aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=vol-0d3936dd80c79b674
No, that is the only CloudTrail event for that resource name.
Hmm, that's very strange. Have you tried this a few times and gotten the same result every time?
@tgrosinger just checking in to see if you were able to get any more information or if there's anything else we can do to help. This is definitely not typical behavior, but without gathering some more info on what's happening to that volume, I'm at a loss.
@skriss, thank you for following up. I really appreciate the help so far!
Let me try again today from the top and get back to you.
I started again from scratch after upgrading to Velero 1.2.0 and the AWS plugin 1.0.0, but had the same results.
Just in case it is relevant, I am using kube2iam. I am confident this is working, though, because it successfully writes to the bucket and creates the snapshot.
Please let me know if you can think of any more troubleshooting steps or information I can provide.
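(Aside, since kube2iam came up: the role kube2iam hands to the Velero pod comes from the pod annotation, so it's worth confirming the deployment's pod template carries something along these lines and that the role has the EC2/KMS permissions as well as S3; the deployment name and ARN below are the usual defaults, not taken from this cluster:)
kubectl -n velero get deployment velero -o yaml | grep iam.amazonaws.com/role
# expected something like: iam.amazonaws.com/role: arn:aws:iam::<account-id>:role/velero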
Sorry for the "me too" but I just came across this and am experiencing almost the exact same behavior. I ran through all of your steps (including cloud trail) and produced almost identical results. The pod gets stuck "container creating" with PVCs/PVs that /claim/ to be fine but the EBS volume they claim to be attached simply does not exist in any region I can find. All the values that I checked in the cloud trail correlate with my account, region, etc.
Some variables to consider; for reference, I'm running these versions:
Client:
Version: v1.2.0
Git commit: -
Server:
Version: v1.1.0
If this is too much to follow as a single ticket or irrelevant, let me know and I'll create a separate issue.
I was finally able to make my Vault restore work properly. I'm pretty sure the problem was the KMS keys on the encrypted volumes. The symptom of an inaccessible (or wrong) KMS key is that AWS claims to be creating the volume but then aborts and generates no detectable error that I can find. The status goes from "Creating..." to "That volume does not exist."
I manually tested this by setting my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the same credentials that Velero was using, then running an AWS CLI command to create a volume:
aws ec2 create-volume --availability-zone us-east-1a --encrypted --kms-key-id arn:aws:kms:us-east-1:XXX:key/XXX --snapshot-id snap-XXX --volume-type gp2 --tag-specifications ResourceType=volume,Tags="[{Key=Name,Value=rmills-volume-restore-test}]"
I noticed the same behavior that Velero was producing.
I had two compounding problems: 1) I was using a different key than the one originally used for the encrypted snapshot volume, and 2) the KMS key in AWS was not allowing the Velero IAM user to use it. Once those problems were fixed, it started working.
Moral of the story: make sure your KMS keys are working and correct first, then debug Velero.
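(A couple of quick checks along those lines, assuming the AWS CLI is using the same credentials/role as Velero; the snapshot and key IDs here are placeholders:)
# which CMK was the snapshot encrypted with?
aws ec2 describe-snapshots --snapshot-ids snap-XXX --query 'Snapshots[].KmsKeyId'
# is that key enabled, and does its key policy let the Velero principal use it?
aws kms describe-key --key-id <key-arn-or-id>
aws kms get-key-policy --key-id <key-arn-or-id> --policy-name default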
@RichardMills thanks for that information! @tgrosinger does @RichardMills' explanation sound relevant for your scenario?
I'm going to close this out as inactive, feel free to reach out again as needed.
@skriss, thank you. Sorry I have not had a chance to get back to this. I will re-open if I try out the solution from Richard and am still having trouble.
@skriss just verified it is working after taking the suggestion from @RichardMills!
I needed to add kms:ReEncryptTo and kms:ReEncryptFrom for the key that is used by the Kubernetes cluster (which was also the key used to encrypt the snapshot).
Thanks both for your help.
@tgrosinger curious what your IAM policy looked like at the end of all of this?
FWIW, this is my full policy; excuse the Terraform variables for the bucket and KMS key ARNs.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "${aws_s3_bucket.me.arn}/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:GenerateDataKey*",
        "kms:Decrypt",
        "kms:ReEncryptTo",
        "kms:ReEncryptFrom"
      ],
      "Resource": [
        "${aws_kms_key.me.arn}"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "${aws_s3_bucket.me.arn}"
      ]
    }
  ]
}
Mine looks very similar. The only difference is that I did not have "kms:GenerateDataKey*" or "kms:Decrypt".
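(If I'm reading that right, the KMS statement in that variant would be roughly the following; the key ARN is a placeholder for the cluster's EBS encryption key:)
{
  "Effect": "Allow",
  "Action": [
    "kms:ReEncryptTo",
    "kms:ReEncryptFrom"
  ],
  "Resource": "arn:aws:kms:us-west-2:<account-id>:key/<key-id>"
}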