Velero: Restore of velero+restic backup error "no space left on device"

Created on 25 Feb 2020 · 8 Comments · Source: vmware-tanzu/velero

What steps did you take and what happened:

I'm backing up a PVC with velero+restic. When I check the size inside the original pod that has that PVC attached, I see the following:

Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1     16G   58M   16G   1% /var/lib/kafka/data

Then the snapshot of that PVC is taken and properly created (velero backup create migration-2-az-core --include-namespaces=xxxxxx --include-resources=pods,pvc,pv,sts,serviceaccount,secret,configmap -l 'app in (kafka,zookeeper)'), but when I mount the repo locally and check the size of that snapshot I see the following:

du -hsl snapshots/2020-02-24T16\:53\:04Z/data
19G snapshots/2020-02-24T16:53:04Z/data

The size increases to 19Gi, which is greater than the current capacity of the PVC, so restic is not able to restore it:

"pod volume restore failed: error restoring volume: error creating .velero directory for done file: mkdir /host_pods/a705eb10-0081-11ea-863d000d3a041e79/volumes/kubernetes.io~azure-disk/pvc-a329f082-0081-11ea-863d-000d3a041e79/.velero: no space left on device"

I can't understand what is happening during the backup process, but something is changing the size of all the files inside the snapshot.

What did you expect to happen:

Restore successfully applied.

The output of the following commands will help us better understand what's going on:

Anything else you would like to add:

More info: this especially impacts backups with a large number of files inside the PVC; each copied file in S3 is larger than the original:
Snapshot in S3:

du -hsl data/*
21M data/baikal.core.accesses-0
21M data/baikal.core.accesses-1
21M data/baikal.core.accesses-10
21M data/baikal.core.accesses-11
21M data/baikal.core.accesses-12
21M data/baikal.core.accesses-13
21M data/baikal.core.accesses-14
21M data/baikal.core.accesses-15
21M data/baikal.core.accesses-16
21M data/baikal.core.accesses-17
21M data/baikal.core.accesses-18
21M data/baikal.core.accesses-19
21M data/baikal.core.accesses-2
and so on...

PVC in k8s cluster:

du -hsl /var/lib/kafka/data/*
12K /var/lib/kafka/data/__consumer_offsets-0
8.0K    /var/lib/kafka/data/__consumer_offsets-1
20K /var/lib/kafka/data/__consumer_offsets-10
1.1M    /var/lib/kafka/data/__consumer_offsets-11
12K /var/lib/kafka/data/__consumer_offsets-12
8.0K    /var/lib/kafka/data/__consumer_offsets-13
8.0K    /var/lib/kafka/data/__consumer_offsets-14
12K /var/lib/kafka/data/__consumer_offsets-15
20K /var/lib/kafka/data/__consumer_offsets-16
48K /var/lib/kafka/data/__consumer_offsets-17
8.0K    /var/lib/kafka/data/__consumer_offsets-18
20K /var/lib/kafka/data/__consumer_offsets-19
32K /var/lib/kafka/data/__consumer_offsets-2
...

Environment:

  • Velero version: 1.1.0
  • Velero features:
  • Kubernetes version (use kubectl version):
    Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:16Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.3 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
Labels: Needs investigation, Restic

All 8 comments

@esierrapena Thank you for opening this issue.
Can you please share the logs from the velero pod?
Also, how easy is this to reproduce? If it is easy, can you please share with us the steps to reproduce it?

It always happens with the following scenario:

  • Kubernetes cluster deployed in AWS
  • Restic and velero version 1.1.0
  • Backup storage location: s3, us-east-1

Find the logs here: https://pastebin.com/654CefjH

These are the steps that I've performed to reach this situation:

  1. Deploy restic (velero already deployed):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  creationTimestamp: null
  labels:
    app: restic
  name: restic
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: restic
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: restic
    spec:
      containers:
      - args:
        - restic
        - server
        command:
        - /velero
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: VELERO_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        image: gcr.io/heptio-images/velero:v1.1.0
        imagePullPolicy: IfNotPresent
        name: restic
        resources: {}
        volumeMounts:
        - mountPath: /host_pods
          mountPropagation: HostToContainer
          name: host-pods
        - mountPath: /scratch
          name: scratch
        - mountPath: /credentials
          name: cloud-credentials
      securityContext:
        runAsUser: 0
      serviceAccountName: velero
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pods
        name: host-pods
      - emptyDir: {}
        name: scratch
      - name: cloud-credentials
        secret:
          secretName: velero
  updateStrategy: {}
  2. Annotate pods with attached volumes to be copied:

kubectl -n ${namespace_core} annotate ${pod_in_sts} backup.velero.io/backup-volumes=${volume_sts_name}

  3. Create backup:

velero backup create migration-2-az-core --include-namespaces=xxxxxxxxx --include-resources=pods,pvc,pv,sts,serviceaccount,secret,configmap -l 'app in (kafka,zookeeper)'

  4. How to check:

kubectl exec -it kafka-0 -n esierra-default-aws-dev -- df -h /var/lib/kafka/data
restic mount -r s3:s3-us-east-1.amazonaws.com/baikal-snapshots-987091454837/restic/esierra-default-aws-dev /tmp/restic/
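
A couple of extra checks that may also help here (not part of the original steps; the namespace is whichever one Velero is installed in):

# Detailed status for the backup, including the per-volume restic backups.
velero backup describe migration-2-az-core --details
# List the PodVolumeBackup objects restic created for this backup.
kubectl -n <velero-namespace> get podvolumebackups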

Same result after updating to version 1.2.0.

Hi,

After an investigation, here is what I found:

du by default only shows the physical size of each file. You can use --apparent-size to get the logical size.

In your example, it is due to sparse files managed by Kafka. On each topic partition there are at least two of these files:

  • *.timeindex
  • *.index

Both files are preallocated to the size defined by the property segment.index.bytes, which defaults to 10485760 bytes.

Because Restic reads the files through the application (filesystem) layer, it processes the logical size.

As a quick workaround, please run du --apparent-size /var/lib/kafka/data to get the logical size of your Kafka data and make sure that the target PVC is at least that large.
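
To illustrate the difference, here is a minimal sketch (any Linux box with GNU coreutils; the file name is made up) of how a sparse file reports a tiny physical size but a full logical size, which is what restic then writes out:

# Create a 10 MiB sparse file: no data blocks are allocated.
truncate -s 10M demo.index
du -hs demo.index                   # physical size: ~0
du -hs --apparent-size demo.index   # logical size: 10M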

As future work, this issue may be transferred / recreated in the Restic repository, asking for support for restoring sparse files.

Adding this discussion about sparse file support from the Restic forum to the thread:

https://forum.restic.net/t/sparse-file-support/1264/
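
Until restic restores sparse files natively, one possible mitigation (a sketch, not tested in this issue; the path and file patterns assume the Kafka layout above, and util-linux fallocate must be available in the target image) is to restore onto a PVC large enough for the logical size and then punch the holes back afterwards:

# Re-sparsify the mostly-zero Kafka index files once the restore has completed.
find /var/lib/kafka/data -type f \( -name '*.index' -o -name '*.timeindex' \) \
  -exec fallocate --dig-holes {} \;

Note that this only reclaims space after the fact; the restore itself still needs room for the full logical size.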

From Restic repository:

Thanks @jpfe-tid, you were spot on!

Thanks @jpfe-tid for the investigation here!
@esierrapena I think we can close this issue.
Feel free to re-open if you need more help.
