What steps did you take and what happened:
I'm backing up a PVC with Velero + restic. When I check the size inside the original pod that has the PVC attached, I see the following:
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 16G 58M 16G 1% /var/lib/kafka/data
Then, the snapshot of that PVC is created successfully (velero backup create migration-2-az-core --include-namespaces=xxxxxx --include-resources=pods,pvc,pv,sts,serviceaccount,secret,configmap -l 'app in (kafka,zookeeper)'), but when I mount the repo locally and check the size of that snapshot, I see the following:
du -hsl snapshots/2020-02-24T16\:53\:04Z/data
19G snapshots/2020-02-24T16:53:04Z/data
The size increases to 19Gi, which is greater than the current capacity of the PVC, and as a result restic is not able to restore it:
"pod volume restore failed: error restoring volume: error creating .velero directory for done file: mkdir /host_pods/a705eb10-0081-11ea-863d000d3a041e79/volumes/kubernetes.io~azure-disk/pvc-a329f082-0081-11ea-863d-000d3a041e79/.velero: no space left on device"
I can't understand what is happening during the backup process, but something is changing the size of all the files inside the snapshot.
What did you expect to happen:
Restore successfully applied.
Anything else you would like to add:
More info: this especially impacts backups of PVCs with a large number of files inside; the size of each copied file in S3 is inflated:
Snapshot in S3:
du -hsl data/*
21M data/baikal.core.accesses-0
21M data/baikal.core.accesses-1
21M data/baikal.core.accesses-10
21M data/baikal.core.accesses-11
21M data/baikal.core.accesses-12
21M data/baikal.core.accesses-13
21M data/baikal.core.accesses-14
21M data/baikal.core.accesses-15
21M data/baikal.core.accesses-16
21M data/baikal.core.accesses-17
21M data/baikal.core.accesses-18
21M data/baikal.core.accesses-19
21M data/baikal.core.accesses-2
and so on...
PVC in k8s cluster:
du -hsl /var/lib/kafka/data/*
12K /var/lib/kafka/data/__consumer_offsets-0
8.0K /var/lib/kafka/data/__consumer_offsets-1
20K /var/lib/kafka/data/__consumer_offsets-10
1.1M /var/lib/kafka/data/__consumer_offsets-11
12K /var/lib/kafka/data/__consumer_offsets-12
8.0K /var/lib/kafka/data/__consumer_offsets-13
8.0K /var/lib/kafka/data/__consumer_offsets-14
12K /var/lib/kafka/data/__consumer_offsets-15
20K /var/lib/kafka/data/__consumer_offsets-16
48K /var/lib/kafka/data/__consumer_offsets-17
8.0K /var/lib/kafka/data/__consumer_offsets-18
20K /var/lib/kafka/data/__consumer_offsets-19
32K /var/lib/kafka/data/__consumer_offsets-2
...
Environment:

@esierrapena Thank you for opening this issue.
Can you please share the logs from the velero pod?
Also, how easy is this to reproduce? If it is easy, can you please share the reproduction steps with us?
It always happens with the following scenario:
Find the logs here: https://pastebin.com/654CefjH
These are the steps that I've performed to reach this situation:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  creationTimestamp: null
  labels:
    app: restic
  name: restic
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: restic
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: restic
    spec:
      containers:
      - args:
        - restic
        - server
        command:
        - /velero
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: VELERO_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        image: gcr.io/heptio-images/velero:v1.1.0
        imagePullPolicy: IfNotPresent
        name: restic
        resources: {}
        volumeMounts:
        - mountPath: /host_pods
          mountPropagation: HostToContainer
          name: host-pods
        - mountPath: /scratch
          name: scratch
        - mountPath: /credentials
          name: cloud-credentials
      securityContext:
        runAsUser: 0
      serviceAccountName: velero
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pods
        name: host-pods
      - emptyDir: {}
        name: scratch
      - name: cloud-credentials
        secret:
          secretName: velero
  updateStrategy: {}
kubectl -n ${namespace_core} annotate ${pod_in_sts} backup.velero.io/backup-volumes=${volume_sts_name}
velero backup create migration-2-az-core --include-namespaces=xxxxxxxxx --include-resources=pods,pvc,pv,sts,serviceaccount,secret,configmap -l 'app in (kafka,zookeeper)'
kubectl exec -it kafka-0 -n esierra-default-aws-dev -- df -h /var/lib/kafka/data
restic mount -r s3:s3-us-east-1.amazonaws.com/baikal-snapshots-987091454837/restic/esierra-default-aws-dev /tmp/restic/
Same result after updating to version 1.2.0.
Hi,
After an investigation, here is what I found:
du by default only shows physical size of each file. You can use --apparent-size to get the logical size.
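The difference is most visible on sparse files, as in this minimal sketch (hypothetical file name, assuming GNU coreutils):

truncate -s 10M sparse.index           # preallocates a 10 MiB sparse file without writing any data blocks
du -h sparse.index                     # physical size: 0 (no blocks allocated)
du -h --apparent-size sparse.index     # logical size: 10M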
In your example, it is due to sparse files managed by Kafka. On each topic partition there are at least two of these files: the offset index (*.index) and the time index (*.timeindex).
Both files are sized by the property segment.index.bytes and default to 10485760 bytes.
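That would also explain the numbers above: two preallocated 10485760-byte index files come to roughly 21 MB of apparent size per partition, which matches the 21M that shows up for each baikal.core.accesses-* directory in the snapshot, even though the physical usage inside the PVC is only a few kilobytes.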
Because restic reads each file through the filesystem (application) layer rather than at the block level, it processes the logical size.
As a quick workaround, please run du --apparent-size /var/lib/kafka/data to get the logical size of your Kafka setup and make sure that the target PVC is equal to or larger than that.
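Applied to this issue, the comparison would look something like this (pod and namespace taken from the commands above; assuming GNU du is available inside the container):

kubectl exec -it kafka-0 -n esierra-default-aws-dev -- du -sh /var/lib/kafka/data                  # physical size
kubectl exec -it kafka-0 -n esierra-default-aws-dev -- du -sh --apparent-size /var/lib/kafka/data  # logical size

The second number is what restic will write out on restore, so the target PVC should be sized against it.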
As future work, this issue may be transferred / recreated in the Restic repository, asking for support for restoring sparse files.
Adding this discussion about sparse file support from Restic forum to the thread.
From Restic repository:
Adding the feature request for backing up / restoring sparse files:
Adding the pull request where support is being added:
Thanks @jpfe-tid, you were spot on!
Thanks @jpfe-tid for the investigation here!
@esierrapena I think we can close this issue.
Feel free to re-open if you need more help.