Velero: Backups are failing if there is a large number of k8s resources

Created on 22 Oct 2019 · 6 comments · Source: vmware-tanzu/velero

What steps did you take and what happened:
[A clear and concise description of what the bug is, and what commands you ran.]

  • Create a backup of a namespace that contains more than 1000 k8s resources (configmaps, in our case) either by running velero backup create or as a scheduled backup
  • Roughly with a 50% probability, the backup will fail. Running velero backup describe only reveals the Failed backup status, containing no other information about the root cause of the problem. Running velero backup logs returns a log that seems totally fine. No errors, no warnings, just listing the resources being backed up.
  • Velero service logs contain this line: level=error msg="backup failed" controller=backup error="[rpc error: code = Unknown desc = EOF, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: ]" key=/ logSource="pkg/controller/backup_controller.go:230".
  • There are only 2 files in the blob storage where the backup is getting uploaded: the archived log file and velero-backup.json. Everything else is missing, including the backup itself.
  • Reducing the number of k8s resources by, say, half fixes the problem. If it rises back to about 1000, backups start failing intermittently again.
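The failure scenario above can be sketched as a repro script. This is a hedged sketch: the namespace name `backup-stress`, the ConfigMap naming scheme, and the backup name are illustrative assumptions; the issue only states "more than 1000 k8s resources (configmaps, in our case)".

```shell
# Generate N ConfigMap manifests for a stress-test namespace.
# (Namespace "backup-stress" and names "cm-$i" are illustrative, not from the issue.)
gen_configmaps() {
  local count="$1"
  for i in $(seq 1 "$count"); do
    cat <<EOF
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-$i
  namespace: backup-stress
data:
  key: value-$i
EOF
  done
}

# Usage (requires a cluster and the velero CLI):
#   kubectl create namespace backup-stress
#   gen_configmaps 1000 | kubectl apply -f -
#   velero backup create stress-test --include-namespaces backup-stress
```

Per the report, a backup of such a namespace fails roughly half the time on Velero 1.0.0.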

What did you expect to happen:
Backups succeed for a large number of k8s resources.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Velero version (use velero version): 1.0.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): 1.13.10
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: Microsoft Azure
  • OS (e.g. from /etc/os-release): Linux (ubuntu 16.04)

All 6 comments

@skhalash did you try increasing the resource limits for the velero deployment?

Thank you @skriss!
I had the same problem with Velero v1.1.0 -- a backup of an explicit list of namespaces worked, but not for "*" namespaces.
Increasing the memory limit from 256M to 1GB helped. Backups run stably now.
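For reference, the memory bump described above corresponds to a resources fragment like the following on the velero Deployment's container spec. The limit values are the ones mentioned in this thread; the request value is an illustrative assumption.

```yaml
# Fragment of the velero Deployment's container spec
resources:
  requests:
    memory: 512Mi   # illustrative; not specified in the thread
  limits:
    memory: 1Gi     # raised from 256M as described in the comment above
```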

It sounds like we're resolved here, so closing this out. Feel free to reach out again if tuning the Velero requests/limits doesn't help.

@skriss Getting this exact error, but I don't have any limit set on my containers. I even tried setting a limit higher than what the process is using when crashing, but it doesn't help. When the backup process stops with the error level=error msg="backup failed" controller=backup error="rpc error: code = Unknown desc = EOF" key=velero/test logSource="pkg/controller/backup_controller.go:265", the process' memory usage has reached between 950MB and 1100MB. There is no ResourceQuota either.

Inspecting the container from the CRI doesn't show any limit applied to it. There is nothing in the worker's dmesg, and nothing in the backup logs, as stated.
I can reproduce the error with the official velero images 1.2.0, 1.3.0, and 1.3.2.

Kubernetes version is 1.16, containerd 1.2.10, Linux 5.2.17

@guillaumefenollar did you find a solution/workaround for your problem?

@guillaumefenollar did you find a solution/workaround for your problem?

I excluded events from my backups and they're passing now:

template:
  excludedResources:
  - events
  - events.k8s.io

Not 100% sure this was the only necessary step to make them work though .. :-/
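For context, the excludedResources fragment above belongs inside a Schedule's backup template (or directly in a Backup spec). A minimal hedged example of the full object follows; the schedule name and cron expression are illustrative assumptions, not from the thread.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup        # illustrative name
  namespace: velero
spec:
  schedule: "0 2 * * *"     # illustrative cron expression
  template:
    excludedResources:
    - events
    - events.k8s.io
```

Excluding events can shrink the backup considerably, since clusters routinely accumulate large numbers of Event objects.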

