What steps did you take and what happened:
We have a CI/CD job which takes a backup of the cluster and then restores from the backup. Almost half of the time, the backup ends up with this failure:
time="2019-09-06T13:52:30Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:529"
time="2019-09-06T13:52:30Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>]" key=kyma-system/6f27c6d4-1c32-4c43-8e6b-55f213761efa logSource="pkg/controller/backup_controller.go:230"
Any idea why this happens and is there anything we can do to prevent this?
Anything else you would like to add:
Here is the backup file we use:
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kyma-backup
  namespace: kyma-system
spec:
  includedNamespaces:
  - '*'
  includedResources:
  - '*'
  includeClusterResources: true
  storageLocation: default
  volumeSnapshotLocations:
  - default
We just deploy this file to the cluster using kubectl apply -f.
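For completeness, the deploy step is just the following (the filename here is hypothetical; it isn't named in our job):
$ kubectl apply -f kyma-backup.yaml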
Environment:
- Velero version (use velero version): 1.0.0
- Kubernetes version (use kubectl version): 1.13.9-gke.3
- OS (e.g. from /etc/os-release):
I believe the gRPC error comes from trying to talk to a plugin. If you add the --log-level debug flag to your velero server Pod, we might be able to get more info about what's happening here.
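If it helps, a minimal sketch of where that flag goes, assuming the stock Deployment layout produced by velero install (container name and args may differ in the Kyma chart):
# excerpt of the velero server Deployment spec
spec:
  template:
    spec:
      containers:
      - name: velero
        args:
        - server
        - --log-level
        - debug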
Looks like the same issue as #481.
It's strange because we are not using any plugins.
Here are the error logs after I set the log level as debug:
$ kubectl logs backup-75d69b8644-fjlz7 -n kyma-system | grep error
time="2019-09-10T14:03:28Z" level=error msg="reading plugin stderr" cmd=/velero controller=backup-sync error="read |0: file already closed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:89" pluginName=velero
time="2019-09-10T14:03:35Z" level=error msg="reading plugin stderr" backup=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 cmd=/velero error="read |0: file already closed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:89" pluginName=velero
time="2019-09-10T14:03:35Z" level=debug msg="plugin process exited" backup=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 cmd=/velero error="signal: killed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/velero pid=136
Failed to fire hook: object logged as error does not satisfy error interface
time="2019-09-10T14:03:36Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 logSource="pkg/controller/backup_controller.go:230"
@suleymanakbas91 good to know you don't have any external plugins, but Velero does include a set of default plugins so this is coming from those.
cc @skriss @nrb @carlisia any ideas?
@suleymanakbas91 do you have CPU/mem requests/limits defined for your Velero deployment?
@skriss we don't specify any. You can also check out the helm chart we use from here: https://github.com/kyma-project/kyma/tree/master/resources/backup
We have run into this as well. I just tried removing the CPU/mem limits for the deployment, and it looks much more stable.
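For reference, a sketch of dropping the resources block without editing the manifest by hand (the velero namespace is an assumption; it's kyma-system in the original report):
$ kubectl -n velero patch deployment velero --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources"}]'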
@Crevil @suleymanakbas91 are you able to see how much CPU/mem the Velero Pod is using?
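If metrics-server is running in the cluster, something like this should show it (namespace assumed again):
$ kubectl top pod -n velero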
Sure thing @prydonius

[screenshot: Velero Pod CPU/memory usage graph]
We run backups every hour, as seen by the spikes.
@Crevil thanks for that! Were you previously using the default requests/limits provided by velero install? Your usage definitely goes over those defaults, could you tell us a little more about what you're including in your backups (e.g. no. of resources, whether you're using restic or not, and volume sizes if using restic)? Just trying to gauge if it makes sense to increase the default req/limits.
I’ll get back to you on Monday with more details.
We were using limits as specified from the install command. Here is a shortened version of what we were running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
spec:
  template:
    spec:
      containers:
      - name: velero
        resources:
          limits:
            cpu: "1"
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 128Mi
We are running on AWS with a self-managed k8s cluster. I inspected one of our backups: we have around 4600 resources and 12 volumes with EC2 Snapshots.
We are not using restic.
Let me know if you need anything else.
Really appreciate the info @Crevil. Strange, your backups are smaller than this user's (https://github.com/heptio/velero/issues/94#issuecomment-514561793), though their resource usage was lower.
It's difficult to come up with a baseline that works for everyone, our best recommendation would be to monitor resource usage and set appropriate reqs/limits for your environment. Has the Pod remained stable since removing the default reqs/limits?
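For anyone else tuning this, a sketch of setting custom values with kubectl set resources (namespace and values illustrative):
$ kubectl -n velero set resources deployment/velero \
    --requests=cpu=500m,memory=256Mi \
    --limits=cpu=1,memory=512Mi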
@suleymanakbas91 are you still experiencing this issue?
It looks stable now, yes. We'll set up an alert on resource consumption so we're warned in the future.
Closing this out as inactive. Feel free to reach out again as needed.
I'm trying to create a Velero backup with the CSI feature enabled on my Azure cluster, following the instructions from the documentation.
I observed that the backup actually completed but then failed at the end with the same transport error mentioned in this thread. Here are the log events I observed on the velero pod:
time="2020-06-17T10:59:14Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
time="2020-06-17T10:59:14Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/csi-try4 logSource="pkg/controller/backup_controller.go:273"
Any idea on what went wrong during the backup?
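For context, CSI support in this Velero release is gated behind the EnableCSI feature flag on both the server and the client. A sketch of enabling it on an existing install (the namespace and the patch approach are assumptions, not taken from this thread):
$ kubectl -n velero patch deployment velero --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--features=EnableCSI"}]'
$ velero client config set features=EnableCSI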
Same for me on Azure AKS with advanced networking.
Everything seems to proceed as expected (backup to the Storage Account, snapshots); just the finalization seems to break.
velero time="2020-06-18T15:13:17Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
velero time="2020-06-18T15:13:17Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/cpp-qa-test logSource="pkg/controller/backup_controller.go:273"
Just to be complete:
Velero: 1.4.0
Azure-Plugin for Velero: 1.1.0
I would first try increasing the memory limit on the Velero deployment. There may be a couple of defaults that aren't playing nice together. Let us know if that fixes things!
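A sketch of bumping the limits in place with a strategic merge patch (namespace and values illustrative; the patch merges on the container name):
$ kubectl -n velero patch deployment velero --patch '
spec:
  template:
    spec:
      containers:
      - name: velero
        resources:
          limits:
            cpu: "2"
            memory: 1Gi'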
Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now...
Thanks a lot for your fast reply!
Same here.
AKS 1.16
Velero 1.4
I can confirm: increasing the limits fixed the issue.
@vmware-tanzu/velero-maintainers I'm guessing we should lower the value for this setting. I set it at 100MB since that's the max Azure allows, which means Velero will create the minimum number of chunks, but I think it's causing Velero to exceed its default limits regularly.
We could probably drop the chunk size down to something significantly smaller and it wouldn't have much impact on most users since their backups will be way under 100MB; users with very large backups can tune it.
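For readers on newer versions: the Azure plugin later exposed this as a BackupStorageLocation config key. A sketch, assuming the blockSizeInBytes key and illustrative Azure values:
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero                 # illustrative blob container name
  config:
    resourceGroup: my-rg           # illustrative
    storageAccount: mystorageacct  # illustrative
    blockSizeInBytes: "10485760"   # e.g. 10MiB instead of the 100MiB default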
Increasing the limits solved my problem as well.
Velero: 1.5.1
Azure plugin: 1.1.0