What steps did you take and what happened:
We have a CI/CD job which takes a backup of the cluster and then restores from the backup. Almost half of the time, the backup ends up with this failure:
time="2019-09-06T13:52:30Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:529"
time="2019-09-06T13:52:30Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>]" key=kyma-system/6f27c6d4-1c32-4c43-8e6b-55f213761efa logSource="pkg/controller/backup_controller.go:230"
Any idea why this happens and is there anything we can do to prevent this?
Anything else you would like to add:
Here is the backup file we use:
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kyma-backup
  namespace: kyma-system
spec:
  includedNamespaces:
  - '*'
  includedResources:
  - '*'
  includeClusterResources: true
  storageLocation: default
  volumeSnapshotLocations:
  - default
We just deploy this file to the cluster using kubectl apply -f.
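For completeness, the deploy step is just the following (the filename here is hypothetical; it isn't named in our job):
$ kubectl apply -f kyma-backup.yaml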
Environment:
- Velero version (use velero version): 1.0.0
- Kubernetes version (use kubectl version): 1.13.9-gke.3
- OS (e.g. from /etc/os-release):
I believe the gRPC error comes from trying to talk to a plugin. If you add the --log-level debug flag to your velero server Pod, we might be able to get more info about what's happening here.
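If it helps, a minimal sketch of where that flag goes, assuming the stock Deployment layout produced by velero install (container name and args may differ in the Kyma chart):
# excerpt of the velero server Deployment spec
spec:
  template:
    spec:
      containers:
      - name: velero
        args:
        - server
        - --log-level
        - debug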
Looks like the same issue as #481.
It's strange because we are not using any plugins.
Here are the error logs after I set the log level as debug:
$ kubectl logs backup-75d69b8644-fjlz7 -n kyma-system | grep error
time="2019-09-10T14:03:28Z" level=error msg="reading plugin stderr" cmd=/velero controller=backup-sync error="read |0: file already closed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:89" pluginName=velero
time="2019-09-10T14:03:35Z" level=error msg="reading plugin stderr" backup=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 cmd=/velero error="read |0: file already closed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:89" pluginName=velero
time="2019-09-10T14:03:35Z" level=debug msg="plugin process exited" backup=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 cmd=/velero error="signal: killed" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/velero pid=136
Failed to fire hook: object logged as error does not satisfy error interface
time="2019-09-10T14:03:36Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=kyma-system/0e9fce5e-309e-4afd-bda0-d5b621b56cc5 logSource="pkg/controller/backup_controller.go:230"
@suleymanakbas91 good to know you don't have any external plugins, but Velero does include a set of default plugins so this is coming from those.
cc @skriss @nrb @carlisia any ideas?
@suleymanakbas91 do you have CPU/mem requests/limits defined for your Velero deployment?
@skriss we don't specify any. You can also check out the helm chart we use from here: https://github.com/kyma-project/kyma/tree/master/resources/backup
We have run into this as well. I just tried removing the CPU/mem limits for the deployment, and it looks much more stable.
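For reference, a sketch of dropping the resources block without editing the manifest by hand (the velero namespace is an assumption; it's kyma-system in the original report):
$ kubectl -n velero patch deployment velero --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources"}]'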
@Crevil @suleymanakbas91 are you able to see how much CPU/mem the Velero Pod is using?
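If metrics-server is running in the cluster, something like this should show it (namespace assumed again):
$ kubectl top pod -n velero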
Sure thing @prydonius

[screenshot: Velero Pod CPU/memory usage graph]
We run backups every hour, as seen by the spikes.
@Crevil thanks for that! Were you previously using the default requests/limits provided by velero install? Your usage definitely goes over those defaults, could you tell us a little more about what you're including in your backups (e.g. no. of resources, whether you're using restic or not, and volume sizes if using restic)? Just trying to gauge if it makes sense to increase the default req/limits.
I’ll get back to you on Monday with more details.
We were using limits as specified from the install command. Here is a shortened version of what we were running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
spec:
  template:
    spec:
      containers:
      - name: velero
        resources:
          limits:
            cpu: "1"
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 128Mi
We are running on AWS with a self-managed k8s cluster. I inspected one of our backups: we have around 4600 resources and 12 volumes with EC2 Snapshots.
We are not using restic.
Let me know if you need anything else.
Really appreciate the info @Crevil. Strange, your backups are smaller than this user's (https://github.com/heptio/velero/issues/94#issuecomment-514561793), though their resource usage was lower.
It's difficult to come up with a baseline that works for everyone, our best recommendation would be to monitor resource usage and set appropriate reqs/limits for your environment. Has the Pod remained stable since removing the default reqs/limits?
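For anyone else tuning this, a sketch of setting custom values with kubectl set resources (namespace and values illustrative):
$ kubectl -n velero set resources deployment/velero \
    --requests=cpu=500m,memory=256Mi \
    --limits=cpu=1,memory=512Mi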
@suleymanakbas91 are you still experiencing this issue?
It looks stable now, yes. We'll set up an alert on resource consumption so we're warned in the future.
Closing this out as inactive. Feel free to reach out again as needed.
I'm trying to create a Velero backup with the CSI feature enabled on my Azure cluster, following the instructions from the documentation.
I observed that the backup actually completed but then failed at the end with the same transport error mentioned in this thread. Here are the log events I observed on the velero pod:
time="2020-06-17T10:59:14Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
time="2020-06-17T10:59:14Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/csi-try4 logSource="pkg/controller/backup_controller.go:273"
Any idea on what went wrong during the backup?
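For context, CSI support in this Velero release is gated behind the EnableCSI feature flag on both the server and the client. A sketch of enabling it on an existing install (the namespace and the patch approach are assumptions, not taken from this thread):
$ kubectl -n velero patch deployment velero --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--features=EnableCSI"}]'
$ velero client config set features=EnableCSI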
Same for me on Azure AKS with advanced networking.
Everything seems to proceed as expected (backup to the Storage Account, snapshots); just the finalization seems to break.
velero time="2020-06-18T15:13:17Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
velero time="2020-06-18T15:13:17Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/cpp-qa-test logSource="pkg/controller/backup_controller.go:273"
Just to be complete:
Velero: 1.4.0
Azure-Plugin for Velero: 1.1.0
I would first try increasing the memory limit on the Velero deployment. There may be a couple of defaults that aren't playing nice together. Let us know if that fixes things!
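A sketch of bumping the limits in place with a strategic merge patch (namespace and values illustrative; the patch merges on the container name):
$ kubectl -n velero patch deployment velero --patch '
spec:
  template:
    spec:
      containers:
      - name: velero
        resources:
          limits:
            cpu: "2"
            memory: 1Gi'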
Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now...
Thanks a lot for your fast reply!
Same here.
AKS 1.16
Velero 1.4
I can confirm: increasing the limits fixed the issue.
@vmware-tanzu/velero-maintainers I'm guessing we should lower the value for this setting. I set it at 100MB since that's the max Azure allows, which means Velero will create the minimum number of chunks, but I think it's causing Velero to exceed its default limits regularly.
We could probably drop the chunk size down to something significantly smaller and it wouldn't have much impact on most users since their backups will be way under 100MB; users with very large backups can tune it.
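For readers on newer versions: the Azure plugin later exposed this as a BackupStorageLocation config key. A sketch, assuming the blockSizeInBytes key and illustrative Azure values:
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero                 # illustrative blob container name
  config:
    resourceGroup: my-rg           # illustrative
    storageAccount: mystorageacct  # illustrative
    blockSizeInBytes: "10485760"   # e.g. 10MiB instead of the 100MiB default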
Increasing the limits solved my problem as well.
Velero: 1.5.1
Azure plugin: 1.1.0