What steps did you take and what happened:
We're running backups on a schedule and have observed that the memory usage of the Velero Pod keeps increasing. Backups and restores start failing once memory usage reaches the memory limit of the Velero Pod (we're using velero install's default memory limit of 256Mi). Is there a possible memory leak, since the Velero Pod's memory usage keeps increasing without bound while these scheduled backups run?
This issue is closely related to #780. Below is a Grafana dashboard tracking the memory usage of the Velero pod. Memory usage keeps increasing until the mem limit of 256Mi is reached and at this point we observe backups and restores start failing due to OOM issues.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`

Environment:
- `velero version`: v1.2.0
- `velero client config get features`:
- `kubectl version`:
- `/etc/os-release`:

@betta1 do you know if this is a new issue with v1.2 or whether it existed in v1.1 as well? If you have this info, it would help with debugging.
Also - can you do the following:
- `kubectl -n velero port-forward deployment/velero 6060:6060`
- `go tool pprof http://localhost:6060/debug/pprof/heap`
- run `top` and dump the output here

I think it existed in v1.1 as well, since we had scheduled backups running fine for a week or so and then started seeing intermittent failures for both backups and restores.
Hey Kriss,
I am working with Antony on this issue right now. Please take a look at the below output of pprof.
$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/cheny48/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: velero
Type: inuse_space
Time: Nov 15, 2019 at 9:21am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 81.31MB, 93.66% of 86.81MB total
Showing top 10 nodes out of 131
flat flat% sum% cum cum%
61.70MB 71.08% 71.08% 77.16MB 88.89% compress/flate.NewWriter
14.96MB 17.24% 88.32% 14.96MB 17.24% compress/flate.(*compressor).initDeflate
1MB 1.15% 89.47% 1MB 1.15% runtime.malg
0.64MB 0.73% 90.20% 0.64MB 0.73% github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*ClientConn).frameScratchBuffer
0.50MB 0.58% 90.78% 0.50MB 0.58% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
0.50MB 0.58% 91.36% 0.50MB 0.58% compress/flate.(*huffmanEncoder).generate
0.50MB 0.58% 91.93% 0.50MB 0.58% github.com/vmware-tanzu/velero/pkg/restore.prioritizeResources
0.50MB 0.58% 92.51% 0.50MB 0.58% compress/flate.newHuffmanBitWriter
0.50MB 0.58% 93.09% 0.50MB 0.58% fmt.Sprintf
0.50MB 0.58% 93.66% 1MB 1.15% github.com/vmware-tanzu/velero/pkg/restore.resolveActions
(pprof)
I'll run more backups and provide the output once it hits the memory limit.
Hi Kriss,
This is the output when Velero pod reaches its memory limit.
$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
File: velero
Type: inuse_space
Time: Nov 19, 2019 at 11:46am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 209.56MB, 93.23% of 224.78MB total
Dropped 104 nodes (cum <= 1.12MB)
Showing top 10 nodes out of 85
flat flat% sum% cum cum%
153.37MB 68.23% 68.23% 188.05MB 83.66% compress/flate.NewWriter
34.18MB 15.21% 83.44% 34.18MB 15.21% compress/flate.(*compressor).initDeflate
5MB 2.23% 85.67% 5MB 2.23% fmt.Sprintf
4MB 1.78% 87.45% 4MB 1.78% github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*ClientConn).newStream
3.50MB 1.56% 89.00% 3.50MB 1.56% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneHeader
3MB 1.34% 90.34% 3MB 1.34% reflect.New
2MB 0.89% 91.23% 5.50MB 2.45% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneRequest
1.50MB 0.67% 91.90% 1.50MB 0.67% runtime.malg
1.50MB 0.67% 92.56% 1.50MB 0.67% github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*clientConnReadLoop).handleResponse
1.50MB 0.67% 93.23% 1.50MB 0.67% net/url.parse
@Frank51 @betta1 do you have any more info on this issue since the last update? I'm going to spend some time digging into it this week.
So far we don't have any other info about that.
Are you using pre/post hooks in your environment? As best I can tell, this issue is related to that code, and it looks like if I disable hooks, the leak doesn't exist.
Could you give more detail about what hooks mean? I think I just set the memory limit to 256 MB.
A hook in velero is a command you run in a workload pod before/after backing it up, typically to quiesce/unquiesce the application - see https://velero.io/docs/v1.2.0/hooks/.
If you're using the nginx-example workload, it has hooks configured. Basically any pod with the annotation key pre.hook.backup.velero.io/command.
So, does that mean that if I disable the hook before running the container in the pod, this memory leak issue will be solved?
If I disable hooks, then I don't see the memory leak issue. Just wondering if you can confirm the same for debugging purposes.
It seems like this leak is coming from either the Kube libraries or the https://github.com/docker/spdystream dependency that's used for executing hooks. SPDY is deprecated, so we need to look into whether there's another approach to implementing this.
@skriss we were not running any pre/post hooks when we hit this issue. One possible way to reproduce this is to frequently run backups and restores on a schedule -- we observed the memory usage of the Velero Pod keeps increasing until it reaches the memory limit of the Pod when backups/restores start consistently failing.
we were not running any pre/post hooks when we hit this issue.
Hmm, OK -- the signs I saw in the profiles pointed to an issue with the hook code, but sounds like that shouldn't be the case. We'll have to dig further.
Hi @betta1 and @Frank51 👋
I did some more digging into this but I wasn't able to find a memory leak. Here is what I ran:
I had Velero running on a KinD cluster using the velero-plugin-for-aws, with S3 as the object store.
In the experiment, the created backups had a tarball size of ~100K: a full backup of the cluster, including 5 NGINX deployments as workload.
There were no volumes that needed to be backed up.
There were no restore operations running during my experiment.
As part of the experiment, backups were created every 10s for 30m.
These are some of the go_memstats_ metrics I scraped from the Prometheus metrics endpoint while the experiment was running:
go_memstats_alloc_bytes
go_memstats_alloc_bytes_total
go_memstats_buck_hash_sys_bytes
go_memstats_frees_total
go_memstats_gc_cpu_fraction
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_objects
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_last_gc_time_seconds
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_other_sys_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
Further, while the experiment was running, heap-related metrics from pprof were also gathered to inspect the top memory allocators. Following is the output:
$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in ~/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
File: velero
Type: inuse_space
Time: Jan 29, 2020 at 4:55pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5529.90kB, 100% of 5529.90kB total
Showing top 10 nodes out of 72
flat flat% sum% cum cum%
1805.17kB 32.64% 32.64% 1805.17kB 32.64% compress/flate.NewWriter
650.62kB 11.77% 44.41% 650.62kB 11.77% bytes.makeSlice
513.56kB 9.29% 53.70% 513.56kB 9.29% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
512.17kB 9.26% 62.96% 512.17kB 9.26% github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*clientConnReadLoop).handleResponse
512.17kB 9.26% 72.22% 512.17kB 9.26% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneHeader
512.14kB 9.26% 81.48% 512.14kB 9.26% github.com/vmware-tanzu/velero/vendor/k8s.io/api/autoscaling/v2beta1.init.ializers
512.05kB 9.26% 90.74% 512.05kB 9.26% strings.(*Builder).WriteString
512.01kB 9.26% 100% 512.01kB 9.26% github.com/vmware-tanzu/velero/vendor/github.com/prometheus/client_golang/prometheus.NewGaugeVec
0 0% 100% 902.59kB 16.32% archive/tar.(*Writer).Flush
0 0% 100% 902.59kB 16.32% archive/tar.(*Writer).WriteHeader
The above output indicates that compress/flate.NewWriter is one of the more frequent memory allocators. However, this is not indicative of a memory leak.
From the metrics the following conclusions were drawn:
To that effect, following up on what you have been experiencing, can you please clarify the following:
$ # set up port-forwarding for the velero metric server address
$ kubectl --namespace velero port-forward deploy/velero 8085:8085
$ # while the port forward is active, in another terminal window
$ # scrape the above-mentioned metrics into a file, every 5s, for maybe a couple of hours
$ while true; do curl http://localhost:8085/metrics | grep -E '^velero_backup_tarball_size_bytes|^go_memstats|^go_threads|^process_resident_memory_bytes|^process_virtual_memory_bytes' >> ~/velero-go-memstats.txt; echo "---" >> ~/velero-go-memstats.txt; sleep 5; done
and please share the ~/velero-go-memstats.txt file with us so that we can find out what's going on.
Hi @ashish-amarnath, I think I've been able to find a way to reproduce this. I ran restores every 10s for ~25 mins and observed that Velero Pod memory usage reached the default memory limit of 256MiB. I'm running Velero v1.2 on a K8s cluster on AWS using velero-plugin-for-aws v1.0.0; below are the steps I ran:
kubectl create ns nginx
kubectl run nginx --image=nginx -n nginx
velero create backup nginx-backup --include-namespaces nginx --snapshot-volumes=false
while true; do velero restore create --from-backup=nginx-backup --namespace-mappings nginx:nginx-$(date +%s); sleep 10; done
After running ~230 restores, I started seeing some plugin timeout errors:
k logs deploy/velero -n velero | grep error
time="2020-02-05T16:21:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error=**"timeout while waiting for plugin to start**" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:27:37Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="fork/exec /plugins/velero-plugin-for-aws: **cannot allocate memory**" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:29:07Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:36:48Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:38:18Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:39:48Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:44:08Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get https://***.s3.amazonaws.com/?delimiter=%2F&list-type=2&prefix=velero-mem%2Fbackups%2F: net/http: TLS handshake timeout" error.file="/go/src/github.com/vmware-tanzu/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:308" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T16:45:12Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:47:22Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:48:52Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:52:26Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get https://***.s3.amazonaws.com/?delimiter=%2F&list-type=2&prefix=velero-mem%2Fbackups%2F: net/http: TLS handshake timeout" error.file="/go/src/github.com/vmware-tanzu/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:308" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T16:52:45Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:54:02Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:54:27Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:55:33Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:57:03Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:58:55Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T17:00:25Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:03:52Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:05:22Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:06:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:08:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:09:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:11:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:12:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:14:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:15:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
Heap related metrics from pprof:
go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/betta1/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.029.pb.gz
File: velero
Type: inuse_space
Time: Feb 5, 2020 at 8:58am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 191.20MB, 95.02% of 201.22MB total
Dropped 89 nodes (cum <= 1.01MB)
Showing top 10 nodes out of 42
flat flat% sum% cum cum%
155.13MB 77.09% 77.09% 189.20MB 94.03% compress/flate.NewWriter
32.07MB 15.94% 93.03% 32.07MB 15.94% compress/flate.(*compressor).initDeflate
1.50MB 0.75% 93.78% 1.50MB 0.75% github.com/vmware-tanzu/velero/vendor/github.com/json-iterator/go.(*Iterator).ReadString
1MB 0.5% 94.27% 1.50MB 0.75% github.com/vmware-tanzu/velero/pkg/plugin/clientmgmt.(*restartableProcess).getByKindAndNameLH
1MB 0.5% 94.77% 2MB 0.99% compress/flate.newHuffmanBitWriter
0.50MB 0.25% 95.02% 1.50MB 0.75% github.com/vmware-tanzu/velero/pkg/util/collections.NewIncludesExcludes
0 0% 95.02% 34.07MB 16.93% compress/flate.(*compressor).init
0 0% 95.02% 189.20MB 94.03% compress/gzip.(*Writer).Write
0 0% 95.02% 196.71MB 97.75% github.com/vmware-tanzu/velero/pkg/controller.(*genericController).Run.func2
0 0% 95.02% 196.71MB 97.75% github.com/vmware-tanzu/velero/pkg/controller.(*genericController).processNextWorkItem
(pprof)
I've uploaded the velero-go-memstats.txt to https://gist.github.com/betta1/b8198b56d20311279899f4f68a06060d, please also find velero logs at https://gist.github.com/betta1/d6d6039af3df9bfe8378adfbcc852741.
Grafana metrics while restores are running:

Thanks @betta1! I'll take a look at this.
@betta1, Thank you for all the information on this.
From the metrics you shared, there definitely seems to be a memory leak somewhere.

Also, `155.13MB 77.09% 77.09% 189.20MB 94.03% compress/flate.NewWriter` is indicating the same.
To figure out where this leak may be coming from, I tried to re-run your test for about 1h, but my pprof top looked different from what you saw, and it also changed at various points in time, showing compress/flate.NewWriter as the top allocator only sometimes.
Furthermore, the memory footprint of my test is on the order of KB, as against MB in your case.
(pprof) top
Showing nodes accounting for 5339.04kB, 83.91% of 6363.09kB total
Showing top 10 nodes out of 107
At the times when compress/flate.NewWriter was among the top allocators, looking at the svg output format indicated that the code path leading to the NewWriter function call was a log statement.

So here is my ask: can you please share your `/home/betta1/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.029.pb.gz` pprof profile file? That will help in investigating this further.

Update:
I was able to repro this leak by re-running the experiment using image: velero/velero:v1.2.0
Looking over at the pprof output:
$ go tool pprof --source_path ~/go/src/github.com/vmware-tanzu/velero/ http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in ~/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
File: velero
Type: inuse_space
Time: Feb 5, 2020 at 10:12pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 148.85MB, 99.33% of 149.85MB total
Dropped 29 nodes (cum <= 0.75MB)
Showing top 10 nodes out of 31
flat flat% sum% cum cum%
117.23MB 78.23% 78.23% 143.84MB 95.99% compress/flate.NewWriter
25.11MB 16.75% 94.99% 25.11MB 16.75% compress/flate.(*compressor).initDeflate
4.01MB 2.68% 97.66% 4.01MB 2.68% compress/flate.(*huffmanEncoder).generate
1.50MB 1.00% 98.66% 1.50MB 1.00% compress/flate.newHuffmanBitWriter
1MB 0.67% 99.33% 1MB 0.67% runtime.malg
0 0% 99.33% 4.01MB 2.68% compress/flate.(*Writer).Close
0 0% 99.33% 4.01MB 2.68% compress/flate.(*compressor).close
0 0% 99.33% 4.01MB 2.68% compress/flate.(*compressor).deflate
0 0% 99.33% 26.61MB 17.76% compress/flate.(*compressor).init
0 0% 99.33% 4.01MB 2.68% compress/flate.(*compressor).writeBlock
(pprof) svg
Generating report in profile005.svg
(pprof) list runValidatedRestore
Total: 149.85MB
ROUTINE ======================== github.com/vmware-tanzu/velero/pkg/controller.(*restoreController).runValidatedRestore in pkg/controller/restore_controller.go
0 148.35MB (flat, cum) 99.00% of Total
. . 453: volumeSnapshots, err := info.backupStore.GetBackupVolumeSnapshots(restore.Spec.BackupName)
. . 454: if err != nil {
. . 455: return errors.Wrap(err, "error fetching volume snapshots metadata")
. . 456: }
. . 457:
. 143.84MB 458: restoreLog.Info("starting restore")
. . 459:
. . 460: var podVolumeBackups []*velerov1api.PodVolumeBackup
. . 461: for i := range podVolumeBackupList.Items {
. . 462: podVolumeBackups = append(podVolumeBackups, &podVolumeBackupList.Items[i])
. . 463: }
. . 464: restoreReq := pkgrestore.Request{
. . 465: Log: restoreLog,
. . 466: Restore: restore,
. . 467: Backup: info.backup,
. . 468: PodVolumeBackups: podVolumeBackups,
. . 469: VolumeSnapshots: volumeSnapshots,
. . 470: BackupReader: backupFile,
. . 471: }
. 512.07kB 472: restoreWarnings, restoreErrors := c.restorer.Restore(restoreReq, actions, c.snapshotLocationLister, pluginManager)
. . 473: restoreLog.Info("restore completed")
. . 474:
. 4.01MB 475: if logReader, err := restoreLog.done(c.logger); err != nil {
. . 476: restoreErrors.Velero = append(restoreErrors.Velero, fmt.Sprintf("error getting restore log reader: %v", err))
. . 477: } else {
. . 478: if err := info.backupStore.PutRestoreLog(restore.Spec.BackupName, restore.Name, logReader); err != nil {
. . 479: restoreErrors.Velero = append(restoreErrors.Velero, fmt.Sprintf("error uploading log file to backup storage: %v", err))
. . 480: }
(pprof)
From the above output, the leak is coming from https://github.com/vmware-tanzu/velero/blob/v1.2.0/pkg/controller/restore_controller.go#L458
@betta1
Considering this leak didn't show up in master, can you please try running the test using an image built from master?
Furthermore, as https://github.com/vmware-tanzu/velero/blob/v1.2.0/pkg/controller/restore_controller.go#L458 hasn't changed since v1.2.0, I suspect one of our dependencies fixed this. I will find which change fixed this in master as compared to v1.2.0.
Hi @ashish-amarnath, yes you're right, I'm not able to reproduce this with the most recent release v1.3.0-beta.1 either. I'll run the tests some more and post the update here; so far it looks like memory is released back to the OS after backup/restore, unlike before.
@betta1 Thanks for confirming.
Closing the loop on this.
The leak was fixed as a part of this PR #2201.
https://github.com/vmware-tanzu/velero/pull/2201/files#diff-07ac758a0abc3d782b2530a9ad08ae5cL63-L67
Without this fix, during every restore, the program would wait _forever_ for the errChan to be closed, causing the gzip writer to be held in memory. This code path would have executed on every restore, which lines up with our observation of the leak when performing restores in a loop.
Going to mark this issue as fixed.