What steps did you take and what happened:
We're running backups on a schedule and have observed that the memory usage of the Velero Pod keeps increasing. Backups and restores start failing once memory usage reaches the memory limit of the Velero Pod (we're using velero install's default memory limit of 256Mi). Is there a possible memory leak, since the Velero Pod's memory usage keeps increasing without bound while these scheduled backups run?
This issue is closely related to #780. Below is a Grafana dashboard tracking the memory usage of the Velero pod. Memory usage keeps increasing until the mem limit of 256Mi is reached and at this point we observe backups and restores start failing due to OOM issues.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`

Environment:
- `velero version`: v1.2.0
- `velero client config get features`:
- `kubectl version`:
- `/etc/os-release`:

@betta1 do you know if this is a new issue with v1.2 or whether it existed in v1.1 as well? If you have this info, it would help with debugging.
Also - can you do the following:
- `kubectl -n velero port-forward deployment/velero 6060:6060`
- `go tool pprof http://localhost:6060/debug/pprof/heap`
- run `top` and dump the output here

I think it existed in v1.1 as well, since we had scheduled backups running fine for a week or so and then started seeing intermittent failures for both backups and restores.
Hey Kriss,
I am working with Antony on this issue right now. Please take a look at the below output of pprof.
$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/cheny48/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: velero
Type: inuse_space
Time: Nov 15, 2019 at 9:21am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 81.31MB, 93.66% of 86.81MB total
Showing top 10 nodes out of 131
flat flat% sum% cum cum%
61.70MB 71.08% 71.08% 77.16MB 88.89% compress/flate.NewWriter
14.96MB 17.24% 88.32% 14.96MB 17.24% compress/flate.(*compressor).initDeflate
1MB 1.15% 89.47% 1MB 1.15% runtime.malg
0.64MB 0.73% 90.20% 0.64MB 0.73% github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*ClientConn).frameScratchBuffer
0.50MB 0.58% 90.78% 0.50MB 0.58% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
0.50MB 0.58% 91.36% 0.50MB 0.58% compress/flate.(*huffmanEncoder).generate
0.50MB 0.58% 91.93% 0.50MB 0.58% github.com/vmware-tanzu/velero/pkg/restore.prioritizeResources
0.50MB 0.58% 92.51% 0.50MB 0.58% compress/flate.newHuffmanBitWriter
0.50MB 0.58% 93.09% 0.50MB 0.58% fmt.Sprintf
0.50MB 0.58% 93.66% 1MB 1.15% github.com/vmware-tanzu/velero/pkg/restore.resolveActions
(pprof)
I'll run more backups and provide the output once it hits the memory limit.
Hi Kriss,
This is the output when Velero pod reaches its memory limit.
$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
File: velero
Type: inuse_space
Time: Nov 19, 2019 at 11:46am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 209.56MB, 93.23% of 224.78MB total
Dropped 104 nodes (cum <= 1.12MB)
Showing top 10 nodes out of 85
flat flat% sum% cum cum%
153.37MB 68.23% 68.23% 188.05MB 83.66% compress/flate.NewWriter
34.18MB 15.21% 83.44% 34.18MB 15.21% compress/flate.(*compressor).initDeflate
5MB 2.23% 85.67% 5MB 2.23% fmt.Sprintf
4MB 1.78% 87.45% 4MB 1.78% github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*ClientConn).newStream
3.50MB 1.56% 89.00% 3.50MB 1.56% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneHeader
3MB 1.34% 90.34% 3MB 1.34% reflect.New
2MB 0.89% 91.23% 5.50MB 2.45% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneRequest
1.50MB 0.67% 91.90% 1.50MB 0.67% runtime.malg
1.50MB 0.67% 92.56% 1.50MB 0.67% github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*clientConnReadLoop).handleResponse
1.50MB 0.67% 93.23% 1.50MB 0.67% net/url.parse
@Frank51 @betta1 do you have any more info on this issue since the last update? I'm going to spend some time digging into it this week.
So far we don't have any other info about that.
Are you using pre/post hooks in your environment? As best I can tell, this issue is related to that code, and it looks like if I disable hooks, the leak doesn't exist.
Could you give more detail about what hooks mean? I think I just set the memory limit to 256 MB.
A hook in velero is a command you run in a workload pod before/after backing it up, typically to quiesce/unquiesce the application - see https://velero.io/docs/v1.2.0/hooks/.
If you're using the nginx-example workload, it has hooks configured. Basically any pod with the annotation key pre.hook.backup.velero.io/command.
So, does that mean that if I disable the hook before running the container in the pod, this memory leak issue will be solved?
If I disable hooks, then I don't see the memory leak issue. Just wondering if you can confirm the same for debugging purposes.
It seems like this leak is coming from either the Kube libraries or the https://github.com/docker/spdystream dependency that's used for executing hooks. SPDY is deprecated, so we need to look into whether there's another approach to implementing this.
@skriss we were not running any pre/post hooks when we hit this issue. One possible way to reproduce this is to frequently run backups and restores on a schedule -- we observed the memory usage of the Velero Pod keeps increasing until it reaches the memory limit of the Pod when backups/restores start consistently failing.
we were not running any pre/post hooks when we hit this issue.
Hmm, OK -- the signs I saw in the profiles pointed to an issue with the hook code, but sounds like that shouldn't be the case. We'll have to dig further.
Hi @betta1 and @Frank51 👋
I did some more digging into this but I wasn't able to find a memory leak. Here is what I ran:
I had Velero running on a KinD cluster using the velero-plugin-for-aws, with S3 as the object store.
In the experiment, the created backups had a tarball size of ~100K: a full backup of the cluster, including 5 NGINX deployments as workload.
There were no volumes that needed to be backed up.
There were no restore operations running during my experiment.
As part of the experiment, backups were created every 10s for 30m.
These are some of the go_memstats_ metrics I scraped from the Prometheus metrics endpoint while the experiment was running:
go_memstats_alloc_bytes
go_memstats_alloc_bytes_total
go_memstats_buck_hash_sys_bytes
go_memstats_frees_total
go_memstats_gc_cpu_fraction
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_objects
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_last_gc_time_seconds
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_other_sys_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
Further, while the experiment was running, heap-related metrics from pprof were also gathered to inspect the top memory allocators. Following is the output:
$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in ~/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
File: velero
Type: inuse_space
Time: Jan 29, 2020 at 4:55pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5529.90kB, 100% of 5529.90kB total
Showing top 10 nodes out of 72
flat flat% sum% cum cum%
1805.17kB 32.64% 32.64% 1805.17kB 32.64% compress/flate.NewWriter
650.62kB 11.77% 44.41% 650.62kB 11.77% bytes.makeSlice
513.56kB 9.29% 53.70% 513.56kB 9.29% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
512.17kB 9.26% 62.96% 512.17kB 9.26% github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*clientConnReadLoop).handleResponse
512.17kB 9.26% 72.22% 512.17kB 9.26% github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneHeader
512.14kB 9.26% 81.48% 512.14kB 9.26% github.com/vmware-tanzu/velero/vendor/k8s.io/api/autoscaling/v2beta1.init.ializers
512.05kB 9.26% 90.74% 512.05kB 9.26% strings.(*Builder).WriteString
512.01kB 9.26% 100% 512.01kB 9.26% github.com/vmware-tanzu/velero/vendor/github.com/prometheus/client_golang/prometheus.NewGaugeVec
0 0% 100% 902.59kB 16.32% archive/tar.(*Writer).Flush
0 0% 100% 902.59kB 16.32% archive/tar.(*Writer).WriteHeader
The above output indicates that compress/flate.NewWriter is one of the more frequent memory allocators. However, this is not indicative of a memory leak.
From the metrics the following conclusions were drawn:
To that effect, following up on what you have been experiencing, can you please clarify the following:
$ # set up port-forwarding for the velero metric server address
$ kubectl --namespace velero port-forward deploy/velero 8085:8085
$ # while the port forward is active, in another terminal window
$ # scrape the above-mentioned metrics into a file, every 5s, for maybe a couple of hours
$ while true; do curl http://localhost:8085/metrics | grep -E '^velero_backup_tarball_size_bytes|^go_memstats|^go_threads|^process_resident_memory_bytes|^process_virtual_memory_bytes' >> ~/velero-go-memstats.txt; echo "---" >> ~/velero-go-memstats.txt; sleep 5; done
and please share the ~/velero-go-memstats.txt file with us so that we can find out what's going on.
Hi @ashish-amarnath, I think I've been able to find a way to reproduce this. I ran restores every 10s for ~25 mins and observed that Velero Pod memory usage reached the default memory limit of 256MiB. I'm running Velero v1.2 on a K8s cluster on AWS using velero-plugin-for-aws v1.0.0; below are the steps I ran:
kubectl create ns nginx
kubectl run nginx --image=nginx -n nginx
velero create backup nginx-backup --include-namespaces nginx --snapshot-volumes=false
while true; do velero restore create --from-backup=nginx-backup --namespace-mappings nginx:nginx-$(date +%s); sleep 10; done
After running ~230 restores, I started seeing some plugin timeout errors:
k logs deploy/velero -n velero | grep error
time="2020-02-05T16:21:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error=**"timeout while waiting for plugin to start**" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:27:37Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="fork/exec /plugins/velero-plugin-for-aws: **cannot allocate memory**" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:29:07Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:36:48Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:38:18Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:39:48Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:44:08Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get https://***.s3.amazonaws.com/?delimiter=%2F&list-type=2&prefix=velero-mem%2Fbackups%2F: net/http: TLS handshake timeout" error.file="/go/src/github.com/vmware-tanzu/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:308" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T16:45:12Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:47:22Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:48:52Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:52:26Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get https://***.s3.amazonaws.com/?delimiter=%2F&list-type=2&prefix=velero-mem%2Fbackups%2F: net/http: TLS handshake timeout" error.file="/go/src/github.com/vmware-tanzu/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:308" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T16:52:45Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:54:02Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:54:27Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:55:33Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:57:03Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:58:55Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T17:00:25Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:03:52Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:05:22Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:06:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:08:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:09:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:11:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:12:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:14:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:15:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
Heap related metrics from pprof:
go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/betta1/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.029.pb.gz
File: velero
Type: inuse_space
Time: Feb 5, 2020 at 8:58am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 191.20MB, 95.02% of 201.22MB total
Dropped 89 nodes (cum <= 1.01MB)
Showing top 10 nodes out of 42
flat flat% sum% cum cum%
155.13MB 77.09% 77.09% 189.20MB 94.03% compress/flate.NewWriter
32.07MB 15.94% 93.03% 32.07MB 15.94% compress/flate.(*compressor).initDeflate
1.50MB 0.75% 93.78% 1.50MB 0.75% github.com/vmware-tanzu/velero/vendor/github.com/json-iterator/go.(*Iterator).ReadString
1MB 0.5% 94.27% 1.50MB 0.75% github.com/vmware-tanzu/velero/pkg/plugin/clientmgmt.(*restartableProcess).getByKindAndNameLH
1MB 0.5% 94.77% 2MB 0.99% compress/flate.newHuffmanBitWriter
0.50MB 0.25% 95.02% 1.50MB 0.75% github.com/vmware-tanzu/velero/pkg/util/collections.NewIncludesExcludes
0 0% 95.02% 34.07MB 16.93% compress/flate.(*compressor).init
0 0% 95.02% 189.20MB 94.03% compress/gzip.(*Writer).Write
0 0% 95.02% 196.71MB 97.75% github.com/vmware-tanzu/velero/pkg/controller.(*genericController).Run.func2
0 0% 95.02% 196.71MB 97.75% github.com/vmware-tanzu/velero/pkg/controller.(*genericController).processNextWorkItem
(pprof)
I've uploaded the velero-go-memstats.txt to https://gist.github.com/betta1/b8198b56d20311279899f4f68a06060d, please also find velero logs at https://gist.github.com/betta1/d6d6039af3df9bfe8378adfbcc852741.
Grafana metrics while restores are running:

Thanks @betta1! I'll take a look at this.
@betta1, Thank you for all the information on this.
From the metrics you shared, there definitely seems to be a memory leak somewhere.

Also, `155.13MB 77.09% 77.09% 189.20MB 94.03% compress/flate.NewWriter` is indicating the same.
To figure out where this leak may be coming from, I tried to re-run your test for about 1h, but my pprof top looked different from what you saw, and it also changed at various points in time, showing compress/flate.NewWriter as the top allocator only sometimes.
Furthermore, the memory footprint of my test is on the order of KB, as against MB in your case.
(pprof) top
Showing nodes accounting for 5339.04kB, 83.91% of 6363.09kB total
Showing top 10 nodes out of 107
At the times when compress/flate.NewWriter was among the top allocators, looking at the svg output format indicated that the code path leading to the NewWriter function call was a log statement.

So here is my ask: can you please share your `/home/betta1/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.029.pb.gz` pprof profile file? That will help in investigating this further.

Update:
I was able to repro this leak by re-running the experiment using image: velero/velero:v1.2.0
Looking over at the pprof output:
$ go tool pprof --source_path ~/go/src/github.com/vmware-tanzu/velero/ http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in ~/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
File: velero
Type: inuse_space
Time: Feb 5, 2020 at 10:12pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 148.85MB, 99.33% of 149.85MB total
Dropped 29 nodes (cum <= 0.75MB)
Showing top 10 nodes out of 31
flat flat% sum% cum cum%
117.23MB 78.23% 78.23% 143.84MB 95.99% compress/flate.NewWriter
25.11MB 16.75% 94.99% 25.11MB 16.75% compress/flate.(*compressor).initDeflate
4.01MB 2.68% 97.66% 4.01MB 2.68% compress/flate.(*huffmanEncoder).generate
1.50MB 1.00% 98.66% 1.50MB 1.00% compress/flate.newHuffmanBitWriter
1MB 0.67% 99.33% 1MB 0.67% runtime.malg
0 0% 99.33% 4.01MB 2.68% compress/flate.(*Writer).Close
0 0% 99.33% 4.01MB 2.68% compress/flate.(*compressor).close
0 0% 99.33% 4.01MB 2.68% compress/flate.(*compressor).deflate
0 0% 99.33% 26.61MB 17.76% compress/flate.(*compressor).init
0 0% 99.33% 4.01MB 2.68% compress/flate.(*compressor).writeBlock
(pprof) svg
Generating report in profile005.svg
(pprof) list runValidatedRestore
Total: 149.85MB
ROUTINE ======================== github.com/vmware-tanzu/velero/pkg/controller.(*restoreController).runValidatedRestore in pkg/controller/restore_controller.go
0 148.35MB (flat, cum) 99.00% of Total
. . 453: volumeSnapshots, err := info.backupStore.GetBackupVolumeSnapshots(restore.Spec.BackupName)
. . 454: if err != nil {
. . 455: return errors.Wrap(err, "error fetching volume snapshots metadata")
. . 456: }
. . 457:
. 143.84MB 458: restoreLog.Info("starting restore")
. . 459:
. . 460: var podVolumeBackups []*velerov1api.PodVolumeBackup
. . 461: for i := range podVolumeBackupList.Items {
. . 462: podVolumeBackups = append(podVolumeBackups, &podVolumeBackupList.Items[i])
. . 463: }
. . 464: restoreReq := pkgrestore.Request{
. . 465: Log: restoreLog,
. . 466: Restore: restore,
. . 467: Backup: info.backup,
. . 468: PodVolumeBackups: podVolumeBackups,
. . 469: VolumeSnapshots: volumeSnapshots,
. . 470: BackupReader: backupFile,
. . 471: }
. 512.07kB 472: restoreWarnings, restoreErrors := c.restorer.Restore(restoreReq, actions, c.snapshotLocationLister, pluginManager)
. . 473: restoreLog.Info("restore completed")
. . 474:
. 4.01MB 475: if logReader, err := restoreLog.done(c.logger); err != nil {
. . 476: restoreErrors.Velero = append(restoreErrors.Velero, fmt.Sprintf("error getting restore log reader: %v", err))
. . 477: } else {
. . 478: if err := info.backupStore.PutRestoreLog(restore.Spec.BackupName, restore.Name, logReader); err != nil {
. . 479: restoreErrors.Velero = append(restoreErrors.Velero, fmt.Sprintf("error uploading log file to backup storage: %v", err))
. . 480: }
(pprof)
From the above output, the leak is coming from https://github.com/vmware-tanzu/velero/blob/v1.2.0/pkg/controller/restore_controller.go#L458
@betta1
Considering this leak didn't show up in master, can you please try running the test using an image built from master?
Furthermore, as https://github.com/vmware-tanzu/velero/blob/v1.2.0/pkg/controller/restore_controller.go#L458 hasn't changed since v1.2.0, I suspect one of our dependencies fixed this. I will find which change fixed this in master as compared to v1.2.0.
Hi @ashish-amarnath, yes you're right, I'm not able to reproduce this with the most recent release v1.3.0-beta.1 either. I'll run the tests some more and post the update here; so far it looks like memory is released back to the OS after backup/restore, unlike before.
@betta1 Thanks for confirming.
Closing the loop on this.
The leak was fixed as a part of this PR #2201.
https://github.com/vmware-tanzu/velero/pull/2201/files#diff-07ac758a0abc3d782b2530a9ad08ae5cL63-L67
Without this fix, during every restore, the program would wait _forever_ for the errChan to be closed, causing the gzip writer to be held in memory. This code path would have executed on every restore, which lines up with our observation of the leak when performing restores in a loop.
Going to mark this issue as fixed.