Velero: Spike - Add cpu, memory requests & limits to deployment

Created on 14 Sep 2017 · 27 comments · Source: vmware-tanzu/velero

Measure Ark's memory usage over time, set reasonable values

Labels: Enhancement, Good first issue, Help wanted, P2 - Long-term important, ZD2064


All 27 comments

We should do this pre-1.0 and have some metrics for cluster-size and impact on CPU/Memory.

Just as a baseline comment: we are running 0.9.0 in OpenShift 3.9. We were initially running without any limits when I realized the container had consumed 4 GB of RAM.

I set a limit of 1 GB and now it's restarting with OOM issues and backups are failing. We have 100 scheduled backups (one for each namespace) and currently have 11,883 backups.

Is there a recommended memory setting? Is it based on schedules? Backups?

In case anyone is following the question from @twforeman, we've moved the discussion to #780

xref https://github.com/heptio/velero/issues/1452#issuecomment-491848487 (specifically around adding request/limit flags to velero install).

As part of this issue, we should probably add request/limit flags to velero install to be able to specify these.

I've been monitoring Velero's CPU and memory usage in my GKE cluster using Stackdriver, and I've found that the server pod uses 42M of memory and 5m of CPU at rest. I installed various applications in the cluster, including the nginx-example in this repo and 10 instances of the WordPress Helm chart, to experiment with backup/restore operations. A full cluster backup created the largest spike: 53M memory and 16m CPU (see results of other operations below).

We could generously set requests to 128Mi memory and 100m CPU, and be even more generous with limits (e.g. 256Mi, 200m). What do you think?
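For reference, those proposed values would look like this in the Velero Deployment's container spec (a sketch only; the surrounding manifest layout is assumed, and these numbers are a starting proposal, not an official recommendation):

```yaml
# Proposed starting requests/limits for the Velero server container.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
```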

I need to do some further investigation for restic backups so we can define requests/limits for those pods also.

| operation | memory (M) | cpu (m) |
|-|-|-|
|backup of example nginx|42|8.4|
|backup 1 instance of Helm wordpress|42|8.4|
|backup 10 instances of Helm wordpress|42|12|
|full cluster backup|53|16|
|delete backup|42|6.7|
|delete multiple backups|53|16|
|multiple backup requests (4)|42|14|
|restore nginx-example|53|8.6|

cool!

I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

What about Restic? I assume the daemonset consumes much more CPU/memory when doing filesystem backup & restore.

@prydonius this is great. Let's continue to collect some more info.

I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

Yeah, I imagine this will have an impact - let's look at it.

Things will definitely get fuzzier with restic. The daemonset pods are one thing (that's where restic backups and restores run), but also the restic prune operation probably consumes a decent amount of resources for larger repos, and that runs in the main velero pod.

I wonder if the number of backups/restores that Velero is tracking has an impact on cpu/memory?

Yeah, I imagine this will have an impact - let's look at it.

+1, I'll create a scheduled backup to see how the cpu/memory use grows over time with that.

Things will definitely get fuzzier with restic. The daemonset pods are one thing (that's where restic backups and restores run), but also the restic prune operation probably consumes a decent amount of resources for larger repos, and that runs in the main velero pod.

Ah, that's good to know, when is restic prune run? I'll spend some more time testing restic backups.

It gets run every 24h by default, but you can tweak the frequency: run `kubectl -n velero edit resticrepository NAME` and change `spec.maintenanceFrequency` to run it more often.
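After running that edit command, the relevant spec change might look like this (12h is just an illustrative value):

```yaml
# ResticRepository spec fragment; maintenanceFrequency controls how often
# repository maintenance (restic prune) runs. The default is 24h.
spec:
  maintenanceFrequency: 12h
```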

We could generously set requests to 128Mi memory and 100m CPU, and be even more generous with limits (e.g. 256Mi, 200m). What do you think?

This seems like a reasonable place to start for non-restic deployments.

I definitely want to add flags to velero install for setting these, so they're user tuneable.

We'll also have to find a reasonable way to update the docs so that users (a) know what our baseline recommendations are, (b) understand that they may need to be changed depending on their scenario (particularly if they use restic), and (c) understand how to change them (using the velero install CLI flags or by editing the YAML)
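For users who tune the deployed resources directly rather than reinstalling, something like the following `kubectl` one-liner could apply new values in place (a sketch; the `velero` deployment name and namespace are assumptions based on the default install, and the values are the baseline discussed above, not a universal recommendation):

```shell
# Update requests/limits on the running Velero server Deployment in place.
kubectl -n velero set resources deployment/velero \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=200m,memory=256Mi
```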

Just to add some more data, here is the trend for an hourly full-cluster backup and half-hourly nginx-example backup:

[screenshot: resource usage trend for the hourly full-cluster and half-hourly nginx-example backups]

The CPU spikes are pretty consistent, the memory spikes are less so but it's resting at around 25M (which is lower than what I was previously seeing).

velero install flags w/ pre-defined defaults make a lot of sense to me. We may need 2 sets for the Velero Deployment and the Restic DaemonSet, depending on what the characteristics observed are.
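A sketch of what those two sets of flags could look like on `velero install` (the flag names and values here are hypothetical, to illustrate the shape; the actual names would be settled in the implementation PR):

```shell
# Hypothetical install invocation with separate resource flags for the
# Velero Deployment and the restic DaemonSet. Provider/bucket are placeholders.
velero install \
  --provider gcp \
  --bucket my-velero-bucket \
  --velero-pod-cpu-request 500m \
  --velero-pod-mem-request 128Mi \
  --velero-pod-cpu-limit 1 \
  --velero-pod-mem-limit 512Mi \
  --restic-pod-cpu-request 500m \
  --restic-pod-mem-request 512Mi \
  --restic-pod-cpu-limit 1 \
  --restic-pod-mem-limit 1Gi
```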

For the full cluster backup, how many namespaces were included?

I'm also curious about included number of volumes. I expect even a large number of GCP, AWS, or Azure volumes to have minimal impact on these numbers, given we do it asynchronously.

For the full cluster backup, how many namespaces were included?

5 namespaces

I'm also curious about included number of volumes. I expect even a large number of GCP, AWS, or Azure volumes to have minimal impact on these numbers, given we do it asynchronously.

21 volumes (2 per each WordPress instance, and 1 for the nginx example).

Is there an easy way of spinning up a large amount of workloads/volumes to measure how velero server usage might grow as the amount of workloads/volumes grow?
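One low-effort approach is a shell loop that stamps out namespaces, each with a small Deployment and a PVC (a sketch; names, counts, and sizes are arbitrary, and the PVC assumes a default StorageClass exists):

```shell
# Create 50 throwaway namespaces, each with an nginx Deployment and a 1Gi PVC,
# to observe how the Velero server's resource use scales with cluster size.
for i in $(seq 1 50); do
  ns="velero-load-$i"
  kubectl create namespace "$ns"
  kubectl -n "$ns" create deployment nginx --image=nginx
  kubectl -n "$ns" apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF
done
```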

Some restic results:

  • restic backup of an 80 GB volume: CPU 1000m (the max available on the node), memory 114M
  • restic backup of a 5 GB volume: CPU 200m, memory 10.2M

Seems like restic varies quite a lot depending on the size of data being backed up. I'm also concerned that a limit of 1000m (1 core) on the Pod would have caused it to be restarted when performing the larger backup.

It gets run every 24h by default, but you can tweak the frequency: run `kubectl -n velero edit resticrepository NAME` and change `spec.maintenanceFrequency` to run it more often.

There are slightly larger spikes in the Velero Pod when maintenance is run, but not by a huge amount with the current setup I have:

[screenshot: Velero Pod resource usage during restic repository maintenance]

@prydonius I wonder how it's impacted if you have a restic backup on the 80GB volume running when the restic maintenance is scheduled. I may be remembering incorrectly, but I think it may not actually impact the CPU as the maintenance will get run once the backup's done.

@nrb doesn't the restic backup and maintenance happen in different Pods (restic server and velero server respectively)? I'm not sure what the consequences are of running a prune whilst a backup is happening, though this may be prevented by restic locking?

doesn't the restic backup and maintenance happen in different Pods (restic server and velero server respectively)?

Yep!

I'm not sure what the consequences are of running a prune whilst a backup is happening, though this may be prevented by restic locking?

Yes - prune requires an exclusive lock so would not run concurrently with a backup.

[image: resource usage graph for the full cluster backup described below]

Full cluster backup above including GCE volume snapshots

Namespaces: 317
Total Resources: 15199
Persistent Volumes: 45

Thanks for sharing @Evesy, this is really useful. Looking at the RAM use, I think the default request/limit we settled on (128Mi, 256Mi respectively) should be fine.

The CPU use during your full cluster backups is much higher than what I was seeing with much less resources. We settled on 0.1 request and 0.2 limit, which would clearly not work in your case. I'm not sure how the CPU use scales, but if we were to use your usage as a baseline, a 0.5 request and 1.0 limit could be sufficient?

I may be reading the graph incorrectly, but looking at https://kubernetes.slack.com/archives/C6VCGP4MT/p1563893211023600?thread_ts=1563890835.023500&cid=C6VCGP4MT, it seems that the CPU usage is just under 4.0.

That said, I don't think we're going to land on anything that works for everybody. It may be worthwhile to help users learn how to benchmark it and tune themselves.

@nrb that graph was the CFS throttling; the actual throttled usage is the green line that sits just below the limit. @Evesy removed the limits on their Pod to get the above result.

Ah, I see, I was indeed reading it incorrectly.

The values proposed seem reasonable to me.

Revised #'s seem OK to me. They'll never be right for everyone but they seem sane and can be user-tuned as needed.

@prydonius what (if anything) is left on this issue once #1678 gets merged? I guess we don't have flags/defaults for the restic daemonset yet - anything else?

@skriss yes just the flags/defaults for restic, I think we can close this out after that
