We need to account for a few things:
From a usability perspective, I think this is a particularly important issue for us to tackle. There are a number of reasons why two Server pods might be running simultaneously, and we need to handle that gracefully.
In addition, there are many ways an InProgress backup can get stuck because a server exits. Again, we need to handle this gracefully. The metrics in #84 (a gauge showing the current number of InProgress or Failed backups) may help visualize these issues, but they won't fix them.
A backup/restore won't get stuck during normal shutdown operations because the Ark server waits for all in-progress work to complete before terminating. If that work takes too long and exceeds the pod's deletion grace period, then Kubernetes would forcefully kill the container, and that would interrupt the in-progress work before it had a chance to finish.
There are, however, plenty of situations where the Ark server could exit while doing work:
This is definitely something we need to handle.
This needs a quick test (from code) to trigger:
I had a thought about how to implement this.
Each ark server process is assigned a unique identifier - the name of the pod (we can get the value using the downward API and pass it to the ark server as a flag).
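Getting the pod name via the downward API might look like the following pod spec fragment (the `--server-id` flag name is hypothetical, just to show how the value could be passed in):

```yaml
# Fragment of the Ark server Deployment pod spec (flag name is illustrative).
containers:
  - name: ark
    image: gcr.io/heptio-images/ark:latest
    args:
      - server
      - --server-id=$(POD_NAME)   # hypothetical flag carrying the unique ID
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name   # downward API: the pod's own name
```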
Each controller worker is also assigned a unique identifier.
When a new item (backup, restore) is processed by a controller, the first thing the controller attempts to do is set status.arkServerID and status.workerID. Assuming that succeeds without a conflict, the worker can proceed to do its work.
When a worker sees an InProgress item, it checks status.arkServerID and status.workerID to see whether the claiming server/worker is still around; if not, the item is a candidate for takeover.
The controller would also need to add event handlers for pods. Upon a change, we'd want to reevaluate all InProgress items to see if they need to be taken over.
There's probably a lot more to flesh out here, but I wanted to write this down before I forgot it.
It would be interesting to either be able to limit the number of concurrent tasks or have the option to use it as it is right now (backup queue).
It would be better for me, because I would like to limit the load on my shared file servers (Ceph RBD / CephFS) while backups are being taken. This way, I can ensure that workloads are not impacted too much by the backup tasks.
@xmath279 Thanks for that feedback!
Will be done after design is finished - https://github.com/vmware-tanzu/velero/issues/2601