We need to account for a few things:
From a usability perspective, I think this is a particularly important issue for us to tackle. There are a number of reasons why two Server pods might be running simultaneously, and we need to handle that gracefully.
In addition, there are many ways an InProgress backup can get stuck because a server exits. Again, we need to handle this gracefully. The metrics in #84 (a gauge showing the current number of InProgress or Failed backups) may help visualize these issues, but they won't fix them.
A backup/restore won't get stuck during normal shutdown operations because the Ark server waits for all in-progress work to complete before terminating. If that work takes too long and exceeds the pod's deletion grace period, then Kubernetes would forcefully kill the container, and that would interrupt the in-progress work before it had a chance to finish.
There are, however, plenty of situations where the Ark server could exit while doing work:
This is definitely something we need to handle.
This needs a quick test (from code) to trigger:
I had a thought about how to implement this.
Each ark server process is assigned a unique identifier - the name of the pod (we can get the value using the downward API and pass it to the ark server as a flag).
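Getting the pod name via the downward API might look like the following pod spec fragment (the `--server-id` flag name is hypothetical, just to show how the value could be passed in):

```yaml
# Fragment of the Ark server Deployment pod spec (flag name is illustrative).
containers:
  - name: ark
    image: gcr.io/heptio-images/ark:latest
    args:
      - server
      - --server-id=$(POD_NAME)   # hypothetical flag carrying the unique ID
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name   # downward API: the pod's own name
```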
Each controller worker is also assigned a unique identifier.
When a new item (backup, restore) is processed by a controller, the first thing the controller attempts to do is set status.arkServerID and status.workerID. Assuming that succeeds without a conflict, the worker can proceed to do its work.
When a worker sees an InProgress item, it checks status.arkServerID and status.workerID to see whether the claiming server/worker is still around; if not, the item is a candidate for takeover.
The controller would also need to add event handlers for pods. Upon a change, we'd want to reevaluate all InProgress items to see if they need to be taken over.
There's probably a lot more to flesh out here, but I wanted to write this down before I forgot it.
It would be interesting to either be able to limit the number of concurrent tasks or have the option to use it as it is right now (backup queue).
It would be better for me, because I would like to limit the load on my shared file servers (Ceph RBD / CephFS) while backups are being taken. This way, I can ensure that workloads are not impacted too much by the backup tasks.
@xmath279 Thanks for that feedback!
Will be done after design is finished - https://github.com/vmware-tanzu/velero/issues/2601