Currently, for a variety of reasons, it's possible for a restic repository to have a stale lock. We don't have any way of detecting and handling this, so things can get "stuck" from a velero perspective. A stale lock can cause a ResticRepository to go into a NotReady phase, which then blocks all operations on that repository from executing, and I don't think it will ever self-heal from that situation. We should add code to the restic repository controller (I think) that uses the restic unlock command to remove stale locks.
If the implementation for this won't be affected by #1540, we'll move #1540 into v1.2.
I looked at the restic unlock implementation; here are some details:
restic unlock removes "stale locks". A lock is considered stale if it is more than 30 minutes old, or if it was created on the same host that restic unlock is currently running on and the process that created the lock cannot be reached using a SIGHUP signal.

So for velero:
Any stale lock created by the velero server pod should be removed as soon as the restic unlock process is run, since the lock will have been created on the same host and the process will no longer be reachable. If for some reason it's not removed at that point, it will eventually be removed once it's >30min old.

This all sounds like desirable behavior to me, and I also don't see this changing in any significant way when we get to #1540.
cc @carlisia @nrb @prydonius
Yep, that all sounds reasonable.
Does a naive implementation on our side that just runs restic unlock every X minutes make sense? If most of the logic's contained in restic unlock itself, then there doesn't seem to be a whole lot of reason to add complicated logic on our side.
Also, given that one of the conditions requires being on the same host, does this need to run within the daemonset pods and the main velero pod?
> Does a naive implementation on our side that just runs restic unlock every X minutes make sense? If most of the logic's contained in restic unlock itself, then there doesn't seem to be a whole lot of reason to add complicated logic on our side.
Yep, I'm thinking this is pretty straightforward - just a periodic call per ResticRepository.
> Also, given that one of the conditions requires being on the same host, does this need to run within the daemonset pods and the main velero pod?
I don't think it needs to run in the daemonset pods -- any stale locks from those will still get cleared out after 30min by running it only in the velero server pod -- but if there's a need, we can look at it.
Sounds good to me!