Velero: add restic stale lock checking

Created on 20 May 2019 · 5 Comments · Source: vmware-tanzu/velero

For a variety of reasons, it's possible for a restic repository to have a stale lock. We currently have no way of detecting and handling this, so things can get "stuck" from a Velero perspective. A stale lock can cause a ResticRepository to go into a NotReady phase, which then blocks all operations on that repository from executing, and I don't think it will ever really self-heal from that situation. We should add code to the restic repository controller (I think) that uses the restic unlock command to remove stale locks.

Enhancement/User · P1 - Important · Restic · Restic - GA

All 5 comments

If the implementation for this won't be affected by #1540, we'll move #1540 into v1.2.

I looked at the restic unlock implementation; here are some details:

  • first of all, long-running restic operations refresh their lock every 5 minutes by creating a new lock file in the repo with an updated timestamp and removing the old one
  • restic unlock removes "stale locks". A lock is considered stale if:

    • it's more than 30 minutes old; or

    • it was created on the same host as restic unlock is currently running on, and the process that created the lock cannot be reached using a SIGHUP signal

So for velero:

  • if a daemonset pod terminates abnormally and leaves around a lock, it'll be removed after ~30 minutes
  • if the main velero server pod terminates abnormally and leaves around a lock, it'll also be removed after ~30 min (since in Kubernetes, the host name = the pod name, so the new Velero pod will have a different host name than the old one that left the lock around)
  • if a restic process within the velero server pod terminates abnormally and leaves around a lock, it'll be removed the next time the restic unlock process is run, since the lock will have been created on the same host and the process will no longer be reachable. If for some reason it's not removed at that point, it will eventually be removed once it's >30min old.

This all sounds like desirable behavior to me, and I also don't see this changing in any significant way when we get to #1540.

cc @carlisia @nrb @prydonius

Yep, that all sounds reasonable.

Does a naive implementation on our side that just runs restic unlock every X minutes make sense? If most of the logic's contained in restic unlock itself, then there doesn't seem to be a whole lot of reason to add complicated logic on our side.

Also, given that one of the conditions requires being on the same host, does this need to run within the daemonset pods and the main velero pod?

> Does a naive implementation on our side that just runs restic unlock every X minutes make sense? If most of the logic's contained in restic unlock itself, then there doesn't seem to be a whole lot of reason to add complicated logic on our side.

Yep, I'm thinking this is pretty straightforward - just a periodic call per ResticRepository.

> Also, given that one of the conditions requires being on the same host, does this need to run within the daemonset pods and the main velero pod?

I don't think it needs to run in the daemonset pods -- any stale locks from those will still get cleared out after 30min by running it only in the velero server pod -- but if there's a need, we can look at it.

Sounds good to me!
