Related slack thread: https://kubernetes.slack.com/archives/C8TSNPY4T/p1596471116438700
User Story
As an operator I would like to take backups of a workload cluster's CAPx resources on the management cluster in order to be able to restore this backup to a different management cluster in a disaster recovery scenario (total loss of management cluster).
Detailed Description
There is a lot of code in clusterctl move that ensures clusters are paused, objects are created in the correct order, and controller and owner references are set correctly.
All this exact same logic also applies to taking and restoring backups.
The idea would be to take a lot of code from /cmd/clusterctl/client/cluster/mover.go and /cmd/clusterctl/client/cluster/objectgraph.go, move some of it into a new library, and build backup and restore commands.
At the top level, I see the backup performing the following steps:
- Pause the Cluster
- UnstructuredList from a given namespace (same as mover.go)

The restore would:

- UnstructuredList from file on disk (namespace can then be inferred from the objects in that list)
- Rebuild the objectgraph
- getMoveSequence to figure out in which order to restore
- Unpause the Cluster

Anything else you would like to add:
Depending on how this code ends up structured, this could become a new public package which could be imported by something like a Velero plugin. This would make Velero inherently aware of CAPx without duplicating too much code.
/kind feature
/area clusterctl
cc @nrb @carlisia @ashish-amarnath
/assign
I can take a look at this
So basically, we should create a JSON file that contains all the information needed to perform the move action. But the move action only occurs between bootstrap ==> workload cluster, while the desired use case for this backup/restore applies both on the bootstrap cluster (before move) and on the workload cluster (after move), correct?
It is not clear to me whether we are going to implement two new top-level commands or make backup and restore options of move, e.g.
clusterctl move --to-file (backup)
clusterctl move --from-file (restore)
However, I would break the implementation down into two logical parts.
Also, given that the target scenario is recovery from a disaster, I think the Pause/Unpause logic should not be triggered.
Definitely +1 to getting this exposed as a library func
We should probably first figure out the plan for move #3354
ok, I will wait for #3354 before working on this, thanks for the reminder @vincepri @fabriziopandini
or do you think clusterctl move --to-file (backup) can be implemented anyway?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale