Cluster-api: clusterctl backup/restore

Created on 3 Aug 2020 · 8 comments · Source: kubernetes-sigs/cluster-api

Related slack thread: https://kubernetes.slack.com/archives/C8TSNPY4T/p1596471116438700

User Story

As an operator I would like to take backups of a workload cluster's CAPx resources on the management cluster in order to be able to restore this backup to a different management cluster in a disaster recovery scenario (total loss of management cluster).

Detailed Description

There is a lot of code in clusterctl move that ensures clusters are paused, objects are created in the correct order, and controller and owner references are set correctly.
All this exact same logic also applies to taking and restoring backups.

The idea would be to take a lot of code from /cmd/clusterctl/client/cluster/mover.go and /cmd/clusterctl/client/cluster/objectgraph.go, move some of it into a new library, and build backup and restore commands.

At the top level, I see the backup performing the following steps:

  1. Pause the Cluster
  2. Retrieve the UnstructuredList from a given namespace (same as mover.go)
  3. Dump this list to a JSON file on disk

The restore would:

  1. Read the UnstructuredList from a file on disk (the namespace can then be inferred from the objects in that list)
  2. Build the objectgraph
  3. Use the new equivalent of getMoveSequence to figure out in which order to restore.
  4. Restore the objects
  5. Un-pause the Cluster

Anything else you would like to add:

Depending on how this code ends up structured, this could become a new public package which could be imported by something like a Velero plugin. This would make Velero inherently aware of CAPx without duplicating too much code.

/kind feature
/area clusterctl


All 8 comments

cc @nrb @carlisia @ashish-amarnath

/assign

I can take a look at this

So basically, we should create a JSON file containing all the information needed to perform the move action. But the move action only happens between the bootstrap cluster and the workload cluster, while the desired use case for this backup/restore covers both the bootstrap cluster (before move) and the workload cluster (after move), correct?

It is not clear to me whether we are going to implement two new top-level commands or expose backup and restore as move options, e.g.

clusterctl move --to-file (backup)
clusterctl move --from-file (restore)

However, I would break the implementation down into two logical parts.

  • The easiest part to implement is backup, which is similar to a dry run except that it dumps all the resources to a file.
  • Restore, instead, is more complex because you have to rebuild the object graph from a file before triggering the move logic.

Also, given that the target scenario is recovery from a disaster, I think the pause/unpause logic should not be triggered.
Definitely +1 to getting this exposed as a library func.

We should probably first figure out the plan for move #3354

ok, I will wait for #3354 before working on this, thanks for the reminder @vincepri @fabriziopandini
or do you think clusterctl move --to-file (backup) can be implemented anyway?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale
