Clusterctl move today uses the Cluster's paused spec or pause annotation to block reconciliation on objects that are being moved. In past versions of clusterctl, we actually preferred to stop all deployments in both the source and target management clusters before moving any object, to avoid slow workers or long-running operations still being in flight while an object is moved and then reconciled somewhere else.
We can interact directly with deployments by first storing the current number of replicas in an annotation or somewhere else, and then scaling all deployments to zero. After moving all objects, only deployments in the target cluster should be scaled up.
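The bookkeeping described above could be sketched as follows. This is a minimal sketch, not actual clusterctl code: the annotation key and the trimmed-down Deployment type are hypothetical stand-ins for the real appsv1.Deployment.

```go
package main

import (
	"fmt"
	"strconv"
)

// replicasAnnotation is a hypothetical annotation key used to remember the
// original replica count before scaling a controller Deployment to zero.
const replicasAnnotation = "clusterctl.cluster.x-k8s.io/previous-replicas"

// Deployment is a minimal stand-in for appsv1.Deployment, keeping only the
// fields this sketch needs.
type Deployment struct {
	Annotations map[string]string
	Replicas    int32
}

// pause records the current replica count in an annotation, then scales the
// deployment down to zero.
func pause(d *Deployment) {
	if d.Annotations == nil {
		d.Annotations = map[string]string{}
	}
	d.Annotations[replicasAnnotation] = strconv.Itoa(int(d.Replicas))
	d.Replicas = 0
}

func main() {
	d := &Deployment{Replicas: 3}
	pause(d)
	fmt.Println(d.Replicas, d.Annotations[replicasAnnotation]) // prints "0 3"
}
```

Storing the count in an annotation (rather than in clusterctl's memory) means the information survives even if the move is interrupted and retried.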
/kind feature
/priority important-longterm
Just as an FYI, clusterctl also performs additional checks that try to detect whether all long-running operations are completed (e.g. infrastructureReady = true, plus others I can't remember off the top of my head)
It should be noted that by scaling down deployments we stop reconciliation for all clusters, not only the one we are moving, but this should not be a problem for the bootstrap/pivoting use case at least
Yeah, I'm fine with stopping for everyone, especially given that we're considering rewriting move to move the entire management cluster; the extra safety is definitely worth it
maybe the check is something like below?
https://github.com/kubernetes-sigs/cluster-api/blob/master/cmd/clusterctl/client/cluster/mover.go#L106
so the proposed way to handle this is to:
1) before the move, record the current replica count of each deployment (what about daemonsets, replicasets, etc.?) somewhere, then scale down to 0
2) move the objects to the target cluster
3) scale the replicas back up to the desired number once the move operation completes
/milestone v0.4.0
Setting v1alpha4 given that this is a behavioral change
before the move, record the current replica count of each deployment (what about daemonsets, replicasets, etc.?) somewhere, then scale down to 0
As far as I know, we only use deployments for Cluster API controllers, although we should make it a contract going forward, or standardize on a fixed set of supported ways to get a controller up
The rest looks good. If we go down this path, is it still worth pausing/unpausing the clusters?
Is the pause only there to prevent the deployments from scaling back up? If so, we can drop it. If it has an additional purpose, such as preventing new deployments, then we certainly can't, right?
we only use deployments for Cluster API controllers, although we should make it a contract going forward
this is already documented in the clusterctl contract https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html#controllers--watching-namespace
Is the pause only there to prevent the deployments from scaling back up? If so, we can drop it. If it has an additional purpose, such as preventing new deployments, then we certainly can't, right?
Not really, it's just to give us more control over the flow, although not strictly necessary if we spin down/up the controllers.
this is already documented in the clusterctl contract https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html#controllers--watching-namespace
Another example to keep in mind for conformance: I don't think there is anything in clusterctl init that stops an infrastructure provider from using, for example, a StatefulSet instead of a Deployment?
I don't think there is anything in clusterctl init that stops an infrastructure provider from using, for example, a StatefulSet instead of a Deployment?
In clusterctl we are assuming there are Deployments in order to do mutations, e.g. to watching namespaces. I have to check whether the image override feature also relies on the same assumption
whether the image override feature also relies on the same assumption
@fabriziopandini any update on this? I want to implement this, so I want to avoid going in the wrong direction... thanks
Also, we need to think about how to revert if something goes wrong. For example, if an error happens during the move action, how do we scale them back, etc.?
From the following (a move dry-run), it seems the only object to be moved is the MachineDeployment?
Creating target namespaces, if missing
Creating objects in the target cluster
Creating Cluster="wff-test" Namespace="default"
Creating AWSCluster="wff-test" Namespace="default"
Creating KubeadmConfigTemplate="wff-test-md-0" Namespace="default"
Creating KubeadmControlPlane="wff-test-control-plane" Namespace="default"
Creating MachineDeployment="wff-test-md-0" Namespace="default"
Creating AWSMachineTemplate="wff-test-control-plane" Namespace="default"
Creating AWSMachineTemplate="wff-test-md-0" Namespace="default"
Creating Machine="wff-test-control-plane-s9hsw" Namespace="default"
Creating MachineSet="wff-test-md-0-5955cfb58d" Namespace="default"
Creating Secret="wff-test-ca" Namespace="default"
Creating Secret="wff-test-etcd" Namespace="default"
Creating Secret="wff-test-kubeconfig" Namespace="default"
Creating Secret="wff-test-proxy" Namespace="default"
Creating Secret="wff-test-sa" Namespace="default"
Creating AWSMachine="wff-test-control-plane-qv95s" Namespace="default"
Creating Machine="wff-test-md-0-5955cfb58d-hbk5z" Namespace="default"
Creating KubeadmConfig="wff-test-control-plane-96gj6" Namespace="default"
Creating KubeadmConfig="wff-test-md-0-ggklr" Namespace="default"
Creating Secret="wff-test-control-plane-96gj6" Namespace="default"
Creating AWSMachine="wff-test-md-0-qrc94" Namespace="default"
Creating Secret="wff-test-md-0-ggklr" Namespace="default"
@jichenjc yes, image overrides also assume providers are using a Deployment
From the following (a move dry-run), it seems the only object to be moved is the MachineDeployment?
There might be a little bit of confusion here.
The issue is about scaling controller Deployments to 0 before the move and back to X after, not about moving them.
MachineDeployment objects instead should be moved without any change to their number of replicas (as it is now)
Hey folks, it seems there might be a lot of confusion here, so I'll do my best to summarize the proposed solution.
Let's assume the following:
clusterctl init has already been executed on both management clusters, and all versions match (this is a pre-requisite for clusterctl move).
clusterctl move is used to move the management cluster from A to B.
At this point, clusterctl usually moves objects by pausing all Clusters managed on Management Cluster A, copying them, and then unpausing them once they're created on Management Cluster B.
What we've been wanting to add is:
kubectl scale --replicas=0 on all deployments that run controllers (NOT the webhooks), in both management clusters.
Hope this helps!
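A minimal sketch of how those controller deployments could be selected, assuming the `cluster.x-k8s.io/provider` label from the clusterctl provider contract identifies them (the trimmed-down Deployment type is hypothetical, and a real implementation would also need to skip webhook-only deployments):

```go
package main

import "fmt"

// providerLabel is the label clusterctl applies to provider components;
// deployments carrying it are treated as controller deployments here.
const providerLabel = "cluster.x-k8s.io/provider"

// Deployment is a minimal stand-in for appsv1.Deployment.
type Deployment struct {
	Name   string
	Labels map[string]string
}

// controllerDeployments filters a deployment list down to the ones that
// belong to Cluster API providers.
func controllerDeployments(all []Deployment) []Deployment {
	var out []Deployment
	for _, d := range all {
		if _, ok := d.Labels[providerLabel]; ok {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	all := []Deployment{
		{Name: "capi-controller-manager", Labels: map[string]string{providerLabel: "cluster-api"}},
		{Name: "coredns", Labels: map[string]string{"k8s-app": "kube-dns"}},
	}
	for _, d := range controllerDeployments(all) {
		fmt.Println(d.Name) // prints "capi-controller-manager"
	}
}
```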
/area cluserctl
@fabriziopandini: The label(s) area/cluserctl cannot be applied, because the repository doesn't have them
/area clusterctl
/assign
let me try this ..
The following is only for my own development reference, just recording it here:
1) use kubectl scale --replicas=0 deployment capi-controller-manager -n capi-system to scale down the controller
2) Cluster.Spec.Paused will be set, and it will be used mainly to control new changes (need to check further here)