Cluster-api: machine-controller: block deletion with finalizer/annotation

Created on 10 Oct 2019 · 31 comments · Source: kubernetes-sigs/cluster-api

User Story

As an operator I would like to have the ability to build custom components that integrate into the lifecycle of a machine. Specifically, I would like to have the ability to perform different actions between the time a machine is deleted in the API and the time the machine is deleted from the cloud.

One use-case is to ensure that a new master machine has been successfully created and joined the cluster before deleting a master machine. This might be useful in case there are disruptions during replacement and we need the disk of the existing master to perform some disaster recovery operation.

This method is mostly designed to let users/operators 'hook' into the lifecycle of a machine as they see fit. This might also be helpful for developing integrations with new clouds or other types of infrastructure that need more fine-grained signaling.

Detailed Description

Support a new finalizer. The finalizer should have a well-known name, and possibly support a list (CSV or similar) that different controllers can append to and remove from; this would allow an arbitrary number of controllers without having to limit integrations.

Ex:

machines-delete-hooks: "myfinalizer1,myfinalizer2"

Perhaps we should do this before drain? Perhaps we should have two separate finalizers, one before drain, one before delete? I think probably just having the finalizer work before drain is going to cover most of what someone would want, and we could always add a second one later.
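For illustration, a minimal sketch (in Go) of how a cooperating controller might append and remove its entry in such a comma-separated list; the key name reuses the example above, and none of the helper names are a settled API:

import "strings"

const hookKey = "machines-delete-hooks" // illustrative key from the example above

// addHook adds entry to the comma-separated hook list if it is not already present.
func addHook(annotations map[string]string, entry string) {
  list := splitHooks(annotations[hookKey])
  for _, e := range list {
    if e == entry {
      return
    }
  }
  annotations[hookKey] = strings.Join(append(list, entry), ",")
}

// removeHook deletes entry; the machine-controller would proceed once the list is empty.
func removeHook(annotations map[string]string, entry string) {
  var kept []string
  for _, e := range splitHooks(annotations[hookKey]) {
    if e != entry {
      kept = append(kept, e)
    }
  }
  if len(kept) == 0 {
    delete(annotations, hookKey)
    return
  }
  annotations[hookKey] = strings.Join(kept, ",")
}

// splitHooks tolerates the empty string so a missing annotation yields no entries.
func splitHooks(v string) []string {
  if v == "" {
    return nil
  }
  return strings.Split(v, ",")
}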

Goals

    1. Allow users/operators more control over a machine's life cycle.

Non-Goals

    1. API changes.

Anything else you would like to add:

During discussions of adding drain to the machine-controller, there was talk about letting an external component handle that. Users could implement their own server-side drain controller (or use a community one) and use the exclude-draining annotation to replace the machine-controller's built-in drain functionality if they so chose. There are probably lots of different use cases and this would just be one tool we add for flexibility.

/kind feature

area/machine kind/feature kind/proposal lifecycle/frozen priority/important-longterm

All 31 comments

Thank you for filing this! I have been thinking about a need for this too.

@michaelgugino I do like this idea; however, how would a service implementing one of these hooks know that it is time to act? Would we need to add some type of status field to tell the hook service that it is time to attempt to act and remove its entry from the finalizer?

I also wonder if having each hook register its own separate finalizer might make sense in this case, since it would avoid potential conflicts around adding/appending to the list in a single finalizer.

We'll need to make sure that the logic we have around deletion and handling finalizers works properly with this workflow as well. We wouldn't want to re-add the machine finalizer while waiting for these hooks to actuate, or re-reconcile the deletion if it has already completed on the machine controller side.

Instead of adding more finalizers to the Machine, which might require granting external controllers write permissions on Cluster API resources, we could add a list of external CRD dependencies in Spec that we wait for, like we do today for bootstrap and infrastructure references.

Something like (don't mind the name too much):

Dependencies []corev1.ObjectReference `json:"dependencies,omitempty"`

Each dependency is an object reference to _any_ external object. The external utilities we have today will:

__On reconcile__

  • Check for existence
  • Add an OwnerReference to the Machine
  • Wait for Status.Ready = true

__On delete__

  • Issue a deletion request for each reference
  • Wait for references to disappear from APIServer
  • Remove the Machine's finalizer when all external objects are deleted
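To make the reconcile-side wait concrete, a rough sketch; dependenciesReady is a hypothetical helper, and both the Dependencies field and the Status.Ready convention are proposals from this comment, not existing Cluster API behavior:

import (
  "context"

  corev1 "k8s.io/api/core/v1"
  "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
  "sigs.k8s.io/controller-runtime/pkg/client"
)

// dependenciesReady fetches each referenced external object and checks status.ready.
// The Machine reconciler would requeue until this returns true.
func dependenciesReady(ctx context.Context, c client.Client, refs []corev1.ObjectReference) (bool, error) {
  for _, ref := range refs {
    u := &unstructured.Unstructured{}
    u.SetAPIVersion(ref.APIVersion)
    u.SetKind(ref.Kind)
    if err := c.Get(ctx, client.ObjectKey{Namespace: ref.Namespace, Name: ref.Name}, u); err != nil {
      return false, err
    }
    ready, found, err := unstructured.NestedBool(u.Object, "status", "ready")
    if err != nil || !found || !ready {
      return false, err
    }
  }
  return true, nil
}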

Would it be appropriate to table the implementation details discussion for now, and move this to a CAEP?

edit: nevermind. keep talking here!

Could this be used to clean up etcd and kubeadm metadata when replacing control plane nodes?

Could this be used to clean up etcd and kubeadm metadata when replacing control plane nodes?

Potentially, but I think having explicit management around it within the ControlPlane management implementation as defined within the CAEP draft would be less error prone.

What if we did something like this (pretend we don't have cordon & drain built-in & enabled for all machines):

apiVersion: cluster.x-k8s.io/v1beta3
kind: MachineDeletionHookRegistration
metadata:
  name: cordon-drain.cluster.x-k8s.io
spec:
  selector: env=prod

When CAPI reconciles a Machine, it performs a List of all MachineDeletionHookRegistrations, optionally matching the selector with the Machine's labels, and then sets machine.spec.deletionHooks to an array of the names of all the hooks.

Each hook implementation watches Machines, and when it sees a Machine with a nonzero deletion timestamp, the hook performs its action. When the hook completes, it maybe updates machine.status.deletionHooks[$name]... with completed/failed, error info, etc.

The machine reconciler would not proceed to delete the referenced infrastructure object nor would it remove the single machine finalizer until all the deletion hooks have completed.

With this approach, the end user creating the machine is not burdened with the need to make sure they fill in spec.deletionHooks. Instead, the system does it automatically based on what the CAPI administrator has configured.
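A sketch of that gate; spec.deletionHooks and status.deletionHooks are hypothetical fields from this proposal, not real CAPI API:

// allHooksCompleted reports whether every hook named in spec.deletionHooks has
// written a "completed" result into status.deletionHooks.
func allHooksCompleted(specHooks []string, statusHooks map[string]string) bool {
  for _, name := range specHooks {
    if statusHooks[name] != "completed" {
      return false
    }
  }
  return true
}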

That seems like a good approach to me, I'd make the hook generic though, something like:

apiVersion: cluster.x-k8s.io/v1beta3
kind: HookRegistration
metadata:
  name: cordon-drain.cluster.x-k8s.io
  labels:
     <target api group>: cluster...
     <target type>: Machine...
spec:
  target: deletion <--- should be a typed constant defined under machine_types.go or similar
  selector: env=prod

The labels could be specified in spec and a reconciler could backfill those; the reason to keep them is to quickly list all the hooks related to a specific type. The APIGroup could be unnecessary; I haven't thought much about it.
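As a sketch of that quick listing, assuming backfilled labels with hypothetical keys (the placeholders above made concrete) and a controller-runtime client c:

// List all hook registrations targeting Machines via the backfilled labels.
hooks := &unstructured.UnstructuredList{}
hooks.SetAPIVersion("cluster.x-k8s.io/v1beta3")
hooks.SetKind("HookRegistrationList")
err := c.List(ctx, hooks, client.MatchingLabels{
  "hook.cluster.x-k8s.io/target-group": "cluster.x-k8s.io", // hypothetical label key
  "hook.cluster.x-k8s.io/target-kind":  "Machine",          // hypothetical label key
})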

+1 to generic.

I'd move what you have in metadata.labels under spec.

What if we did something like this (pretend we don't have cordon & drain built-in & enabled for all machines):

apiVersion: cluster.x-k8s.io/v1beta3
kind: MachineDeletionHookRegistration
metadata:
  name: cordon-drain.cluster.x-k8s.io
spec:
  selector: env=prod

When CAPI reconciles a Machine, it performs a List of all MachineDeletionHookRegistrations, optionally matching the selector with the Machine's labels, and then sets machine.spec.deletionHooks to an array of the names of all the hooks.

Each hook implementation watches Machines, and when it sees a Machine with a nonzero deletion timestamp, the hook performs its action. When the hook completes, it maybe updates machine.status.deletionHooks[$name]... with completed/failed, error info, etc.

The machine reconciler would not proceed to delete the referenced infrastructure object nor would it remove the single machine finalizer until all the deletion hooks have completed.

With this approach, the end user creating the machine is not burdened with the need to make sure they fill in spec.deletionHooks. Instead, the system does it automatically based on what the CAPI administrator has configured.

I don't think this approach is any better than the finalizers if the other components have to report status by updating the machine object (which implies write permissions). If we're having other components update the machine object anyway, let's just put the deletion-blocking info into the machine directly vs. a CR. It could be a finalizer, some dictionary/list in the status field, an annotation, or whatever.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/lifecycle frozen

I'd like this to be a candidate feature for v1alpha4

Sounds good; it probably needs a better spec in proposal form, but if someone wants to try to tackle that, +1

@ncdc
So, one alternative, kind of a hybrid of two approaches outlined here.

Instead of having another CRD, we just utilize the status of the machine object. If we're proposing that 3rd party controllers are updating this field, then let's do that up front. If a machine has no deletion timestamp, some other controller can update an individual machine object with an entry into a hook list.

status:
  hooks:
    pre-drain:
      - name: some-hook
        owner: some-owner
    pre-delete: []

It would be up to the individual hook controllers to decide when to add their hook to the status field. I think most will want to add after it has a valid node-ref.

Once the hook controller has determined it's finished with an individual machine, it will remove the entry from the status object on the machine (or maybe set complete: true).

One question is: what if we want to disable/remove one of these hooks after the machine has been deleted? One way to accomplish this would be with an annotation to skip one or more hooks.

This kind of brings us full circle. We might as well just use annotations. They are easy for users to delete, and they are pretty straightforward. We're already using an annotation to skip drain (effectively a lifecycle hook already).

Unfortunately, annotations don't support structured types; they are map[string]string.

Maybe we can do something like

metadata:
  annotations:
    predrain.hook.machine.cluster-api.x-k8s.io/my-hook: owner-name

Controller code is simple:

// hasPreDrainHook reports whether the Machine carries any pre-drain hook annotation.
func hasPreDrainHook(annotations map[string]string) bool {
  for key := range annotations {
    if strings.HasPrefix(key, "predrain.hook.machine.cluster-api.x-k8s.io") {
      return true
    }
  }
  return false
}
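In the Machine reconciler's delete path, that check would gate drain along these lines; the receiver, requeue interval, and surrounding flow are illustrative, not the actual controller code:

// reconcileDelete waits for external controllers to remove their hook annotations
// before draining and deleting infrastructure.
func (r *MachineReconciler) reconcileDelete(machine *clusterv1.Machine) (ctrl.Result, error) {
  if hasPreDrainHook(machine.GetAnnotations()) {
    // An external controller owns removing its hook annotation; requeue until it does.
    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
  }
  // ...continue with drain and infrastructure deletion.
  return ctrl.Result{}, nil
}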

I'd like to avoid the use of Status for implementing this feature for a couple of reasons:

  • It would not persist across backup/restore or clusterctl move operations
  • Having external controllers updating a Status for a resource means we will have competing controllers updating the spec/status of the same resource, which is something that we were trying to avoid in the past.

I can't say I'm a huge fan of the idea of an external controller modifying resources to "register" itself as a pre/post deletion hook (rather than it being pre-configured on the resource at creation time), but if we do go down that path, I think Annotations is probably the best way to achieve that.

I can't say I'm a huge fan of the idea of an external controller modifying resources to "register" itself as a pre/post deletion hook (rather than it being pre-configured on the resource at creation time), but if we do go down that path, I think Annotations is probably the best way to achieve that.

@detiber that's a good point, and I don't disagree. There isn't just one way to do it if we use annotations. If you want to apply them at creation, you can. If you want some automated process to apply them based on whatever logic you decide, you can.

I agree we should avoid using status for this.

I don't have any huge issues with an external controller here, but I don't love the fact that there is a period of time during which the annotation is absent (Machine is brand new, persisted to etcd, and the external controller hasn't processed it yet). For delete hooks, this is probably fine, because the rapid creation and then immediate deletion of a Machine doesn't make much sense (although it's still possible). If we had other hooks that operated earlier in the lifecycle, it's possible such a gap could be a problem.

A couple of other random thoughts come to mind, although I'm not sure we'd want to do either of these here:

1) We could see if there's some way to tie in to the templating proposal (#2339) as a means to add extra metadata (i.e. hooks). I'm not sure this is super practical, especially if you have controllers creating machines from templates (KCP, MD/MS).

2) OpenShift has its Project virtual resource fronting Namespaces, along with project templates. Depending on how you set up RBAC, you can deny regular users the ability to create Namespaces, and instead allow them to create project requests. If your policy auto-approves, the project request turns into a namespace, based on the project template. This design could be mapped to Machines, where a user or controller would create a "machine request" and a controller would combine that with templates/logic to realize a real Machine.

I don't have any huge issues with an external controller here, but I don't love the fact that there is a period of time during which the annotation is absent (Machine is brand new, persisted to etcd, and the external controller hasn't processed it yet). For delete hooks, this is probably fine, because the rapid creation and then immediate deletion of a Machine doesn't make much sense (although it's still possible). If we had other hooks that operated earlier in the lifecycle, it's possible such a gap could be a problem.

Deletion of a self-hosted management cluster is the one case where creation/deletion happens in somewhat quick succession as a course of normal operation.

I forgot to post my comment here last week, apologies. I was going to propose to move this proposal to a full CAEP / Google Doc, and possibly see if we can merge this use case with #2846, it was mentioned in there that we might want to have multiple remediation steps before deletion, which should fit the use cases described in this proposal.

I'd prefer to keep the issues separate. That proposal is about the MHC doing something other than deleting the machine object (such as rebooting the machine, or performing some other unrelated action). This is mostly about pausing the machine-controller from doing some portion of its normal operation based on some input.

I don't have any huge issues with an external controller here, but I don't love the fact that there is a period of time during which the annotation is absent (Machine is brand new, persisted to etcd, and the external controller hasn't processed it yet). For delete hooks, this is probably fine, because the rapid creation and then immediate deletion of a Machine doesn't make much sense (although it's still possible). If we had other hooks that operated earlier in the lifecycle, it's possible such a gap could be a problem.

@ncdc as I mentioned above, having the controller apply the initial annotation (or what have you) is optional. If we use annotations, the user (or whoever creates the machine object) can apply those annotations at creation time, if appropriate.

In the case of Masters / Control Plane hosts, the 'don't delete until replacement is created' hook is something that could be applied at creation time.

A use case is an administrator or some controller has identified some condition on a node. They want to ensure the node is not drained or deleted until they have investigated or allowed some process to run until completion on that host. An example of this might be running a converged storage node, and you need to ensure the blocks are replicated before getting rid of the node (eg, your storage cluster is healthy, able to lose one node). This would be especially true if your storage system is not directly integrated with k8s (eg, runs on the host directly instead of as pods).

The use cases are numerous.

@michaelgugino #2846 started as that, but it quickly turned out to be larger than expected.

We're expecting folks there to bring up a proposal, a quick recap from last week's conversation:

  • Want to give any external controller the ability to perform some steps before remediation happens.
  • Applies to any Machine, not just MHC-managed machines or KCP ones.
  • The initial idea was to apply an annotation on the Machine, which (as you mentioned) the Machine controller would understand and wait for.
  • A possible approach using a new CRD was also proposed.

We can unify our efforts, and potentially try to see if we can use the experiments package to try this out in v1alpha3.

@vincepri I disagree that they are related.

#2846 is about applying some annotation (or in the worst case, make some remediation record CRD) to a machine for some other controller to do work related to the remediation. The machine-controller is not in charge of doing anything it wouldn't normally do. In the case they are proposing, the machine-object is not deleted. They want to apply an annotation to the machine-object and reboot the host (or perform some other action).

Right now, MHC just deletes the machine. They want the MHC to not delete it, make some other component delete it or reboot it, or do whatever, based on some yet to be determined mechanism. I can say I'm not a fan of that right now, and I don't want what I'm proposing here to hinge on what may or may not come out of that proposal.

What I'm proposing is lifecycle hooks: places where users can direct the machine-controller to pause during its normal operation (creation, deletion, at key points such as drain or instance removal).

/kind proposal

Let's move to a Google Doc CAEP whenever possible and post the link here.

/area machine

this is a great discussion, thanks for kicking it off Mike.

i do have a question though, will we need to preserve some sort of ordering for these deletion finalizers? (i saw a mention of pre-drain and pre-delete in one of the examples)

reason i am asking is because, imo, from a user perspective it would be really nice to see a Machine resource and be able to inspect a key that has a simple list, or something, of the finalizer actions that will occur. for example (using the original idea):

machine-delete-hooks: "finalizer1, finalizer2, finalizer3"

i can easily understand that on delete finalizer1 will occur, then 2, then 3. i think we should pay mind to how a user will interpret these finalizers. when i think about the CRD-oriented approach where there are watchers happening behind the scenes, i do wonder if there is a precedence or ordering that will be preserved.

@elmiko I'm going to draw up a full CAEP. If we go with the annotation method I proposed here: https://github.com/kubernetes-sigs/cluster-api/issues/1514#issuecomment-615504053

Then ordering doesn't matter, at least not in the context of the machine-controller. Other components can decide ordering among themselves, if at all. The machine-controller just doesn't move on to the next action if one or more hooks are in place.

E.g., let's say you have two hook annotations:

predrain.hook.machine.cluster-api.x-k8s.io/hook1: my-controller-1
predrain.hook.machine.cluster-api.x-k8s.io/hook2: my-controller-2

my-controller-1 might not know anything about my-controller-2, and it just does what it does. "Okay, I see a machine with a deletion timestamp and the annotation I'm in charge of, I do my thing and remove the annotation."

Meanwhile, my-controller-2 knows about my-controller-1. It has logic to wait for the removal of hook1.

Personally, I would expect the use case of multiple hooks on the same machine object to be quite limited.

If someone really wants to get sophisticated, they might have my-controller-1 set some annotation like 'storage-cleanup.machine.my.org/complete: true' and have my-controller-2 check for that condition before proceeding. This would be one way to enforce ordering, and might make a useful optional section in the CAEP.
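A sketch of that optional ordering trick; both annotation keys reuse examples from this thread, and the completion marker is purely hypothetical:

// hook2ShouldAct lets my-controller-2 defer to my-controller-1 by waiting for an
// explicit completion marker rather than watching hook1's annotation directly.
func hook2ShouldAct(annotations map[string]string) bool {
  return annotations["storage-cleanup.machine.my.org/complete"] == "true"
}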

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

@vincepri: Closing this issue.

In response to this:

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
