Cluster-api: machine remains undeleted while cluster has been deleted

Created on 24 Oct 2019  ·  15 Comments  ·  Source: kubernetes-sigs/cluster-api

What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]

  • Create a cluster, then create Machines via a MachineSet; everything works.
  • Delete the cluster.
  • Find that a Machine remains stuck in the deleting state; the machine controller logs "can not find cluster".

What did you expect to happen:
All resources under the cluster should be deleted successfully.

Anything else you would like to add:
The root cause of this bug is that the cluster controller deletes its children in the background (in this case only the MachineSet; the Machines are children of the MachineSet) and lists its children 5 seconds later; if none exist anymore, the cluster controller proceeds to delete its infra resource. Meanwhile, the MachineSet controller uses the Kubernetes garbage collector to delete the Machine resources. If we delete the MachineSet with the Background propagationPolicy, the MachineSet object is deleted immediately, so the next time the cluster controller lists its children, the MachineSet is no longer found. But the Machines being deleted in the background may still be in the middle of processing by the machine controller, which needs to get the Cluster object in its reconcile function, so their finalizers are never removed.
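The race can be seen in a toy model, in plain Go with no Kubernetes client. `Store` and its methods are illustrative names only, not cluster-api or client-go APIs:

```go
package main

import "fmt"

// Store holds objects by name; Machines record their owning MachineSet.
// This is a toy model of the GC race described above.
type Store struct {
	machineSets map[string]bool
	machines    map[string]string // machine name -> owner MachineSet
}

// DeleteMachineSetBackground mimics propagationPolicy=Background: the
// MachineSet object disappears immediately, while its Machines are merely
// marked for deletion and reaped later by the machine controller (which,
// per this issue, still needs the Cluster in its reconcile).
func (s *Store) DeleteMachineSetBackground(name string) {
	delete(s.machineSets, name)
}

// ListDirectChildren is what the cluster controller does ~5s after issuing
// deletes: it only looks for its direct children (the MachineSets).
func (s *Store) ListDirectChildren() int { return len(s.machineSets) }

// RemainingMachines counts Machines still pending deletion.
func (s *Store) RemainingMachines() int { return len(s.machines) }

func main() {
	s := &Store{
		machineSets: map[string]bool{"ms-1": true},
		machines:    map[string]string{"m-1": "ms-1", "m-2": "ms-1"},
	}
	s.DeleteMachineSetBackground("ms-1")
	fmt.Println(s.ListDirectChildren()) // 0 -> cluster controller proceeds
	fmt.Println(s.RemainingMachines())  // 2 -> Machines still need the Cluster
}
```

The cluster controller sees zero children and tears down the Cluster, even though two Machines are still mid-deletion.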

Solutions
I tried the Foreground propagationPolicy and it solved this case, but I'm not sure whether it is a proper solution.
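For contrast with the background case, foreground deletion keeps the parent visible until all of its dependents are gone, which is roughly why it resolves this race. A toy model of that semantic (illustrative names only; the real garbage collector implements this with a `foregroundDeletion` finalizer on the parent):

```go
package main

import "fmt"

// ForegroundStore is a toy model of propagationPolicy=Foreground: the API
// server keeps the parent object around (blocked by a foregroundDeletion
// finalizer) until every dependent is gone, so a later List by the cluster
// controller still sees the MachineSet.
type ForegroundStore struct {
	machineSetDeleting bool
	machines           int
}

// DeleteMachineSetForeground marks the MachineSet as deleting; it is only
// actually removed once its dependent Machines have been reaped.
func (s *ForegroundStore) DeleteMachineSetForeground() { s.machineSetDeleting = true }

// ReapOneMachine simulates the machine controller finishing one deletion.
func (s *ForegroundStore) ReapOneMachine() {
	if s.machines > 0 {
		s.machines--
	}
}

// MachineSetVisible reports whether a List by the cluster controller would
// still find the MachineSet.
func (s *ForegroundStore) MachineSetVisible() bool {
	return !s.machineSetDeleting || s.machines > 0
}

func main() {
	s := &ForegroundStore{machines: 2}
	s.DeleteMachineSetForeground()
	fmt.Println(s.MachineSetVisible()) // true: still blocks the cluster teardown
	s.ReapOneMachine()
	s.ReapOneMachine()
	fmt.Println(s.MachineSetVisible()) // false: safe for the cluster to proceed
}
```

Because the MachineSet remains listable while Machines are being deleted, the cluster controller cannot race ahead of them.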

Notes
Here is a table describing each resource's ownerRef and finalizer in cluster-api and its providers:

| Resource | ownerRef Name | ownerRef Set By | Finalizer Name | Finalizer Set By |
| -- | -- | -- | -- | -- |
| cluster | / | / | cluster.cluster.x-k8s.io | cluster-controller |
| infra cluster | cluster | cluster-controller | self-defined-infra-cluster-finalizer | clusterinfra-controller |
| machineDeployment | cluster | machineDeployment controller | / | / |
| machineSet (created alone) | cluster | machineSet controller | / | / |
| machineSet (created by machineDeployment) | machineDeployment | machineDeployment controller | / | / |
| machine (created alone) | cluster | machine controller | machine.cluster.x-k8s.io | machine-controller |
| machine (created by machineSet) | machineSet | machineSet controller | machine.cluster.x-k8s.io | machine-controller |
| machine infra | machine | machine controller | self-defined-infra-machine-finalizer | machineinfra-controller |
| bootstrap | machine | machine controller |  |  |

All 15 comments

@detiber @ncdc @vincepri any good idea?

Thanks for the detailed research! Foreground deletion is a good idea, but I think it might be better to allow a machine to be deleted in the absence of a cluster. I think this will be required if you start from scratch, create a machine, and then try to delete it (without creating any cluster).

> Thanks for the detailed research! Foreground deletion is a good idea, but I think it might be better to allow a machine to be deleted in the absence of a cluster. I think this will be required if you start from scratch, create a machine, and then try to delete it (without creating any cluster).

In the normal case, we need the cluster to exist when deleting a machine, because we need to delete the Kubernetes Node in the workload cluster.

So maybe we need both foreground deletion from the cluster's code + we need to allow a machine to be deleted if there's no cluster for it.

Edit: nevermind on that, I'm going back to my original comment of not blocking machine deletion if the cluster is missing

> So maybe we need both foreground deletion from the cluster's code + we need to allow a machine to be deleted if there's no cluster for it.
I think the "allow no cluster" part is not that simple; it is hard to decide whether we need a cluster...

+1 on allowing dependent objects to be deleted when the parent isn't there

My concern with foreground deletion is that it would block any additional events from happening while waiting on the deletion, I think we should try to maintain asynchronous behavior here.

Looking at a few scenarios:

  1. Given a Cluster, Machine(s) for control plane, MachineDeployment(s) for workers:
     1. Delete Cluster
        1. Cluster controller starts deleting children & the sequencing is such that the Cluster is deleted before all the Machines are deleted (this issue)
        2. Opinion: if the Cluster is gone, blocking the deletion of the Machine is unnecessary
  2. Given a Machine and nothing else:
     1. Delete Machine
        1. Fails because Cluster does not exist
        2. Opinion: if the Cluster is gone, blocking the deletion of the Machine is unnecessary
  3. Given a Cluster, Machine(s) for control plane, MachineDeployment(s) for workers:
     1. Delete a single Machine
        1. Expectation is that Cluster exists so the MachineReconciler can talk to the workload cluster to delete the Node for this Machine
        2. Opinion: this is the happy path

Am I missing anything?
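The policy these scenarios converge on can be sketched as a single decision helper. `machineDeleteAction` is a hypothetical name for illustration, not actual cluster-api code:

```go
package main

import "fmt"

// machineDeleteAction sketches the deletion policy suggested by the
// scenarios above: if the Cluster is gone, nothing should block the
// Machine's deletion; if the Cluster exists, the workload cluster's Node
// must be removed first. Illustrative only, not cluster-api code.
func machineDeleteAction(clusterExists bool) string {
	if !clusterExists {
		// Scenarios 1 and 2: no Cluster means no workload cluster to
		// talk to, so just remove the finalizer and let the Machine go.
		return "remove finalizer"
	}
	// Scenario 3 (happy path): delete the Node via the workload cluster,
	// then remove the finalizer.
	return "delete node, then remove finalizer"
}

func main() {
	fmt.Println(machineDeleteAction(false)) // Cluster already gone
	fmt.Println(machineDeleteAction(true))  // happy path
}
```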

Does the issue stem from the way we are filtering "child" resources here?

Should we also ensure that we block on any Machine* resources that are associated with the Cluster, rather than just the direct children (while still only issuing deletes to direct children)?

> Should we also ensure that we block on any Machine* resources that are associated with the Cluster, rather than just the direct children (while still only issuing deletes to direct children)?

That could be an improvement to the current logic. But I also think we need to allow Machine deletions when the Cluster doesn't exist.

> That could be an improvement to the current logic. But I also think we need to allow Machine deletions when the Cluster doesn't exist.

No objections there.

Opened https://github.com/kubernetes-sigs/cluster-api/issues/1649 to allow machine deletions when the cluster is not there.

Ok, did some testing. The only time I was able to get the machine deletion to hang is with this flow:

  1. Create machine
  2. Create cluster
  3. Remove cluster finalizers
  4. Delete cluster
  5. Try to delete machine

This simulates what is outlined in this issue, where the cluster accidentally gets deleted before all the machines. I think if we do what @detiber suggested, and block removing the cluster's finalizer if there are any descendants (direct or indirect), this should solve the problem. Thoughts?
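The suggested gate can be sketched as a pure predicate. The names below are hypothetical; the real check would presumably List MachineDeployments, MachineSets, and Machines associated with the Cluster:

```go
package main

import "fmt"

// Descendants counts a Cluster's descendants, direct or indirect.
type Descendants struct {
	MachineDeployments int
	MachineSets        int
	Machines           int
}

// canRemoveClusterFinalizer returns true only once every descendant is
// gone, so the Cluster object outlives all Machines and stays resolvable
// from the machine controller's reconcile. Illustrative sketch only.
func canRemoveClusterFinalizer(d Descendants) bool {
	return d.MachineDeployments == 0 && d.MachineSets == 0 && d.Machines == 0
}

func main() {
	// Machines still pending (even indirectly owned): keep the finalizer.
	fmt.Println(canRemoveClusterFinalizer(Descendants{Machines: 2})) // false
	// Nothing left: the Cluster's finalizer can finally be removed.
	fmt.Println(canRemoveClusterFinalizer(Descendants{})) // true
}
```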

> block removing the cluster's finalizer if there are any descendants (direct or indirect), this should solve the problem

This is the solution I'd prefer; it's simple and it's the behavior I'd expect if I were to use CAPI today.

PTAL at #1650
