Cluster-api: machine remains undeleted while cluster has been deleted

Created on 24 Oct 2019  ·  15 Comments  ·  Source: kubernetes-sigs/cluster-api

What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]

  • Create a cluster, then create Machines via a MachineSet; everything works.
  • Delete the cluster.
  • Find that a Machine remains stuck in the deleting state; the machine controller logs "can not find cluster".

What did you expect to happen:
All resources under the cluster should be deleted successfully.

Anything else you would like to add:
The root cause of this bug is that the cluster controller deletes its children in the background (in this case only the MachineSet; the Machines are children of the MachineSet) and lists its children 5 seconds later; if none exist anymore, the cluster controller proceeds to delete its infra resource. Meanwhile, the MachineSet controller uses the Kubernetes garbage collector to delete the Machine resources. If we delete the MachineSet with the Background propagationPolicy, the MachineSet object is deleted immediately, so the next time the cluster controller lists its children, the MachineSet is no longer found. But the Machines being deleted in the background may still be in the middle of processing by the machine controller, which needs to get the Cluster object in its reconcile function, so their finalizers are never removed.
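The race can be seen in a toy model, in plain Go with no Kubernetes client. `Store` and its methods are illustrative names only, not cluster-api or client-go APIs:

```go
package main

import "fmt"

// Store holds objects by name; Machines record their owning MachineSet.
// This is a toy model of the GC race described above.
type Store struct {
	machineSets map[string]bool
	machines    map[string]string // machine name -> owner MachineSet
}

// DeleteMachineSetBackground mimics propagationPolicy=Background: the
// MachineSet object disappears immediately, while its Machines are merely
// marked for deletion and reaped later by the machine controller (which,
// per this issue, still needs the Cluster in its reconcile).
func (s *Store) DeleteMachineSetBackground(name string) {
	delete(s.machineSets, name)
}

// ListDirectChildren is what the cluster controller does ~5s after issuing
// deletes: it only looks for its direct children (the MachineSets).
func (s *Store) ListDirectChildren() int { return len(s.machineSets) }

// RemainingMachines counts Machines still pending deletion.
func (s *Store) RemainingMachines() int { return len(s.machines) }

func main() {
	s := &Store{
		machineSets: map[string]bool{"ms-1": true},
		machines:    map[string]string{"m-1": "ms-1", "m-2": "ms-1"},
	}
	s.DeleteMachineSetBackground("ms-1")
	fmt.Println(s.ListDirectChildren()) // 0 -> cluster controller proceeds
	fmt.Println(s.RemainingMachines())  // 2 -> Machines still need the Cluster
}
```

The cluster controller sees zero children and tears down the Cluster, even though two Machines are still mid-deletion.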

Solutions
I tried the Foreground propagationPolicy and it solved this case, but I'm not sure whether it is a proper solution.
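For contrast with the background case, foreground deletion keeps the parent visible until all of its dependents are gone, which is roughly why it resolves this race. A toy model of that semantic (illustrative names only; the real garbage collector implements this with a `foregroundDeletion` finalizer on the parent):

```go
package main

import "fmt"

// ForegroundStore is a toy model of propagationPolicy=Foreground: the API
// server keeps the parent object around (blocked by a foregroundDeletion
// finalizer) until every dependent is gone, so a later List by the cluster
// controller still sees the MachineSet.
type ForegroundStore struct {
	machineSetDeleting bool
	machines           int
}

// DeleteMachineSetForeground marks the MachineSet as deleting; it is only
// actually removed once its dependent Machines have been reaped.
func (s *ForegroundStore) DeleteMachineSetForeground() { s.machineSetDeleting = true }

// ReapOneMachine simulates the machine controller finishing one deletion.
func (s *ForegroundStore) ReapOneMachine() {
	if s.machines > 0 {
		s.machines--
	}
}

// MachineSetVisible reports whether a List by the cluster controller would
// still find the MachineSet.
func (s *ForegroundStore) MachineSetVisible() bool {
	return !s.machineSetDeleting || s.machines > 0
}

func main() {
	s := &ForegroundStore{machines: 2}
	s.DeleteMachineSetForeground()
	fmt.Println(s.MachineSetVisible()) // true: still blocks the cluster teardown
	s.ReapOneMachine()
	s.ReapOneMachine()
	fmt.Println(s.MachineSetVisible()) // false: safe for the cluster to proceed
}
```

Because the MachineSet remains listable while Machines are being deleted, the cluster controller cannot race ahead of them.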

Notes
Here is a table describing each resource's ownerRef and finalizer in cluster-api and its providers:

| Resource | ownerRef Name | ownerRef Set By | Finalizer Name | Finalizer Set By |
| -- | -- | -- | -- | -- |
| cluster | / | / | cluster.cluster.x-k8s.io | cluster-controller |
| infra cluster | cluster | cluster-controller | self-defined-infra-cluster-finalizer | clusterinfra-controller |
| machineDeployment | cluster | machineDeployment controller | / | / |
| machineSet (created alone) | cluster | machineSet controller | / | / |
| machineSet (created by machineDeployment) | machineDeployment | machineDeployment controller | / | / |
| machine (created alone) | cluster | machine controller | machine.cluster.x-k8s.io | machine-controller |
| machine (created by machineSet) | machineSet | machineSet controller | machine.cluster.x-k8s.io | machine-controller |
| machine infra | machine | machine controller | self-defined-infra-machine-finalizer | machineinfra-controller |
| bootstrap | machine | machine controller |  |  |

All 15 comments

@detiber @ncdc @vincepri any good idea?

Thanks for the detailed research! Foreground deletion is a good idea, but I think it might be better to allow a machine to be deleted in the absence of a cluster. I think this will be required if you start from scratch, create a machine, and then try to delete it (without creating any cluster).

> Thanks for the detailed research! Foreground deletion is a good idea, but I think it might be better to allow a machine to be deleted in the absence of a cluster. I think this will be required if you start from scratch, create a machine, and then try to delete it (without creating any cluster).

In the normal case, we need the cluster to exist when deleting a machine, because we need to delete the Kubernetes Node in the workload cluster.

So maybe we need both foreground deletion from the cluster's code + we need to allow a machine to be deleted if there's no cluster for it.

Edit: nevermind on that, I'm going back to my original comment of not blocking machine deletion if the cluster is missing

> So maybe we need both foreground deletion from the cluster's code + we need to allow a machine to be deleted if there's no cluster for it.
I think the "allow no cluster" part is not that simple; it is hard to decide whether we need a cluster...

+1 on allowing dependent objects to be deleted when the parent isn't there

My concern with foreground deletion is that it would block any additional events from happening while waiting on the deletion, I think we should try to maintain asynchronous behavior here.

Looking at a few scenarios:

  1. Given a Cluster, Machine(s) for control plane, MachineDeployment(s) for workers:
     1. Delete Cluster
        1. Cluster controller starts deleting children & the sequencing is such that the Cluster is deleted before all the Machines are deleted (this issue)
        2. Opinion: if the Cluster is gone, blocking the deletion of the Machine is unnecessary
  2. Given a Machine and nothing else:
     1. Delete Machine
        1. Fails because Cluster does not exist
        2. Opinion: if the Cluster is gone, blocking the deletion of the Machine is unnecessary
  3. Given a Cluster, Machine(s) for control plane, MachineDeployment(s) for workers:
     1. Delete a single Machine
        1. Expectation is that Cluster exists so the MachineReconciler can talk to the workload cluster to delete the Node for this Machine
        2. Opinion: this is the happy path

Am I missing anything?
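The policy these scenarios converge on can be sketched as a single decision helper. `machineDeleteAction` is a hypothetical name for illustration, not actual cluster-api code:

```go
package main

import "fmt"

// machineDeleteAction sketches the deletion policy suggested by the
// scenarios above: if the Cluster is gone, nothing should block the
// Machine's deletion; if the Cluster exists, the workload cluster's Node
// must be removed first. Illustrative only, not cluster-api code.
func machineDeleteAction(clusterExists bool) string {
	if !clusterExists {
		// Scenarios 1 and 2: no Cluster means no workload cluster to
		// talk to, so just remove the finalizer and let the Machine go.
		return "remove finalizer"
	}
	// Scenario 3 (happy path): delete the Node via the workload cluster,
	// then remove the finalizer.
	return "delete node, then remove finalizer"
}

func main() {
	fmt.Println(machineDeleteAction(false)) // Cluster already gone
	fmt.Println(machineDeleteAction(true))  // happy path
}
```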

Does the issue stem from the way we are filtering "child" resources here?

Should we also ensure that we block on any Machine* resources that are associated with the Cluster, rather than just the direct children (while still only issuing deletes to direct children)?

> Should we also ensure that we block on any Machine* resources that are associated with the Cluster, rather than just the direct children (while still only issuing deletes to direct children)?

That could be an improvement to the current logic. But I also think we need to allow Machine deletions when the Cluster doesn't exist.

> That could be an improvement to the current logic. But I also think we need to allow Machine deletions when the Cluster doesn't exist.

No objections there.

Opened https://github.com/kubernetes-sigs/cluster-api/issues/1649 to allow machine deletions when the cluster is not there.

Ok, did some testing. The only time I was able to get the machine deletion to hang is with this flow:

  1. Create machine
  2. Create cluster
  3. Remove cluster finalizers
  4. Delete cluster
  5. Try to delete machine

This simulates what is outlined in this issue, where the cluster accidentally gets deleted before all the machines. I think if we do what @detiber suggested, and block removing the cluster's finalizer if there are any descendants (direct or indirect), this should solve the problem. Thoughts?
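The suggested gate can be sketched as a pure predicate. The names below are hypothetical; the real check would presumably List MachineDeployments, MachineSets, and Machines associated with the Cluster:

```go
package main

import "fmt"

// Descendants counts a Cluster's descendants, direct or indirect.
type Descendants struct {
	MachineDeployments int
	MachineSets        int
	Machines           int
}

// canRemoveClusterFinalizer returns true only once every descendant is
// gone, so the Cluster object outlives all Machines and stays resolvable
// from the machine controller's reconcile. Illustrative sketch only.
func canRemoveClusterFinalizer(d Descendants) bool {
	return d.MachineDeployments == 0 && d.MachineSets == 0 && d.Machines == 0
}

func main() {
	// Machines still pending (even indirectly owned): keep the finalizer.
	fmt.Println(canRemoveClusterFinalizer(Descendants{Machines: 2})) // false
	// Nothing left: the Cluster's finalizer can finally be removed.
	fmt.Println(canRemoveClusterFinalizer(Descendants{})) // true
}
```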

> block removing the cluster's finalizer if there are any descendants (direct or indirect), this should solve the problem

This is the solution I'd prefer; it's simple and it's the behavior I'd expect if I were to use CAPI today.

PTAL at #1650
