What steps did you take and what happened:
Delete a machine
What did you expect to happen:
Currently when we signal a machine for deletion we respect pod safety by draining the node (honouring PDBs), then we delete the Node object, then we signal the InfraMachine for deletion.
https://github.com/kubernetes-sigs/cluster-api/blob/e9038a58adf693d7b54228f145fee85416a94a83/controllers/machine_controller.go#L275-L303
Node deletion is one of the possible mechanisms to force a stateful pod to be deleted.
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md#pod-safety-consistency-guarantees-and-storage-implications
In an edge case where we delete an unreachable node (thereby removing its pods from etcd) while a stateful process is still running on the underlying host, we would be immediately freeing up the pod name in the apiserver. "This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee."
To mitigate this, it would be safer to delete the Node only after the owned InfraMachine is gone. This gives a stronger guarantee that the underlying host is destroyed, so there is no chance of a "leaked" stateful process still running.
Anything else you would like to add:
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

/kind bug
cc @JoelSpeed @detiber @ncdc
👍 this has been in my head for a while but I never got around to filing it. Thank you! This will also speed up deletion, because right now when we remove the Node from the apiserver, the kubelet is still running and trying to update its .status, which fails. I think this causes the shutdown of the kubelet process to take longer than it otherwise would (definitely noticeable delays).
Should the machine controller be deleting the Node at all? I thought this was the responsibility of the K8s cloud controller manager?
From the K8s docs on cloud-controller-manager
node controller - responsible for updating kubernetes nodes using cloud APIs and deleting kubernetes nodes that were deleted on your cloud.
My expectation of how this worked would have been:
I think the important part of this is that the Node is removed after kubelet has stopped as @ncdc suggested. Is there a reason we can't rely on the cloud controller manager for this and have to do it ourselves? (I'm guessing something to do with many providers, not all of them implementing the node deletion part of CCM?)
EDIT: I see this was added in #809 to ensure that nodes are cleaned up if there is no CCM
I see the cloud controller manager as orthogonal here. We want to provide the same provider-agnostic CAPI semantics/behaviour/UX, and not every environment is guaranteed to have a cloud controller manager, nor one that deletes the node, e.g. libvirt, bare metal...
Plus, we are the authoritative source voluntarily destroying the underlying infra; I see no point in keeping the node.
Just to echo some of the points above (and expand a bit) on the reasons we added the Node deletion. It was indeed to remain consistent in cases where a provider may not have an associated CCM, but also to provide consistency in cases where there were different behaviors exhibited by CCMs for different providers. I believe maintaining consistency here is important.
We can definitely talk about trying to get to a point in the future where there is a requirement to have a CCM for a given infrastructure provider, however we would also have to make sure that there is consistent behavior between them as well.
/assign @enxebre
/milestone v0.3.x
/lifecycle active