What steps did you take and what happened:
Delete a machine
What did you expect to happen:
Currently when we signal a machine for deletion we respect pod safety by draining the node (honouring PDBs), then we delete the Node object, then we signal the InfraMachine for deletion.
https://github.com/kubernetes-sigs/cluster-api/blob/e9038a58adf693d7b54228f145fee85416a94a83/controllers/machine_controller.go#L275-L303
Node deletion is one of the possible mechanisms to force a stateful pod to be deleted.
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md#pod-safety-consistency-guarantees-and-storage-implications
In an edge case where we delete an unreachable node (thereby removing its pods from etcd) while a stateful process is still running on the underlying host, we would be immediately freeing up the pod name in the apiserver. "This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee."
To mitigate this, it would be safer to delete the Node only after the owned InfraMachine is gone. This gives a stronger guarantee that the underlying host is destroyed, so there is no chance of a "leaked" stateful process still running.
Anything else you would like to add:
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

/kind bug
cc @JoelSpeed @detiber @ncdc
👍 this has been in my head for a while but I never got around to filing it. Thank you! This will also speed up deletion, because right now when we remove the Node from the apiserver, the kubelet is still running and trying to update its .status, which fails. I think this causes the shutdown of the kubelet process to take longer than it otherwise would (definitely noticeable delays).
Should the machine controller be deleting the Node at all? I thought this was the responsibility of the K8s cloud controller manager?
From the K8s docs on cloud-controller-manager
node controller - responsible for updating kubernetes nodes using cloud APIs and deleting kubernetes nodes that were deleted on your cloud.
My expectation of how this worked would have been:
I think the important part of this is that the Node is removed after kubelet has stopped as @ncdc suggested. Is there a reason we can't rely on the cloud controller manager for this and have to do it ourselves? (I'm guessing something to do with many providers, not all of them implementing the node deletion part of CCM?)
EDIT: I see this was added in #809 to ensure that nodes are cleaned up if there is no CCM
I see the cloud controller manager as orthogonal here. We want to provide the same provider-agnostic CAPI semantics/behaviour/UX, and not every environment is guaranteed to have a cloud controller manager, nor one that deletes the node, e.g. libvirt, bare metal...
Plus, we are the authoritative source voluntarily destroying the underlying infra; I see no point in keeping the node.
Just to echo some of the points above (and expand a bit) on the reasons we added the Node deletion. It was indeed to remain consistent in cases where a provider may not have an associated CCM, but also to provide consistency in cases where there were different behaviors exhibited by CCMs for different providers. I believe maintaining consistency here is important.
We can definitely talk about trying to get to a point in the future where there is a requirement to have a CCM for a given infrastructure provider, however we would also have to make sure that there is consistent behavior between them as well.
/assign @enxebre
/milestone v0.3.x
/lifecycle active