This issue is to track the ability to specify which machines should be removed from a MachineSet when it is scaling down. This is a requirement to support Cluster Autoscaler use cases.
The initial strawman of using annotations to mark which machines should be removed would work, but there are concerns about not having an atomic operation for this kind of change.
cc @krousey @mwielgus @krzysztof-jastrzebski @maciekpytel
Also see https://github.com/kubernetes-sigs/cluster-api/issues/45#issuecomment-380944237
I can see the following annotation-based approach as low-hanging fruit:
- priority:3 via annotation – the MachineDeployment adds this annotation to all machines at creation.
- priority:1 – set on a machine to mark it for removal ahead of the rest.

Though, for a long-term solution we might want to learn from Pod Priority and PriorityClass: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass
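A minimal Go sketch of how such annotation-based ordering could work when a MachineSet scales down; the annotation key, the fallback value of 3, and the helper names are assumptions for illustration, not an agreed-upon API:

```go
package machineset

import (
	"sort"
	"strconv"
)

// deletePriorityAnnotation is a hypothetical key, not an agreed-upon API.
const deletePriorityAnnotation = "cluster.k8s.io/delete-priority"

// machine is a stand-in for the cluster-api Machine type, reduced to what
// this sketch needs.
type machine struct {
	Name        string
	Annotations map[string]string
}

// priorityOf falls back to 3, the "normal" priority a MachineDeployment
// would stamp on every machine at creation time.
func priorityOf(m machine) int {
	if v, ok := m.Annotations[deletePriorityAnnotation]; ok {
		if p, err := strconv.Atoi(v); err == nil {
			return p
		}
	}
	return 3
}

// pickForDeletion returns the n lowest-priority machines, so machines the
// autoscaler re-annotated with priority 1 are picked before the rest.
func pickForDeletion(machines []machine, n int) []machine {
	sort.SliceStable(machines, func(i, j int) bool {
		return priorityOf(machines[i]) < priorityOf(machines[j])
	})
	if n > len(machines) {
		n = len(machines)
	}
	return machines[:n]
}
```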
We can have multiple delete strategies, analogous to the image pull policy, e.g.:
```yaml
---
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineSet
metadata:
  name: <NAME>
  namespace: <NAMESPACE>
  labels:
    ...
spec:
  replicas: 2
  deletePolicy: <POLICY>
  selector:
    ...
  template:
    ...
```
where deletePolicy selects the strategy to use. It can default to Random. The goal is to unblock the integration of the cluster API into the autoscaler in case we don't reach agreement in a reasonable time.
The code in question: https://github.com/kubernetes-sigs/cluster-api/blob/master/pkg/controller/machineset/controller.go#L325-L329
The autoscaler would only work with the lowest-priority policy though, right? Otherwise the autoscaler would try to scale down by removing a specific machine (by changing its priority), and when the machine set was resized a random or oldest machine would be deleted instead.
> The initial strawman of using annotations to mark which machines should be removed would work, but there are concerns about not having an atomic operation for this kind of change.

Is there any solution that provides an atomic operation?
The MachineSet controller should not change the number of replicas, since it's up to MachineSet consumers to do that. The controller's responsibility is to either create new machines or delete existing ones (at the same time updating the MachineSet's status). Based on that, there is no mechanism to protect a MachineSet from very rapidly alternating changes of the replicas field. In case a specific machine is annotated, the MachineSet controller still lists all potential machines to be deleted. It only changes the lottery into an archery contest: hit the bull's eye first. It's likely the following can happen:
So ok, we can have another routine that will listen for machines and, every time a machine gets annotated, remove it and decrease the replicas:
- replicas has not changed
- replicas changed: oh, a different machine gets deleted, since the list of machines to be deleted is not sorted chronologically.

The replicas logic and the deletion of a specific machine are tightly connected.
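To make the race concrete, here is a rough sketch of such a routine; the machineAPI interface and its methods are hypothetical stand-ins for separate API-server requests:

```go
package machineset

import "context"

// machineAPI is a hypothetical minimal client interface; real code would go
// through the Kubernetes API server with separate requests.
type machineAPI interface {
	DeleteMachine(ctx context.Context, name string) error
	GetReplicas(ctx context.Context, machineSet string) (int32, error)
	SetReplicas(ctx context.Context, machineSet string, replicas int32) error
}

// reapAnnotatedMachine sketches the "extra routine" idea: delete the
// annotated machine, then shrink the MachineSet. The two writes are separate
// API calls, so between them the replicas field can be changed by someone
// else, and the controller may reconcile against a stale count and delete a
// different machine.
func reapAnnotatedMachine(ctx context.Context, api machineAPI, machineSet, machineName string) error {
	if err := api.DeleteMachine(ctx, machineName); err != nil {
		return err
	}

	replicas, err := api.GetReplicas(ctx, machineSet)
	if err != nil {
		return err
	}

	// Race window: nothing ties this write to the delete above.
	return api.SetReplicas(ctx, machineSet, replicas-1)
}
```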
@maisem has been thinking about this as well, so adding him in here.
As per the discussion on Slack, the atomicity requirement seems unclear, and maybe we can move forward with a deletePolicy setup that is just:
(a) deletePolicy = simple|newest|oldest
(b) simple policy attends to a known “delete-me” annotation on the machine, if it is present
(c) policies newest|oldest preferentially remove machines based on their age (measured from some timestamp, probably ObjectMeta.CreationTimestamp)
An external oracle (if it exists) can annotate the machines based on external knowledge (if it chooses to) and thus affect the simple delete policy.
The current simple deletion policy [1] breaks machines into 3 classes:
- must delete: deletion timestamp is non-zero
- better delete: machines with a reported error
- could delete: remaining machines

Our current implementation of the cluster-api based cluster-autoscaler [2] targets the must delete class (which gets removed first) with the sigs.k8s.io/cluster-api-delete-machine machine annotation [3].
The process of scaling down then reduces to:
1) cluster autoscaler annotates all candidate machines for scaling down
2) cluster autoscaler reduces machine set number of replicas
3) machineset controller starts removing all machines with the sigs.k8s.io/cluster-api-delete-machine annotation before any others
So the process covers the b) case.
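For illustration, a rough Go sketch of that bucketed ordering; it is a simplified approximation under assumed type names, not the actual code in [1]:

```go
package machineset

import (
	"sort"
	"time"
)

// deleteMachineAnnotation marks machines an external actor (here, the
// cluster autoscaler) wants removed first; see [3].
const deleteMachineAnnotation = "sigs.k8s.io/cluster-api-delete-machine"

// machineInfo is a stand-in for the Machine fields the policy looks at.
type machineInfo struct {
	Name              string
	Annotations       map[string]string
	DeletionTimestamp *time.Time
	ErrorReason       *string
}

// deleteClass returns a coarse priority; lower values are deleted first.
func deleteClass(m machineInfo) int {
	_, marked := m.Annotations[deleteMachineAnnotation]
	switch {
	case m.DeletionTimestamp != nil, marked:
		return 0 // must delete
	case m.ErrorReason != nil:
		return 1 // better delete
	default:
		return 2 // could delete
	}
}

// sortedForDeletion orders machines so scaling down removes the must-delete
// class first, then better-delete, then the rest.
func sortedForDeletion(machines []machineInfo) []machineInfo {
	sort.SliceStable(machines, func(i, j int) bool {
		return deleteClass(machines[i]) < deleteClass(machines[j])
	})
	return machines
}
```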
[1] https://github.com/kubernetes-sigs/cluster-api/blob/master/pkg/controller/machineset/delete_policy.go#L34-L42
[2] https://github.com/openshift/kubernetes-autoscaler/tree/master/cluster-autoscaler/cloudprovider/clusterapi
[3] https://github.com/kubernetes-sigs/cluster-api/pull/513
@ntfrnzn would that be applicable for your use case?
I'm still not convinced this approach is safe to use with Cluster Autoscaler. Sure, it will work most of the time, but without an atomic way to delete a specific machine you have all sorts of races that may lead to an additional random machine being removed. At that point all guarantees given by CA about respecting things like PodDisruptionBudgets and various no-scale-down annotations are gone.
For scale-down, CA doesn't work on machine sets; it looks at individual nodes and checks which ones can be deleted. Ultimately what it does is remove a specific machine, NOT change the desired number of machines. If someone else removes the machine while it's being drained, the delete operation must fail.
In an extreme case, CA can even race with itself by scaling up the machine set while it's removing a VM from it.
All those races are the reason for requesting an atomic implementation in the initial issue description.
From my perspective, yes, this is what I pictured for _deletePolicy: simple_.
In the meantime, I wrote a few lines of code for the time-based policies: a sigmoid-like function, _deletePriority(creationTimestamp, now) -> (0.0, 100)_, that maps a machine into a deletion-priority range (float64). I hesitated over where to define the string _deletePolicy_ (in the MachineSet, MachineDeployment types), and haven't gotten back to it.
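For reference, a sketch of what such a mapping might look like; the 30-minute midpoint and 10-minute steepness here are arbitrary placeholder values:

```go
package machineset

import (
	"math"
	"time"
)

// deletePriority maps a machine's age onto (0.0, 100) with a sigmoid-like
// curve: brand-new machines score near 0, old machines saturate near 100.
// An "oldest" policy would delete the highest scores first; "newest" could
// simply use 100 minus this value.
func deletePriority(creationTimestamp, now time.Time) float64 {
	age := now.Sub(creationTimestamp)
	x := (age - 30*time.Minute).Minutes() / 10.0
	return 100.0 / (1.0 + math.Exp(-x))
}
```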
More importantly, I'm conflicted over the atomicity requirements. It complicates the types. One implementation would be to add a property like `[]string machineIdsMustNotExist` to the MachineSet, so that the cluster autoscaler can update the MachineSet by changing (a) the size and (b) the to-be-deleted list on a single object atomically; when the resync happens, the controller will remove the named machines and make the set the correct size.
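Roughly, the type change would look something like the following (a hypothetical sketch of the field, not a merged API change):

```go
package machineset

// MachineSetSpec here is a hypothetical extension, shown only to make the
// "single atomic update" idea concrete.
type MachineSetSpec struct {
	// Replicas is the desired number of machines.
	Replicas *int32 `json:"replicas,omitempty"`
	// MachineIDsMustNotExist names machines that must be gone once the set
	// is reconciled. Because it lives on the same object as Replicas, the
	// autoscaler can change both in one API write.
	MachineIDsMustNotExist []string `json:"machineIdsMustNotExist,omitempty"`
	// ... existing fields (selector, template, ...) elided ...
}
```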
/assign @maisem
@roberthbailey: GitHub didn't allow me to assign the following users: maisem.
Note that only kubernetes-sigs members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
Ping.
There have been multiple efforts and discussions on this topic. Let's please try to discuss this and settle on an approach in the next meeting.
Achieving atomicity seems really hard; I'm not sure there is really a way to do so.
I guess we could decide on the simple priority-based approach for now and unblock the autoscaler integration.
Ref:
@maisem @roberthbailey @ingvagabund @ntfrnzn @erstapples @MaciekPytel
This has priority/important-longterm. Do we think we'll be able to resolve any outstanding questions/issues in time for v1alpha1?
We didn't discuss this during today's meeting. I propose that this isn't required for v1alpha1 and we can defer it to a future milestone. WDYT @hardikdr @justinsb @detiber @rsdcastro?
Considering there is an open PR that could resolve this: https://github.com/kubernetes-sigs/cluster-api/pull/726/files I don't see a need to bump it unless we cannot get that PR merged in time.
There is also #513.
/assign