This issue is to track the ability to specify which machines should be removed from a MachineSet when it is scaling down. This is a requirement to support Cluster Autoscaler use cases.
The initial strawman of using annotations to mark which machines should be removed would work, but there are concerns about not having an atomic operation for this kind of change.
cc @krousey @mwielgus @krzysztof-jastrzebski @maciekpytel
Also see https://github.com/kubernetes-sigs/cluster-api/issues/45#issuecomment-380944237
I can see the following annotation-based approach as low-hanging fruit:
- priority:3 via annotation – the MachineDeployment adds this annotation to all machines at creation.
- priority:1 – set on a machine to mark it for removal ahead of the rest.

Though, for a long-term solution we might want to learn from Pod Priority and PriorityClass: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass
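A minimal Go sketch of how such annotation-based ordering could work when a MachineSet scales down; the annotation key, the fallback value of 3, and the helper names are assumptions for illustration, not an agreed-upon API:

```go
package machineset

import (
	"sort"
	"strconv"
)

// deletePriorityAnnotation is a hypothetical key, not an agreed-upon API.
const deletePriorityAnnotation = "cluster.k8s.io/delete-priority"

// machine is a stand-in for the cluster-api Machine type, reduced to what
// this sketch needs.
type machine struct {
	Name        string
	Annotations map[string]string
}

// priorityOf falls back to 3, the "normal" priority a MachineDeployment
// would stamp on every machine at creation time.
func priorityOf(m machine) int {
	if v, ok := m.Annotations[deletePriorityAnnotation]; ok {
		if p, err := strconv.Atoi(v); err == nil {
			return p
		}
	}
	return 3
}

// pickForDeletion returns the n lowest-priority machines, so machines the
// autoscaler re-annotated with priority 1 are picked before the rest.
func pickForDeletion(machines []machine, n int) []machine {
	sort.SliceStable(machines, func(i, j int) bool {
		return priorityOf(machines[i]) < priorityOf(machines[j])
	})
	if n > len(machines) {
		n = len(machines)
	}
	return machines[:n]
}
```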
We can have multiple delete strategies, analogous to the image pull policy, e.g.:
```yaml
---
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineSet
metadata:
  name: <NAME>
  namespace: <NAMESPACE>
  labels:
    ...
spec:
  replicas: 2
  deletePolicy: <POLICY>
  selector:
    ...
  template:
    ...
```
where deletePolicy selects the strategy to use. It can default to Random. The goal is to unblock the integration of the cluster API into the autoscaler in case we don't reach agreement in a reasonable time.
The code in question: https://github.com/kubernetes-sigs/cluster-api/blob/master/pkg/controller/machineset/controller.go#L325-L329
The autoscaler would only work with the lowest-priority policy though, right? Otherwise the autoscaler would try to scale down by removing a specific machine (by changing its priority), and when the machine set was resized a random or oldest machine would be deleted instead.
> The initial strawman of using annotations to mark which machines should be removed would work, but there are concerns about not having an atomic operation for this kind of change.

Is there any solution that provides an atomic operation?
The MachineSet controller should not change the number of replicas, since it's up to MachineSet consumers to do that. The controller's responsibility is to either create new machines or delete existing ones (at the same time updating the MachineSet's status). Based on that, there is no mechanism to protect a MachineSet from very rapidly alternating changes of the replicas field. In case a specific machine is annotated, the MachineSet controller still lists all potential machines to be deleted. It only changes the lottery into an archery contest: hit the bull's eye first. It's likely the following can happen:
So ok, we can have another routine that will listen for machines and, every time a machine gets annotated, remove it and decrease the replicas:
- replicas has not changed
- replicas changed: oh, a different machine gets deleted, since the list of machines to be deleted is not sorted chronologically.

The replicas logic and the deletion of a specific machine are tightly connected.
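To make the race concrete, here is a rough sketch of such a routine; the machineAPI interface and its methods are hypothetical stand-ins for separate API-server requests:

```go
package machineset

import "context"

// machineAPI is a hypothetical minimal client interface; real code would go
// through the Kubernetes API server with separate requests.
type machineAPI interface {
	DeleteMachine(ctx context.Context, name string) error
	GetReplicas(ctx context.Context, machineSet string) (int32, error)
	SetReplicas(ctx context.Context, machineSet string, replicas int32) error
}

// reapAnnotatedMachine sketches the "extra routine" idea: delete the
// annotated machine, then shrink the MachineSet. The two writes are separate
// API calls, so between them the replicas field can be changed by someone
// else, and the controller may reconcile against a stale count and delete a
// different machine.
func reapAnnotatedMachine(ctx context.Context, api machineAPI, machineSet, machineName string) error {
	if err := api.DeleteMachine(ctx, machineName); err != nil {
		return err
	}

	replicas, err := api.GetReplicas(ctx, machineSet)
	if err != nil {
		return err
	}

	// Race window: nothing ties this write to the delete above.
	return api.SetReplicas(ctx, machineSet, replicas-1)
}
```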
@maisem has been thinking about this as well, so adding him in here.
As per the discussion on Slack, the atomicity requirement seems unclear, and maybe we can move forward with a deletePolicy setup that is just:
(a) deletePolicy = simple|newest|oldest
(b) simple policy attends to a known “delete-me” annotation on the machine, if it is present
(c) policies newest|oldest preferentially remove machines based on their age (measured from some timestamp, probably ObjectMeta.CreationTimestamp)
An external oracle (if it exists) can annotate the machines based on external knowledge (if it chooses to) and thus affect the simple delete policy.
The current simple deletion policy [1] breaks machines into 3 classes:
- must delete: deletion timestamp is non-zero
- better delete: machines with a reported error
- could delete: remaining machines

Our current implementation of the cluster-api based cluster-autoscaler [2] targets the must delete class (which gets removed first) with the sigs.k8s.io/cluster-api-delete-machine machine annotation [3].
The process of scaling down then reduces to:
1) cluster autoscaler annotates all candidate machines for scaling down
2) cluster autoscaler reduces machine set number of replicas
3) machineset controller starts removing all machines with the sigs.k8s.io/cluster-api-delete-machine annotation before any others
So the process covers the b) case.
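For illustration, a rough Go sketch of that bucketed ordering; it is a simplified approximation under assumed type names, not the actual code in [1]:

```go
package machineset

import (
	"sort"
	"time"
)

// deleteMachineAnnotation marks machines an external actor (here, the
// cluster autoscaler) wants removed first; see [3].
const deleteMachineAnnotation = "sigs.k8s.io/cluster-api-delete-machine"

// machineInfo is a stand-in for the Machine fields the policy looks at.
type machineInfo struct {
	Name              string
	Annotations       map[string]string
	DeletionTimestamp *time.Time
	ErrorReason       *string
}

// deleteClass returns a coarse priority; lower values are deleted first.
func deleteClass(m machineInfo) int {
	_, marked := m.Annotations[deleteMachineAnnotation]
	switch {
	case m.DeletionTimestamp != nil, marked:
		return 0 // must delete
	case m.ErrorReason != nil:
		return 1 // better delete
	default:
		return 2 // could delete
	}
}

// sortedForDeletion orders machines so scaling down removes the must-delete
// class first, then better-delete, then the rest.
func sortedForDeletion(machines []machineInfo) []machineInfo {
	sort.SliceStable(machines, func(i, j int) bool {
		return deleteClass(machines[i]) < deleteClass(machines[j])
	})
	return machines
}
```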
[1] https://github.com/kubernetes-sigs/cluster-api/blob/master/pkg/controller/machineset/delete_policy.go#L34-L42
[2] https://github.com/openshift/kubernetes-autoscaler/tree/master/cluster-autoscaler/cloudprovider/clusterapi
[3] https://github.com/kubernetes-sigs/cluster-api/pull/513
@ntfrnzn would that be applicable for your use case?
I'm still not convinced this approach is safe to use with Cluster Autoscaler. Sure, it will work most of the time, but without an atomic way to delete a specific machine you have all sorts of races that may lead to an additional random machine being removed. At that point all guarantees given by CA about respecting things like PodDisruptionBudgets and various no-scale-down annotations are gone.
For scale-down, CA doesn't work on machine sets; it looks at individual nodes and checks which ones can be deleted. Ultimately what it does is remove a specific machine, NOT change the desired number of machines. If someone else removes the machine while it's being drained, the delete operation must fail.
In an extreme case, CA can even race with itself by scaling up the machine set while it's removing a VM from it.
All those races are the reason for requesting an atomic implementation in the initial issue description.
From my perspective, yes, this is what I pictured for _deletePolicy: simple_.
In the meantime, I wrote a few lines of code for the time-based policies: a sigmoid-like function, _deletePriority(creationTimestamp, now) -> (0.0, 100)_, that maps a machine into a deletion-priority range (float64). I hesitated over where to define the string _deletePolicy_ (in the MachineSet, MachineDeployment types), and haven't gotten back to it.
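For reference, a sketch of what such a mapping might look like; the 30-minute midpoint and 10-minute steepness here are arbitrary placeholder values:

```go
package machineset

import (
	"math"
	"time"
)

// deletePriority maps a machine's age onto (0.0, 100) with a sigmoid-like
// curve: brand-new machines score near 0, old machines saturate near 100.
// An "oldest" policy would delete the highest scores first; "newest" could
// simply use 100 minus this value.
func deletePriority(creationTimestamp, now time.Time) float64 {
	age := now.Sub(creationTimestamp)
	x := (age - 30*time.Minute).Minutes() / 10.0
	return 100.0 / (1.0 + math.Exp(-x))
}
```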
More importantly, I'm conflicted over the atomicity requirements. It complicates the types. One implementation would be to add a property like `[]string machineIdsMustNotExist` to the MachineSet, so that the cluster autoscaler can update the MachineSet by changing (a) the size and (b) the to-be-deleted list on a single object atomically; when the resync happens, the controller will remove the named machines and make the set the correct size.
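Roughly, the type change would look something like the following (a hypothetical sketch of the field, not a merged API change):

```go
package machineset

// MachineSetSpec here is a hypothetical extension, shown only to make the
// "single atomic update" idea concrete.
type MachineSetSpec struct {
	// Replicas is the desired number of machines.
	Replicas *int32 `json:"replicas,omitempty"`
	// MachineIDsMustNotExist names machines that must be gone once the set
	// is reconciled. Because it lives on the same object as Replicas, the
	// autoscaler can change both in one API write.
	MachineIDsMustNotExist []string `json:"machineIdsMustNotExist,omitempty"`
	// ... existing fields (selector, template, ...) elided ...
}
```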
/assign @maisem
@roberthbailey: GitHub didn't allow me to assign the following users: maisem.
Note that only kubernetes-sigs members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
Ping.
There have been multiple efforts and discussions on this topic. Let's please try to discuss this and settle on an approach in the next meeting.
Achieving atomicity seems really hard; I'm not sure there is really a way to do so.
I guess we could decide on the simple priority-based approach for now and unblock the autoscaler integration.
Ref:
@maisem @roberthbailey @ingvagabund @ntfrnzn @erstapples @MaciekPytel
This has priority/important-longterm. Do we think we'll be able to resolve any outstanding questions/issues in time for v1alpha1?
We didn't discuss this during today's meeting. I propose that this isn't required for v1alpha1 and we can defer it to a future milestone. WDYT @hardikdr @justinsb @detiber @rsdcastro?
Considering there is an open PR that could resolve this: https://github.com/kubernetes-sigs/cluster-api/pull/726/files I don't see a need to bump it unless we cannot get that PR merged in time.
There is also #513.
/assign