Hey guys,
we've been heavy users of cluster-autoscaler on AWS for quite some time, but we now want to improve scale-up times by over-provisioning our autoscaling groups by some margin (say +n or 10%).
Is there an easy way to enable this with the current set of features?
If not, I'm happy to contribute it if you can give me a pointer where it would fit best.
Cheers,
Thomas
Not exactly, but you can achieve something like this with PriorityClasses. The idea is to create "buffer" pods which do nothing except request resources, with a priority lower than your actual workload, but >0 (as CA won't add nodes for pods with a negative priority value). With priority and preemption enabled, the scheduler will evict those lower-priority pods to make space for higher-priority pods when necessary. An evicted buffer pod then becomes unschedulable and triggers a scale-up.
Depending on your use case, it may work even better than overprovisioning by a number of nodes, as you can reserve exactly the resources needed for your workload(s).
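To make the buffer-pod idea concrete, here is a minimal sketch that generates the two objects involved: a PriorityClass and a Deployment of placeholder pods that only request resources. Everything named here is an assumption for illustration, not something from this thread: the `overprovisioning-buffer` name, the `registry.k8s.io/pause:3.9` image, the `scheduling.k8s.io/v1` API version (older clusters may need a beta version), and the default priority, replica count, and request sizes. It also assumes your real workloads run in a PriorityClass with a higher value than the buffer pods.

```python
import json

BUFFER_NAME = "overprovisioning-buffer"  # hypothetical name, pick your own


def build_buffer_manifests(priority=1, replicas=2, cpu="1", memory="1Gi"):
    """Return [PriorityClass, Deployment] describing do-nothing buffer pods.

    priority: must stay below the priority of the real workload.
    replicas/cpu/memory: together define how much headroom is reserved.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",  # assumption: adjust for older clusters
        "kind": "PriorityClass",
        "metadata": {"name": BUFFER_NAME},
        "value": priority,
        "globalDefault": False,
        "description": "Placeholder pods that reserve spare capacity.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": BUFFER_NAME},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": BUFFER_NAME}},
            "template": {
                "metadata": {"labels": {"app": BUFFER_NAME}},
                "spec": {
                    "priorityClassName": BUFFER_NAME,
                    "containers": [
                        {
                            # The pause container sleeps forever; only its
                            # resource requests matter to the scheduler.
                            "name": "pause",
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    # Wrap both objects in a v1 List so the output is a single JSON document.
    manifest = {"apiVersion": "v1", "kind": "List", "items": build_buffer_manifests()}
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped into `kubectl apply -f -`. The amount of headroom is simply `replicas` times the per-pod requests, so sizing a buffer pod like your typical workload pod reserves room for that many extra pods.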
@krzysztof-jastrzebski opened a PR that adds a more detailed description of what @aleksandra-malinowska suggested to the docs: #742.
As to implementing overprovisioning directly in CA: it turned out to be pretty hard (even defining what "10%" means when you have multiple dimensions like cpu, mem, gpu, affinity/antiaffinity, volumes, etc. is hard). There have been at least 4 unsuccessful attempts so far (including my own).
Thanks for the documentation @krzysztof-jastrzebski @aleksandra-malinowska, we'll give it a bash 👍
I reckon an over-provisioning mechanism would really be helpful
We'd really make use of something like this.
We have a documented approach using pod priority and preemption: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler.
Pod priority is enabled by default in all supported k8s versions (i.e. 1.11 and later), so it should just work out of the box.
With priority classes, do all pods need a PriorityClass assigned, or do they default to 0?
They default to 0.
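As a rough model of that default (not the actual admission code), the sketch below resolves a pod's effective priority. The function name and the dict layout are made up for illustration. One nuance worth knowing: if the cluster has a PriorityClass marked `globalDefault: true`, pods without a `priorityClassName` get that class's value instead of 0.

```python
def effective_priority(priority_class_name, priority_classes):
    """Return the priority a pod would end up with.

    priority_class_name: the pod's priorityClassName, or None if unset.
    priority_classes: dict mapping class name to
        {"value": int, "globalDefault": bool} (hypothetical layout).
    """
    if priority_class_name is not None:
        # A pod naming a class that does not exist is rejected by the API
        # server; here that surfaces as a KeyError.
        return priority_classes[priority_class_name]["value"]
    for spec in priority_classes.values():
        if spec.get("globalDefault"):
            return spec["value"]
    return 0  # no class set and no global default


classes = {"overprovisioning-buffer": {"value": 1, "globalDefault": False}}
print(effective_priority(None, classes))
print(effective_priority("overprovisioning-buffer", classes))
```

So in a cluster with no global default, only the buffer pods and any workloads you want to rank explicitly need a class assigned.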