Autoscaler: Node over-provisioning

Created on 22 Mar 2018 · 10 Comments · Source: kubernetes/autoscaler

Hey guys,

we've been a heavy user of cluster-autoscaler in AWS for quite some time, but we now want to improve scale-up times by over-provisioning our autoscaling groups by some margin (say +n or 10%).

Is there an easy way to enable this with the current set of features?
If not, I'm happy to contribute it if you can give me a pointer where it would fit best.

Cheers,
Thomas

cluster-autoscaler lifecycle/rotten

thomasjungblut

👍3

All 10 comments

Not exactly, but you can achieve something like this with PriorityClasses. The idea is to create "buffer" pods that do nothing except request resources, with a priority lower than your actual workload but >0 (as CA won't add nodes for pods with a negative priority value). With priority and preemption enabled, the scheduler will evict those lower-priority pods to make space for higher-priority pods if necessary. The evicted buffer pods will then become unschedulable and trigger a scale-up.

Depending on your use case, it may work even better than overprovisioning by a number of nodes, as you can reserve exactly the resources needed for your workload(s).
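As a rough illustration of the buffer-pod idea above, here is a minimal Python sketch that generates the two manifests involved: a low-priority PriorityClass and a Deployment of placeholder pods. Everything specific in it is an assumption made for illustration and does not come from this thread: the names `overprovisioning` and `overprovisioning-buffer`, the pause image tag, the priority value of 1, and the `scheduling.k8s.io/v1` API version (older clusters used alpha/beta versions of that API). Your real workloads would need a PriorityClass with a higher value for preemption to kick in.

```python
import json


def buffer_manifests(replicas, cpu, memory, priority=1):
    """Build a PriorityClass and a Deployment of placeholder "buffer" pods.

    All names, the image and the default priority are illustrative
    assumptions, not values taken from the cluster-autoscaler docs.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",  # assumed; depends on cluster version
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},
        # Lower than the real workload's priority, but above zero.
        "value": priority,
        "globalDefault": False,
        "description": "Low priority for buffer pods that only reserve resources.",
    }
    labels = {"app": "overprovisioning-buffer"}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning-buffer"},
        "spec": {
            # Each replica reserves one "slot" of cpu/memory headroom.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [{
                        "name": "pause",
                        # The pause container does nothing; only its requests matter.
                        "image": "registry.k8s.io/pause:3.9",
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    items = buffer_manifests(replicas=3, cpu="1", memory="2Gi")
    # Emit a single List object in JSON, which kubectl can read from stdin.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The output could be piped to `kubectl apply -f -`. Sizing each replica's requests to match one typical workload pod is what lets you reserve "exactly the resources needed" rather than whole nodes.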

aleksandra-malinowska on 22 Mar 2018

👍1

@krzysztof-jastrzebski opened a PR adding to docs a more detailed description of what @aleksandra-malinowska suggested: #742.

As to implementing overprovisioning directly in CA: it turned out to be pretty hard (even defining what "10%" means when you have multiple dimensions like cpu, mem, gpu, affinity/antiaffinity, volumes, etc. is hard). There have been at least 4 unsuccessful attempts so far (including my own).
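To make the "what does 10% mean" problem concrete, here is a small Python sketch with made-up numbers. It is not how CA works; it only shows that the same 10% margin over current requests implies a different number of extra nodes depending on which resource you measure. The function name, node sizes and request totals are all hypothetical, and the sketch ignores bin-packing, affinity and volumes entirely, which only make the question harder.

```python
import math


def extra_nodes_for_margin(requested, node_capacity, current_nodes, margin=0.10):
    """Per resource, how many extra nodes a `margin` of headroom would need.

    Hypothetical helper: treats the cluster as one big pool per resource.
    """
    result = {}
    for resource, used in requested.items():
        target = used * (1 + margin)  # requests plus the desired headroom
        needed = math.ceil(target / node_capacity[resource])
        result[resource] = max(0, needed - current_nodes)
    return result


if __name__ == "__main__":
    # Assumed cluster: 10 nodes, each with 4 CPU cores and 16 GiB of memory.
    extra = extra_nodes_for_margin(
        requested={"cpu": 38, "memory": 90},
        node_capacity={"cpu": 4, "memory": 16},
        current_nodes=10,
    )
    print(extra)  # {'cpu': 1, 'memory': 0}
```

With these numbers, a 10% CPU margin calls for one more node while a 10% memory margin calls for none, so "10%" already has two different answers before any scheduling constraints are considered.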

MaciekPytel on 23 Mar 2018

Thanks for the documentation @krzysztof-jastrzebski @aleksandra-malinowska, we'll give it a bash 👍

thomasjungblut on 26 Mar 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 24 Jun 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot on 24 Jul 2018

I reckon an over-provisioning mechanism would really be helpful

sylr on 29 Nov 2018

👍3

We'd really make use of something like this.