Rancher: Autoscale of hosts and containers based on thresholds on the metrics

Created on 9 Mar 2016 · 68 Comments · Source: rancher/rancher

This enhancement request is based on:
https://forums.rancher.com/t/rancher-host-autoscaling/1098

The idea is that we could configure thresholds on the metrics to scale up or scale down. This could apply to containers, and also to hosts by using the cloud providers' APIs for host creation.
We would probably need to be able to configure a minimum number of containers and/or hosts when autoscaling is enabled, so that it doesn't kill everything under a very low workload. We would also need a maximum, so that memory leaks or internal errors don't make everything grow indefinitely.

Thinking about the current structure of Rancher, I imagine that host autoscaling could be configured at the environment level, with monitoring thresholds that apply to the host metrics. Note that autoscaling the hosts does not necessarily mean autoscaling the containers on those hosts. It should probably be combined with container autoscaling to move or scale containers around and balance the workload.
Container autoscaling could simply live at the service level, monitoring thresholds on the containers of the services that have it enabled.

The flow that I imagine is:

The user configures a min threshold and a max threshold for one or more metrics that they want to react to.
The user configures a scale activation group of thresholds.
When a metric passes its min or max threshold, a warning signal is activated.
When the metric stabilizes, the warning signal is deactivated.
When all the metrics in the scale activation group are in a warning state, Rancher reacts by scaling up or down according to the configuration and marks the scale activation group with a warning signal. Only one scale up or down should run at a time to maintain consistency.
When the scale operation is completed, the metrics' warning signals should be reviewed. If all the thresholds are still in a warning state, and no minimum or maximum scale has been reached, Rancher should scale again.
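
The flow above could be sketched roughly as follows in Python. This is only an illustration, not anything Rancher provides: the names `Threshold` and `ActivationGroup` are hypothetical, and it assumes the group scales only when every threshold warns in the same direction, one step at a time, clamped to the configured minimum and maximum.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Min/max bounds for one metric (hypothetical structure)."""
    metric: str
    low: float
    high: float

    def signal(self, value: float) -> int:
        """Return +1 above the max, -1 below the min, 0 when stable."""
        if value > self.high:
            return 1
        if value < self.low:
            return -1
        return 0


@dataclass
class ActivationGroup:
    """A group of thresholds that must all warn before scaling happens."""
    thresholds: list
    min_scale: int
    max_scale: int

    def decide(self, metrics: dict, current_scale: int) -> int:
        """Return the new scale, moving at most one step per evaluation."""
        signals = [t.signal(metrics[t.metric]) for t in self.thresholds]
        if signals and all(s == 1 for s in signals):
            return min(current_scale + 1, self.max_scale)
        if signals and all(s == -1 for s in signals):
            return max(current_scale - 1, self.min_scale)
        return current_scale  # mixed or stable signals: do nothing


group = ActivationGroup(
    thresholds=[Threshold("cpu", 20.0, 80.0), Threshold("mem", 30.0, 85.0)],
    min_scale=2,
    max_scale=5,
)
new_scale = group.decide({"cpu": 91.0, "mem": 90.0}, current_scale=3)
```

Calling `decide` again after each completed scale operation covers the last step of the flow: it keeps moving one step at a time until a signal clears or a bound is reached.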

I personally believe that this feature will be extremely useful for cloud environments. What do you think?

Best,

kind/enhancement status/autoclosed

LRancez

👍172 ❤24 🎉21

All 68 comments

Ilya-Kuchaev on 9 Mar 2016

gdurand-globallogic on 10 Mar 2016

:+1:

CrystalMethod on 10 Mar 2016

sbehrends on 10 Mar 2016

patodk on 10 Mar 2016

fernandoneto on 10 Mar 2016

dbones on 11 Mar 2016

Snake4life on 14 Mar 2016

:+1:

hwinkel on 15 Mar 2016

pabloval on 15 Mar 2016

👍1

dahendel on 16 Mar 2016

👍1

Thanks for opening this request, LRancez. Any ideas on how this could best be implemented in Rancher? Could this be a catalog entry, or should it be part of the core? Currently Rancher doesn't store any tokens or credentials related to cloud providers.

I did a small test with the websockets the Rancher API exposes, using Digital Ocean. It's pretty straightforward to create and destroy hosts using the API. I'll have another look at this.

boedy on 16 Mar 2016

Hi @boedy , thanks for the response.
It could well be a catalog entry instead of messing with the core. It could be an approach like the one used for Kubernetes. It depends on each implementation whether you want to scale automatically, manually, or not at all.

As for the credentials, I personally like and trust the way that containers handle internal connections and only expose the things that need to be exposed. So maybe simply adding a small db as part of the stack is more than enough for this.

LRancez on 17 Mar 2016

oggthemiffed on 5 Apr 2016

👍1

xlight on 5 Apr 2016

👍1

_Please use the Add Reactions feature to "+1". This way not everyone gets notified. Thanks :)_

webwurst on 5 Apr 2016

👍25 👎1

I just realized that I didn't mention it, but the ability to scale up and down on a predefined timed schedule, in parallel with the system metrics mechanism, would also be very useful.
If you want, I could generate a new issue for this schedule-based scale.
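
As a rough illustration of how a timed schedule could run in parallel with the metrics mechanism, here is a small Python sketch. The schedule windows, `DEFAULT_SCALE`, and the rule of taking the larger of the two scales are all assumptions made for the example.

```python
from datetime import time

# Hypothetical schedule: (start, end, desired scale) windows within one day.
SCHEDULE = [
    (time(8, 0), time(18, 0), 6),   # business hours
    (time(18, 0), time(23, 0), 3),  # evening
]
DEFAULT_SCALE = 1  # scale used outside every window


def scheduled_scale(now: time) -> int:
    """Return the scale requested by the schedule at the given time of day."""
    for start, end, scale in SCHEDULE:
        if start <= now < end:
            return scale
    return DEFAULT_SCALE


def effective_scale(now: time, metric_scale: int) -> int:
    """Combine both mechanisms by taking the larger requested scale."""
    return max(scheduled_scale(now), metric_scale)
```

Taking the maximum means the schedule acts as a floor and the metrics can still push the scale higher. Other combination rules are possible.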

LRancez on 13 Apr 2016

The best feature. \o/

And rebalance containers between nodes. It is important also.

As a helpful example, the closest thing I found to these features is:

Did you see this @alena1108 @vincent99?
Adding this functionality to the roadmap would be important. :)

frekele on 26 May 2016

👍4

marsanla on 19 Jul 2016

👎3 👍2

djaccedo on 19 Jul 2016

👎3 👍2

silviupanaite on 19 Jul 2016

👎3 👍2 😄1

jhelbling on 28 Jul 2016

👎4 👍2

Release v1.2.0-pre1 has experimental support for Kubernetes 1.3, so does that mean we get autoscaling out of the box if we spin up a Rancher Kubernetes environment? I'll have a go when I get some time and report back.

yunspace on 28 Jul 2016

Kubernetes Horizontal Pod Autoscaling doesn't work either due to #5578

yunspace on 28 Jul 2016

arkka on 5 Aug 2016

👎4 👍2

ItsReddi on 6 Aug 2016

👎4 👍2

cgarciae on 30 Aug 2016

👎4 👍2

abudargo on 3 Sep 2016

👍4 👎3

ysle on 19 Sep 2016

👎3 👍2

pstephan1187 on 23 Sep 2016

👎4 👍2

alincalinciuc on 7 Oct 2016

👎4 👍2

ArunParthiban10 on 31 Oct 2016

👎4 👍2 🎉1

Ashmawy on 31 Oct 2016

👎5 👍2

👍 👍

byronmansfield on 9 Nov 2016

👎5 👍2

dsaydon90 on 17 Nov 2016

👎7

jean-francois-labbe on 19 Nov 2016

👎7

bbraga on 21 Nov 2016

👎7

@yunspace correct me if I am wrong, but it looks like "Kubernetes Horizontal Pod Auto-scaling" would only add more replicas, not more physical CPUs. As I understand it, Kubernetes needs to tell Rancher to issue an API call to the host provider's API telling it to add or remove a node based on usage. So it would be nice to do the autoscaling with and without Kubernetes involved.

doanerock on 5 Jan 2017

👍1

If anyone is interested in hacking this together themselves, this could be a useful reference:
https://github.com/LLParse/rancher-autoscale

This was a PoC project to scale containers in a Rancher service based on CPU and/or memory usage, but I don't imagine it would be too difficult to adapt to scaling hosts.

This was built back on Rancher 1.1.X, I am unsure if it will work on 1.2. The basic workflow was:

Launch cAdvisor proxy
Launch a service to be scaled
Configure and launch the autoscale template

LLParse on 6 Jan 2017

👍4

While it doesn't solve native autoscaling support in Rancher, we've been using the SpotInst Rancher integration to autoscale Cattle v1.2 clusters in AWS.

For instance we have a policy to scale up the cluster by 1 if the average CPU is over 90% for more than 5 minutes. Another policy will scale down the cluster by 1 if CPU usage goes below 90% after 15 minutes.
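
Policies of this kind can be sketched in Python as "condition held for a sustained period" checks. This is not SpotInst's implementation. The class and function names are hypothetical, and the scale-down rule is read here as "CPU stays below 90% for more than 15 minutes", which is an assumption.

```python
class SustainedThreshold:
    """Fires once a condition has held continuously for more than duration_s."""

    def __init__(self, predicate, duration_s: float):
        self.predicate = predicate
        self.duration_s = duration_s
        self._since = None  # time at which the condition first became true

    def reset(self) -> None:
        self._since = None

    def update(self, now_s: float, value: float) -> bool:
        if not self.predicate(value):
            self._since = None
            return False
        if self._since is None:
            self._since = now_s
        return now_s - self._since > self.duration_s


def evaluate(now_s, avg_cpu, size, up, down, min_size=1):
    """Return the new cluster size after feeding one CPU sample to both policies."""
    fire_up = up.update(now_s, avg_cpu)
    fire_down = down.update(now_s, avg_cpu)
    if fire_up:
        up.reset()  # start a fresh window after acting
        return size + 1
    if fire_down:
        down.reset()
        return max(size - 1, min_size)
    return size


# Values taken from the policies described above: 90% CPU, 5 and 15 minutes.
scale_up = SustainedThreshold(lambda cpu: cpu > 90.0, 5 * 60)
scale_down = SustainedThreshold(lambda cpu: cpu < 90.0, 15 * 60)
```

Resetting a policy after it fires means the condition has to hold for a full window again before the next step, which keeps the cluster from growing on every sample.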

I've tested it out a few times and instances are added and removed to a cluster with Rancher handling it seamlessly. Unfortunately it only works with AWS but I think you can also set it up on Google Compute Engine. Bonus is you're running Spot Instances which are significantly cheaper than on-demand.

Here's the Rancher blog post on the integration, demo from the December 2015 online meetup, and the integration will be featured next week at the SF Bay Area Rancher meetup.

ecliptik on 6 Jan 2017

👍5

@ecliptik Could you please elaborate a bit more on how you are dealing with re-scheduling containers after adding/removing hosts? Rancher does it automatically for the same service with soft limits (i.e. it runs instances on hosts that don't have the service present yet), but the same doesn't happen for different services (i.e. you have services A and B running on host 1 and they would still be running on the same host after adding another one). Thanks!

xaka on 9 Jan 2017

@xaka we're mainly scaling up a cluster automatically if we load so many stacks that the CPU comes under pressure. This happens rarely, since we've tuned each cluster's instance type to accommodate the number of containers we're using. We don't scale out hundreds of the same container, so it's either one container per cluster instance (global) or just a single container that gets deployed somewhere onto the cluster. Right now it's just a much easier way of adding/removing hosts to a cluster automatically, and it saves on cost since they're all Spot Instances.

For an application that needs more dedicated resources, we create an application specific cluster, since Rancher labels don't have a way for a host to ONLY run containers with a specific label currently.

This way if SpotInst sees high CPU usage it auto-scales the application cluster, and since the stack is setup globally, it brings up more hosts/containers to support it. It's like a modified method of an AWS auto-scaling-group but instead of using an AMI for the application it uses containers instead.

There are probably much better ways to do this, and as we progress with our use of Rancher and how others are doing things I'm sure it will change.

ecliptik on 10 Jan 2017

How are we doing on this? Have we implemented some parameters for automated scaling of containers in Rancher Cattle?

borntorock on 31 Mar 2017

👍6

+1.
Would be neat if the metrics and thresholds can be configured as a property/label of some external object. Basically, provide rules to Rancher externally on how/when to scale ?

nittikkin on 24 May 2017

+1 has this been added yet?

gregkeys on 9 Jun 2017

👎2

VamshiChaitanya on 23 Jun 2017

👎6

devopsairtrumpet on 3 Jul 2017

👎6

I'm using Prometheus+Grafana, and I set up a webhook on Rancher to scale up my webserver; Grafana then sends the webhooks according to the CPU value.
It could be a Rancher service that includes a simple CPU / memory monitor and performs a curl to send the webhooks. It shouldn't be hard to set up.
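
A minimal sketch of such a monitor step in Python, assuming a scale-up webhook has already been created in Rancher. The URL below is a placeholder and the threshold is arbitrary. The `post` parameter exists only so the logic can be exercised without a network call.

```python
import urllib.request

# Placeholder only: use the URL that Rancher shows for your own webhook.
WEBHOOK_URL = "https://rancher.example.com/webhook-placeholder"


def send_webhook(url: str) -> int:
    """POST an empty body to the webhook URL and return the HTTP status."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def check_and_scale(cpu_percent: float, threshold: float, url: str,
                    post=send_webhook) -> bool:
    """Trigger the webhook when CPU exceeds the threshold; return whether it fired."""
    if cpu_percent <= threshold:
        return False
    post(url)
    return True
```

In a real service this would run in a loop with a cooldown between triggers, reading the CPU value from whatever monitor is available.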

hugodopradofernandes on 14 Jul 2017

👍8

hristovpln on 10 Aug 2017

👎6 👍1

lucaslg26 on 11 Dec 2017

👎3

rsdomingues on 2 Feb 2018

👎2

vingov on 9 Mar 2018

👎1

@hugodopradofernandes sounds good, need to try that!

+1 for integration of Prometheus+Grafana in a similar way out of the box!

vainkop on 9 Mar 2018

👍2

intrasenze-app on 9 Apr 2018

👎1

aandac on 9 Apr 2018

👎1

Also, I would mention that one of the most underrated metrics for autoscaling is response time, if you're dealing with a typical web app.

I don't care (that much!) if my cluster is running at 97% CPU usage if the response time is staying within healthy limits. Similarly, if the response time spikes every time the CPU is above 25%, then you will still want to scale up, even though 25% cpu wouldn't be the scale up point in most situations.

In some situations, it makes sense to scale based on what actually matters, the speed of your app, not some misc symptom like cpu.
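
To make that concrete, here is a small Python sketch of a scaling decision driven by response time instead of CPU. The 95th percentile, the target value, and the `slack` factor are assumptions chosen for illustration.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def scale_decision(latencies_ms, target_ms: float, slack: float = 0.5) -> int:
    """Return +1 to scale up, -1 to scale down, 0 to hold.

    Scale up when p95 latency exceeds the target, and scale down only
    when it is comfortably below it (under target_ms * slack).
    """
    p95 = percentile(latencies_ms, 95)
    if p95 > target_ms:
        return 1
    if p95 < target_ms * slack:
        return -1
    return 0
```

CPU never appears in the decision: a cluster at 97% CPU holds steady as long as p95 stays under the target, while one at 25% CPU still scales up if latency spikes.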

Just 2c

benyanke on 13 Apr 2018

👍5

cwrau on 14 Apr 2018

👎1

sst1xx on 25 Apr 2018

👎1

paivaric on 29 Apr 2018

👎1

OdinLin on 29 Apr 2018

👎1

mitchellmaler on 18 May 2018

👎1

@benyanke

Prometheus metrics + webhook trigger based on them should work. But would be nice to have that functionality out of the box! Webhook for pods autoscaling + webhook for ec2 instances / hosts/ worker nodes autoscaling.

vainkop on 18 May 2018

👍1

chrisingenhaag on 30 May 2018

👎1

Well, the webhook seems like a nice idea and could be a valid workaround for this missing feature.
It would be nice if webhooks were fixed, though. There are some issues around them, like overriding configurations from other services, upgrade processes that don't finish, and drain timeouts that are lost while upgrading via webhook.

Well @rancherdev, this issue has been open for 2 years and I'm notified about a +1 at least twice a week.
What should the community do to get this feature or a working workaround? I mean, there are so many people who seem to need this.

ItsReddi on 7 Jun 2018

This is obviously never going to happen; it's been ignored for 2 years. It's sad, but this is why Rancher is falling out of people's comparison lists when looking at container platforms. It was nice knowing you, Rancher.

michael-henderson on 28 Jun 2018

@michael-henderson
AFAIK this is not entirely true. Rancher 2.x is focusing on kubernetes which already has auto scaling built-in.
From that standpoint I would understand not prioritizing this feature atm, even though I'm as frustrated as you are, since I'm still running on Rancher 1.6.
I haven't fully dived into it, but I assume this should work with 2.x. Maybe I'm wrong?

empinator on 3 Jul 2018

Since there's a lot of people wanting this, I built a little side project to act as the missing autoscale functionality for Rancher v1.6: https://autoscale.co

I had already implemented autoscaling with Rancher for my own project, Codemason, and figured I should spin it off as a separate service for anyone else who might need it. Hope it helps!