Rancher: Autoscale of hosts and containers based on thresholds on the metrics

Created on 9 Mar 2016 · 68 Comments · Source: rancher/rancher

This enhancement request is based on:
https://forums.rancher.com/t/rancher-host-autoscaling/1098

The idea is that we could configure some thresholds on the metrics to scale up or scale down. This can be applied to the containers, and also to the hosts using the APIs for host creation in the clouds.
We would probably need to be able to configure a minimum number of containers and/or hosts when autoscaling is enabled, so it doesn't kill everything under a very low workload. Also a maximum, so memory leaks or internal errors don't make everything grow indefinitely.

Thinking about the current structure of Rancher, I imagine that host autoscaling could be configured at the environment level, with monitoring thresholds that apply to the host metrics. Note that autoscaling the hosts does not necessarily mean autoscaling the containers on them. This should probably be combined with container autoscaling to move or scale the containers around to spread the workload.
Container autoscaling could simply live at the service level, monitoring thresholds on the containers of the services that have it enabled.

The flow that I imagine is:

  • The user configures a min threshold and a max threshold for one or more metrics that they want to react to.
  • The user configures a scale activation group of thresholds.
  • When a metric passes its min or max threshold, a warning signal is activated for that metric.
  • When the metric stabilizes, the warning signal is deactivated.
  • When all the metrics in the scale activation group are in a warning state, Rancher reacts by scaling up or down according to the configuration and marks the scale activation group with a warning signal. Only one scale up or down should run at a time to maintain consistency.
  • When the scale operation is completed, the metrics' warning signals should be reviewed. If all the thresholds are still in a warning state, and no minimum or maximum scale has been reached, Rancher should scale again.
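The flow above can be sketched in a few lines of Python. This is only an illustration of the proposal, not anything Rancher provides: the class names `Threshold` and `ScaleActivationGroup` are made up, and it assumes that "all metrics in a warning state" means all of them warn in the same direction, with one step of scale per decision.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Threshold:
    """Min/max bounds for one metric (hypothetical name, for illustration)."""
    metric: str
    low: float
    high: float

    def signal(self, value: float) -> Optional[str]:
        """Return 'up' above the max, 'down' below the min, None when stable."""
        if value > self.high:
            return "up"
        if value < self.low:
            return "down"
        return None


@dataclass
class ScaleActivationGroup:
    """Scales only when every threshold in the group warns in the same direction."""
    thresholds: List[Threshold]
    min_scale: int
    max_scale: int
    scale: int
    scaling_in_progress: bool = False

    def decide(self, metrics: Dict[str, float]) -> int:
        """Return the desired scale for the given metric readings."""
        if self.scaling_in_progress:
            return self.scale  # only one scale operation at a time
        signals = {t.signal(metrics[t.metric]) for t in self.thresholds}
        if signals == {"up"} and self.scale < self.max_scale:
            return self.scale + 1
        if signals == {"down"} and self.scale > self.min_scale:
            return self.scale - 1
        return self.scale  # mixed or stable signals, or a limit was reached
```

Calling `decide` again after each completed scale operation gives the "review and scale again" step from the last bullet.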

I personally believe that this feature will be extremely useful for cloud environments. What do you think?

Best,

kind/enhancement status/autoclosed

All 68 comments

+1

+1

:+1:

+1

+1

+1

+1

+1

:+1:

+1

+1

Thanks for opening this request, LRancez. Any ideas on how this could best be implemented in Rancher? Could this be a catalog entry, or should it be part of the core? Currently Rancher doesn't store any tokens or credentials related to cloud providers.

I did a small test with the websockets the Rancher API exposes, using Digital Ocean. It's pretty straightforward to create and destroy hosts using the API. I'll have another look at this.

Hi @boedy , thanks for the response.
It could well be a catalog entry instead of messing with the core. It could be an approach like the one used for Kubernetes. It depends on each implementation whether you want to scale automatically, manually, or not at all.

As for the credentials, I personally like and trust the way that containers handle internal connections and only expose the things that need to be exposed. So maybe simply adding a small DB as part of the stack is more than enough for this.

+1

+1

_Please use the Add Reactions feature to "+1". This way not everyone gets notified. Thanks :)_

I just realized that I didn't mention it, but the ability to scale up and down on a predefined time schedule, in parallel with the system metrics mechanism, would also be very useful.
If you want, I could open a new issue for this schedule-based scaling.
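A minimal sketch of what schedule-based scaling next to metric-based scaling might look like. Everything here is hypothetical: the function names, the window format, and in particular the choice to treat the schedule as a floor that the metric-driven scale can exceed.

```python
from datetime import time
from typing import List, Tuple

# (start, end, scale) windows; the times and scales used are made-up examples.
Schedule = List[Tuple[time, time, int]]


def scheduled_scale(now: time, schedule: Schedule, default: int) -> int:
    """Return the scale of the first window containing `now`, else `default`."""
    for start, end, scale in schedule:
        if start <= end:
            inside = start <= now < end
        else:  # window wraps past midnight, e.g. 22:00 to 02:00
            inside = now >= start or now < end
        if inside:
            return scale
    return default


def desired_scale(now: time, schedule: Schedule, default: int, metric_scale: int) -> int:
    """Assumed policy: the schedule sets a floor, metrics may push above it."""
    return max(scheduled_scale(now, schedule, default), metric_scale)
```

For example, a `(time(9, 0), time(18, 0), 5)` window would keep at least 5 containers running during office hours while still letting the metric thresholds scale higher.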

The best feature. \o/

And rebalance containers between nodes. It is important also.

A help example, the closest I found to these features is:

Did you see this @alena1108 @vincent99?
Adding this functionality to the roadmap would be important. :)

+1

+1

+1

+1

Release v1.2.0-pre1 has experimental support for Kubernetes 1.3, so does that mean we get autoscaling out of the box if we spin up a Rancher Kubernetes environment? I'll have a go when I get some time and report back.

Kubernetes Horizontal Pod Autoscaling doesn't work either due to #5578

+1

+1

+1

+1

+1

+1

+1

+1

+1

👍 👍

+1

+1

+1

@yunspace correct me if I am wrong, but it looks like "Kubernetes Horizontal Pod Auto-scaling" would only add more replicas, not more physical CPUs. As I understand it, Kubernetes needs to tell Rancher to issue an API call to the host provider's API telling it to add or remove a node based on usage. So it would be nice to do the autoscaling with and without Kubernetes involved.

If anyone is interested in hacking this together themselves, this could be a useful reference:
https://github.com/LLParse/rancher-autoscale

This was a PoC project to scale containers in a Rancher service based on CPU and/or memory usage, but I don't imagine it would be too difficult to adapt to scaling hosts.

This was built back on Rancher 1.1.X, I am unsure if it will work on 1.2. The basic workflow was:

  1. Launch cAdvisor proxy
  2. Launch a service to be scaled
  3. Configure and launch the autoscale template

While it doesn't solve native autoscaling support in Rancher, we've been using the SpotInst Rancher integration to autoscale Cattle v1.2 clusters in AWS.

For instance we have a policy to scale up the cluster by 1 if the average CPU is over 90% for more than 5 minutes. Another policy will scale down the cluster by 1 if CPU usage goes below 90% after 15 minutes.
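The "over 90% for more than 5 minutes" style of policy can be sketched as below. This is not SpotInst's implementation, just an illustration of a sustained-threshold check; the class name `SustainedThreshold` and the use of plain second-based timestamps are assumptions.

```python
class SustainedThreshold:
    """Fires once a value has stayed above `limit` for more than `window_seconds`."""

    def __init__(self, limit: float, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.breach_started = None  # timestamp of the first sample in the current breach

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample; return True when the policy should trigger."""
        if value <= self.limit:
            self.breach_started = None  # breach ended, start over
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started > self.window_seconds


# Example: average CPU above 90% for more than 5 minutes (300 seconds).
scale_up_policy = SustainedThreshold(limit=90.0, window_seconds=300)
```

A real policy would also need a cooldown after each trigger so that one long breach adds a single host rather than one per sample; that part is left out here.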

I've tested it out a few times and instances are added and removed to a cluster with Rancher handling it seamlessly. Unfortunately it only works with AWS but I think you can also set it up on Google Compute Engine. Bonus is you're running Spot Instances which are significantly cheaper than on-demand.

Here's the Rancher blog post on the integration, demo from the December 2015 online meetup, and the integration will be featured next week at the SF Bay Area Rancher meetup.

@ecliptik Could you please elaborate a bit more on how you are dealing with re-scheduling containers after adding/removing hosts? Rancher does it automatically for the same service with soft limits (i.e. run instances on hosts that don't have the service present yet), but the same doesn't happen for different services (i.e. you have services A and B running on host 1 and they would still be running on the same host after adding another one). Thanks!

@xaka we're mainly scaling up a cluster automatically if we load up so many stacks that the CPU comes under pressure. This happens rarely since we've tuned each cluster instance type to accommodate the number of containers we're using. We don't scale out hundreds of the same container, so it's either one container per cluster instance (global) or just a single container that gets deployed somewhere onto the cluster. Right now it's just a much easier way of adding/removing hosts to a cluster automatically, and it saves on cost since they're all Spot Instances.

For an application that needs more dedicated resources, we create an application specific cluster, since Rancher labels don't have a way for a host to ONLY run containers with a specific label currently.

This way if SpotInst sees high CPU usage it auto-scales the application cluster, and since the stack is setup globally, it brings up more hosts/containers to support it. It's like a modified method of an AWS auto-scaling-group but instead of using an AMI for the application it uses containers instead.

There are probably much better ways to do this, and as we progress with our use of Rancher and how others are doing things I'm sure it will change.

How are we doing on this? Have we implemented some parameters for automated scaling of containers in Rancher Cattle?

+1.
Would be neat if the metrics and thresholds can be configured as a property/label of some external object. Basically, provide rules to Rancher externally on how/when to scale ?

+1 has this been added yet?

+1

+1

I'm using Prometheus+Grafana, and I set up a webhook on Rancher to scale up my webserver; Grafana then sends the webhooks according to the CPU value.
It could be a Rancher service that includes a simple CPU / memory monitor and performs a curl to send the webhooks. It shouldn't be hard to set up.
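A rough sketch of such a monitor service, using only the Python standard library. It assumes the webhook is triggered by an HTTP POST with an empty body; the URL is a placeholder to be replaced with the trigger URL generated for your own webhook, and the function names are hypothetical.

```python
import urllib.request
from typing import Callable

# Placeholder: substitute the trigger URL generated for your own webhook.
WEBHOOK_URL = "https://rancher.example.com/hypothetical-scale-up-webhook"


def build_trigger(url: str) -> urllib.request.Request:
    """Build an empty-body POST request for the webhook URL."""
    return urllib.request.Request(url, data=b"", method="POST")


def post(request: urllib.request.Request) -> None:
    """Send the request and discard the response body."""
    with urllib.request.urlopen(request, timeout=10):
        pass


def check_and_trigger(
    cpu_percent: float,
    threshold: float,
    url: str = WEBHOOK_URL,
    send: Callable[[urllib.request.Request], None] = post,
) -> bool:
    """POST to the webhook when CPU exceeds the threshold; return True if sent."""
    if cpu_percent <= threshold:
        return False
    send(build_trigger(url))
    return True
```

The `send` parameter is only there so the check can be exercised without touching the network; in a service you would call `check_and_trigger(current_cpu, 80)` from a polling loop with whatever CPU reading your monitoring provides.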

+1

+1

+1

+1

@hugodopradofernandes sounds good, need to try that!

+1 for integration of Prometheus+Grafana in a similar way out of the box!

+1

+1

1+

Also, I would mention that one of the most underrated metrics for autoscaling is response time, if you're dealing with a typical web app.

I don't care (that much!) if my cluster is running at 97% CPU usage if the response time is staying within healthy limits. Similarly, if the response time spikes every time the CPU is above 25%, then you will still want to scale up, even though 25% cpu wouldn't be the scale up point in most situations.

In some situations, it makes sense to scale based on what actually matters, the speed of your app, not some misc symptom like cpu.
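To make the idea concrete, here is a small sketch of a scale-up decision driven by tail response time instead of CPU. The nearest-rank percentile, the 95th-percentile default, and the function names are all illustrative choices, not something from Rancher or any monitoring tool.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_scale_up(latencies_ms: Sequence[float], target_ms: float, pct: float = 95) -> bool:
    """Scale on tail response time, regardless of what CPU usage looks like."""
    return percentile(latencies_ms, pct) > target_ms
```

With this rule a cluster at 97% CPU is left alone as long as the tail latency stays under the target, and a cluster at 25% CPU still scales up if response times spike.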

Just 2c

+1

+1

+1

+1

+1

@benyanke

Prometheus metrics + webhook trigger based on them should work. But would be nice to have that functionality out of the box! Webhook for pods autoscaling + webhook for ec2 instances / hosts/ worker nodes autoscaling.

+1

Well, since webhooks seem like a nice idea and could be a valid workaround for this missing feature, it would be nice if webhooks were fixed. There are some issues around them, like overriding configurations from other services, not finishing upgrade processes, and drain timeouts that are lost while upgrading via webhook.

Well @rancherdev, this issue has been open for 2 years and I'm notified about a +1 at least twice a week.
What should the community do to get this feature or a working workaround? I mean, there are so many people who seem to need this.

This is obviously never going to happen; it's been ignored for 2 years. It's sad, but this is why Rancher is falling out of people's comparison lists when looking at container platforms. It was nice knowing you, Rancher.

@michael-henderson
AFAIK this is not entirely true. Rancher 2.x is focusing on Kubernetes, which already has autoscaling built in.
From that standpoint I can understand not prioritizing this feature at the moment, even though I'm as frustrated as you are since I'm still running on Rancher 1.6.
I haven't fully dived into it, but I assume this should work with 2.x. Maybe I'm wrong?

Since there's a lot of people wanting this, I built a little side project to act as the missing autoscale functionality for Rancher v1.6: https://autoscale.co

I had already implemented autoscaling with Rancher for my own project, Codemason, and figured I should spin it off as a separate service for anyone else who might need it. Hope it helps!

With the release of Rancher 2.0, development on v1.6 is limited to critical bug fixes and security patches.
