Keda: cooldownPeriod parameter not working as expected

Created on 9 Apr 2020  ·  18 Comments  ·  Source: kedacore/keda

Hello! I am using KEDA with the RabbitMQ scaler, but cooldownPeriod does not seem to work as expected. Even though I configure cooldownPeriod: 30, pods are reduced to minReplicaCount only after the default 300 s.

Expected Behavior

Pods should be reduced to minReplicaCount after cooldownPeriod

Actual Behavior

Pods are reduced to minReplicaCount after default 300s

cooldownPeriod seems to be the only parameter that doesn't work. pollingInterval, maxReplicaCount, and minReplicaCount work correctly.

Specifications

  • KEDA Version: Master branch, Commit ef7e4e9e1753e7038d65a0afd30146293139ec15
  • Kubernetes Version: 1.15.5 (Docker Desktop on macOS) and 1.15.10 (Azure AKS)
  • Scaler(s): RabbitMQ
Labels: bug, scaler-rabbit-mq

All 18 comments

Sorry to hear, we'll get this fixed.

Just for our local repro, would you mind pasting your ScaledObject config please?

Hello Tom,

this is my ScaledObject config:

apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: compute-scaledobject
  labels:
    deploymentName: compute-deployment
spec:
  scaleTargetRef:
    deploymentName: compute-deployment
  pollingInterval: 1   # Optional. Default: 30 seconds
  cooldownPeriod: 20   # Optional. Default: 300 seconds
  maxReplicaCount: 10  # Optional. Default: 100
  minReplicaCount: 1
  triggers:
  - type: rabbitmq
    metadata:
      queueName: compute
      host: RabbitMqHost
      queueLength: '5'

These are the logs from keda-operator pod:

{"level":"info","ts":1586506360.6137688,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506360.6139557,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506360.6140635,"logger":"controller_scaledobject","msg":"Creating a new HPA","Request.Namespace":"default","Request.Name":"compute-scaledobject","HPA.Namespace":"default","HPA.Name":"keda-hpa-compute-deployment"}
{"level":"info","ts":1586506360.6502018,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506360.6503305,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506376.1512656,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506376.1515474,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506498.892882,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506498.8929482,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506513.9334903,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506513.9336584,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506514.0644085,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506514.064469,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506529.3715875,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506529.3757153,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}

The last entry above is the final log line after the last scale-up of the compute component.

The logs below relate to the scale-down to 1 replica. As you can see, this happens roughly 300 s later (1586506804.0495086 - 1586506529.3757153 ≈ 275 s).

{"level":"info","ts":1586506804.0495086,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506804.0499203,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506819.0569682,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"compute-scaledobject"}
{"level":"info","ts":1586506819.0571373,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"compute-scaledobject"}

Thanks!

@marcocello could you please paste the logs here with the debug log level enabled?
https://github.com/kedacore/keda#keda-operator-logging

@zroubalik, thanks for your help.

Attached you can find logs for keda-operator and keda-metric-apiserver.

Here is the list of events in the keda-operator logs that I think could be useful for you:

  • at "ts":1586520723.1727092 ScaledObject was able to connect to the rabbitmq instance
  • at "ts":1586520906.380907 the scaling started
  • at "ts":1586520938.6841013 the last "scaling up" event
  • at "ts":1586521214.3048892 the scaling down event
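For reference, the gaps between the events listed above can be worked out directly from the epoch timestamps. The short Python sketch below uses only the four values quoted in this comment; the labels and variable names are illustrative.

```python
# Epoch timestamps (seconds) copied from the event list above.
events = [
    ("connected to RabbitMQ", 1586520723.1727092),
    ("scaling started", 1586520906.380907),
    ("last scale-up", 1586520938.6841013),
    ("scale-down", 1586521214.3048892),
]

# Elapsed seconds between each pair of consecutive events.
gaps = [
    (prev_name, name, ts - prev_ts)
    for (prev_name, prev_ts), (name, ts) in zip(events, events[1:])
]

for prev_name, name, seconds in gaps:
    print(f"{prev_name} -> {name}: {seconds:.1f} s")
```

The last gap, between the final scale-up and the scale-down, comes out at roughly 276 s, which is far closer to the 300 s default than to the configured cooldownPeriod of 20 s.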

keda-operator.log

keda-metrics-apiserver.log

I tried the sample at https://github.com/kedacore/sample-go-rabbitmq and there cooldownPeriod works as expected.

@marcocello if you set minReplicaCount = 0, the cooldown period on your deployment will work, because that case is handled by KEDA. The problem is that 1 <-> N scaling is handled by the Kubernetes HPA, and there are very few ways to influence HPA scaling from KEDA; the HPA itself has no option comparable to cooldownPeriod. You can modify a similar setting at the cluster level, but that would affect all HPAs in your cluster.

@tomkerkhove we should document that probably, wdyt?

Yes, we should do that indeed. So cooldown is mainly for 0 <-> 1, but shouldn't we manage the HPA cooldown then as well as part of KEDA?

Many thanks @zroubalik. Now it works!

Can I still use KEDA's functionality with minReplicaCount != 1, or does this not work well with HPA?

@marcocello you can use all the features; it depends on whether you want to scale your deployments from 0 (i.e. minReplicaCount = 0), from 1, or from another number (minReplicaCount = X).

KEDA manages 0 <-> 1 scaling and HPA manages 1 <-> N scaling, but you don't have to care about the underlying HPA, feeding the metrics, or any other mechanism; it is all done for you by KEDA.

You can use this ugly hack if you want to scale your deployment from 1 but still want to modify the cooldown parameter: run two deployments of the same app, the first one statically set to 1 replica and the second one scaled by KEDA with minReplicaCount = 0. But I understand that's not optimal 🤷‍♂️

Thanks for raising this issue, I will look into how we can improve the cooldown scenario!
(@tomkerkhove I'll investigate whether we can safely effect this)

@marcocello I tried to reproduce the issue and HPA scaled down the Deployment almost immediately. I can't reproduce the behavior you describe; I haven't seen that long timeout. So there's nothing we can do about it from the KEDA side.

If your issues were solved by minReplicaCount = 0, let me know and we can close this issue.

Hello @zroubalik, to wrap up:

  • with minReplicaCount = 0, cooldownPeriod works
  • with minReplicaCount != 0, scale down is managed by HPA

Please close the issue. Thanks to all for the help!

@zroubalik Hi, I would like to discuss the rationale behind this issue, as this seems like a strange limitation to me. If I understand your comments correctly:

  • KEDA manages 0 <-> 1 and Kubernetes HPA manages 1 <-> N scaling
  • so the cooldownPeriod option only works when we set minReplicaCount: 0

But in my experience, if I set up KEDA with, for example:

  • minReplicaCount: 0
  • maxReplicaCount: 10
  • cooldownPeriod: 20

then I see that KEDA is able to scale down from 10 -> 0 right after the 20 s period, whereas from your comments I expected it to first go 10 -> 1 after 300 s (the default value of the HPA --horizontal-pod-autoscaler-downscale-stabilization flag) and then 1 -> 0 after 20 s (the cooldownPeriod value).

So my question is: how come KEDA is able to downscale from N to 0 (first N -> 1, then 1 -> 0) using the cooldownPeriod, but is not able to simply downscale from N to 1 using the same cooldownPeriod? I guess there may be technical constraints in Kubernetes that limit KEDA, but here it seems that KEDA can do the more complex thing (downscale from N to 0 using cooldownPeriod) but not the simpler one (downscale from N to 1 using cooldownPeriod). Is there something I am missing?

Hi @RemiGaudin

> KEDA manages 0 <-> 1 and Kubernetes HPA manages 1 <-> N scaling

That's correct, and that's exactly the answer to your next question :)

> So my question is how come that KEDA is able to downscale from N to 0 (first N -> 1, then 1 -> 0) using the cooldownPeriod but is not able to simply downscale from N to 1 using the same cooldownPeriod?

The cooldownPeriod setting is taken into account only when scaling is handled by the KEDA operator, so:

  • If we set minReplicaCount = 1, KEDA forwards metrics directly to the Metrics Server, where they are consumed by the HPA, which handles scaling up and down. KEDA doesn't handle the scaling (it only processes the metrics), so we cannot affect the cooldown.

  • If we set minReplicaCount = 0, KEDA checks the metrics and the "activity" on the referenced trigger (e.g. Kafka). If there is some load (i.e. the trigger is active), KEDA scales the deployment from 0 to 1 and then the HPA takes over. KEDA processes and forwards metrics the same way as described above, and in addition it keeps checking the "activity" on the referenced trigger. So, for example, if the deployment is scaled to 10 replicas and the trigger becomes inactive (e.g. the Kafka topic is empty), KEDA starts counting the cooldownPeriod. If the trigger is still inactive after the cooldownPeriod passes, KEDA forces the deployment to be scaled to 0 no matter how many replicas there currently are (that's why you don't see the 300 s + 20 s period). The cluster-wide --horizontal-pod-autoscaler-downscale-stabilization setting applies only to scaling handled by the HPA (e.g. scaling from 10 -> 9 etc.).
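The two cases above can be summarised in a small Python sketch. This is a simplified model of the behaviour described in this comment, not KEDA's actual code; the class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaledObjectSpec:
    min_replica_count: int
    cooldown_period: float  # seconds


def operator_forced_replicas(
    spec: ScaledObjectSpec,
    trigger_active: bool,
    seconds_inactive: float,
    current_replicas: int,
) -> Optional[int]:
    """Replica count the operator would force, or None to leave scaling to the HPA."""
    if spec.min_replica_count != 0:
        # The operator only forwards metrics; the HPA owns every scaling decision.
        return None
    if trigger_active:
        # Activation: 0 -> 1, after which the HPA takes over.
        return 1 if current_replicas == 0 else None
    if current_replicas > 0 and seconds_inactive >= spec.cooldown_period:
        # Trigger stayed inactive for the whole cooldown: go straight to 0.
        return 0
    return None


# 10 replicas, trigger inactive for 25 s, cooldownPeriod of 20 s.
print(operator_forced_replicas(ScaledObjectSpec(0, 20), False, 25, 10))  # 0
print(operator_forced_replicas(ScaledObjectSpec(1, 20), False, 25, 10))  # None
```

In this model the cooldown only ever appears on the minReplicaCount = 0 path, which matches the observation that cooldownPeriod has no effect when minReplicaCount is 1.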

FYI, for upcoming KEDA v2 release, we have added a possibility to tweak the HPA scaling behavior from ScaledObject a little bit, by implementing the standard config options: https://github.com/kedacore/keda/pull/805

@zroubalik Thanks for your explanation, but it leads to another question: why does KEDA check the trigger "activity" only if minReplicaCount = 0 and not when minReplicaCount = 1?

We could imagine KEDA checking the trigger activity whatever the minReplicaCount value, so that when the trigger is still inactive after the cooldownPeriod, KEDA scales the deployment down to 1 (instead of zero). The cooldownPeriod parameter would then work in every scenario, which would be more straightforward and intuitive than tweaking the HPA options. Is there a technical constraint that prevents that?

@RemiGaudin I am afraid that HPA would scale the deployment back to its target replicas in that case (when the deployment is scaled to 0, HPA ignores it, so we can do that).
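A toy Python model of that concern is sketched below. It assumes that the HPA acts on the highest replica recommendation it has seen within its downscale stabilization window, and that it leaves a target at zero replicas alone. The function and its parameters are hypothetical and only illustrate the argument; they are not the HPA controller's real implementation.

```python
from typing import Sequence


def hpa_stabilized_replicas(
    current_replicas: int,
    recent_recommendations: Sequence[int],
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replica count a simplified HPA would settle on for the target."""
    if current_replicas == 0:
        # Assumption: a target scaled to zero is skipped by the HPA.
        return 0
    # Assumption: the highest recommendation inside the window wins.
    stabilized = max(recent_recommendations, default=current_replicas)
    return max(min_replicas, min(max_replicas, stabilized))


# KEDA hypothetically forces 10 -> 1 while earlier recommendations of 10
# are still inside the stabilization window: the HPA goes back to 10.
print(hpa_stabilized_replicas(1, [10, 10, 1], min_replicas=1, max_replicas=10))  # 10

# The same history with the deployment forced to 0: the HPA does nothing.
print(hpa_stabilized_replicas(0, [10, 10, 1], min_replicas=1, max_replicas=10))  # 0
```

Under these assumptions, forcing the deployment to 1 would simply be undone on the next HPA sync, while forcing it to 0 sticks.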

@zroubalik OK, I get it now. The piece I was missing is that the HPA refrains from scaling the deployment back up only when the replica count is 0. Thanks for your explanations.

Glad to help!
