Keda: Prometheus: Scaling to zero not working in KEDA 1.4.0

Created on 22 Apr 2020 · 13 Comments · Source: kedacore/keda

Scaling to and from zero does not work with the Prometheus scaler after upgrading KEDA from 1.3.0 to 1.4.0.

Expected Behavior

KEDA should scale the deployment up from zero when the Prometheus metric rises above zero, and scale it down to zero when the metric drops back to zero.

Actual Behavior

KEDA scales the deployment from zero to one replica even when the Prometheus metric is zero, and never scales it down to zero.

Steps to Reproduce the Problem

Deploy KEDA in a GKE cluster using the KEDA Helm chart, version 1.4.0.
Deploy any deployment and a corresponding ScaledObject with minReplicaCount=0, using the Prometheus scaler with a query that returns a constant zero.
Observe that KEDA never scales the deployment down to zero.
Redeploy KEDA with the same Helm chart, but with the images downgraded to 1.3.0 using these values:

image:
  keda: docker.io/kedacore/keda:1.3.0
  metricsAdapter: docker.io/kedacore/keda-metrics-adapter:1.3.0

and observe that KEDA scales the deployment down to zero, and back up from zero when the metric becomes positive, as expected.

Specifications

KEDA Version: 1.4.0
Platform & Version: linux/amd64
Kubernetes Version: 1.15 (1.15.9-gke.24)
Scaler(s): Prometheus

bug scaler-prometheus

hmoravec

All 13 comments

Could you please rerun 1.4.0 with debug log level on KEDA operator? https://github.com/kedacore/keda#setting-log-levels (or set it in the chart).

zroubalik on 22 Apr 2020

And please paste your ScaledObject here.

zroubalik on 22 Apr 2020

apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaling
  labels:
    deploymentName: worker
spec:
  scaleTargetRef:
    deploymentName: worker
  pollingInterval: 30
  cooldownPeriod:  60
  minReplicaCount: 0
  maxReplicaCount: 2
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
      metricName: queue_length
      threshold: '120'
      query: avg_over_time(worker_queue_length[5m])

I don't see anything suspicious in keda-logs.txt, except the last line:
I0423 05:18:21.328041 1 wrap.go:47] GET /openapi/v2: (3.086429ms) 404 [ 172.16.0.23:33094]

KEDA scales replica to 1 from 0 even though the Prometheus metric is zero:
{"level":"info","ts":1587618899.730634,"logger":"scalehandler","msg":"Successfully updated deployment","ScaledObject.Namespace":"staging","ScaledObject.Name":"prometheus-scaling","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"staging","Deployment.Name":"worker","Original Replicas Count":0,"New Replicas Count":1}

hmoravec on 23 Apr 2020
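
To confirm independently of KEDA that the query really evaluates to zero, the response of a Prometheus instant query (`/api/v1/query`) can be inspected directly. The following is a minimal sketch, not KEDA's own code: `firstSampleValue` and `promResponse` are hypothetical names, the sample JSON is illustrative rather than taken from the reporter's cluster, and treating an empty result as zero is an assumption made here for simplicity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// promResponse models only the fields of a Prometheus instant-query
// response that this sketch needs.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			// Each sample is a two-element array: [unix timestamp, "value as string"].
			Value []interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// firstSampleValue (hypothetical helper) returns the value of the first
// sample in the response. An empty result is treated as 0 (an assumption).
func firstSampleValue(body []byte) (float64, error) {
	var r promResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, err
	}
	if r.Status != "success" {
		return 0, fmt.Errorf("prometheus returned status %q", r.Status)
	}
	if len(r.Data.Result) == 0 {
		return 0, nil
	}
	v := r.Data.Result[0].Value
	if len(v) != 2 {
		return 0, fmt.Errorf("unexpected sample shape: %v", v)
	}
	s, ok := v[1].(string)
	if !ok {
		return 0, fmt.Errorf("sample value is not a string: %v", v[1])
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	// Illustrative response body for a query that evaluates to zero.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1587618899.73,"0"]}]}}`)
	val, err := firstSampleValue(body)
	fmt.Println(val, err)
}
```

If this value is zero while the deployment still sits at one replica, the metric itself can be ruled out as the cause.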

It seems like this regression was brought by this change: https://github.com/kedacore/keda/pull/695/files#diff-a63ae5a2f6036b9f3bc750d5fe46437cR105

@droessmj why did you make this particular change? I.e., set the scaler to active even for a value of 0?

zroubalik on 23 Apr 2020

@zroubalik That commit updated a zero result to be non-error inducing. Based on the behavior described above I'm assuming since it's now not throwing errors, the scaler ensures 1 replica is up regardless. The "fix" I introduced just returns zero as a valid metric. If we need a special case where zero is non-error inducing but not a valid metric that can be introduced, but as-is the referenced commit just allows this code to now run for zero results:

    metric := external_metrics.ExternalMetricValue{
        MetricName: metricName,
        Value:      *resource.NewQuantity(int64(val), resource.DecimalSI),
        Timestamp:  metav1.Now(),
    }

droessmj on 23 Apr 2020

@droessmj I see, but the scaler marks itself as Active even when the result is zero. So my only concern is the change in the isActive() function: return val > -1, nil. At first sight, this should be reverted to return val > 0, nil. WDYT? Your change forces the scaler to be Active all the time, so scaling 0<->1 doesn't work.

zroubalik on 23 Apr 2020

I believe you're correct. Reverting the isActive check while retaining the other part should resolve this.

droessmj on 23 Apr 2020
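
The difference between the two checks discussed above can be sketched in isolation. This is a simplified illustration, not KEDA's implementation: the function names are hypothetical, `val` is assumed to be a `float64`, and `nextReplicas` is a deliberately reduced model of the 0<->1 decision that ignores the cooldown period and the HPA.

```go
package main

import "fmt"

// isActiveRegressed mirrors the check quoted in the thread for 1.4.0:
// a zero result still counts as active.
func isActiveRegressed(val float64) bool { return val > -1 }

// isActiveFixed mirrors the proposed revert: only a positive result is active.
func isActiveFixed(val float64) bool { return val > 0 }

// nextReplicas is a hypothetical, simplified model of the 0<->1 decision:
// an active scaler brings a stopped deployment up to one replica, and an
// inactive scaler lets it drop to zero when minReplicas is 0.
func nextReplicas(active bool, current, minReplicas int32) int32 {
	switch {
	case active && current == 0:
		return 1
	case !active && minReplicas == 0:
		return 0
	default:
		return current
	}
}

func main() {
	for _, val := range []float64{0, 5} {
		fmt.Printf("metric=%v regressed: active=%v, fixed: active=%v\n",
			val, isActiveRegressed(val), isActiveFixed(val))
	}
	// With a zero metric and zero replicas, the regressed check scales up to 1,
	// while the fixed check leaves the deployment at 0.
	fmt.Println(nextReplicas(isActiveRegressed(0), 0, 0))
	fmt.Println(nextReplicas(isActiveFixed(0), 0, 0))
}
```

Under this model, `val > -1` reports the scaler as active for a zero metric, which matches the reported behavior of the deployment going from zero to one replica and staying there.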

@hmoravec would you be able to retest the change if I send you a link to a dev image later today?

zroubalik on 23 Apr 2020

It would be great if anybody with a Prometheus instance could check that the fix helped. Just replace the images for the KEDA Operator and the KEDA Metrics Server. Thanks!
docker.io/zroubalik/keda:promFix
docker.io/zroubalik/keda-metrics-adapter:promFix

zroubalik on 23 Apr 2020

@zroubalik Sure, I'll test it. Btw, are automated tests planned? :-)

hmoravec on 23 Apr 2020

We are always open for PRs ;)

But you are right, this shouldn't have slipped through. Sorry about this.