Keda: Multiple pods spawned on a single message in the Azure Storage Queue

Created on 23 Dec 2019 · 12Comments · Source: kedacore/keda

Description

I’ve been using KEDA Jobs to execute a rather CPU-intensive job based on an Azure Storage Queue trigger. However, due to it needing quite some resources, this sometimes means the cluster itself has to scale by provisioning extra nodes in order to be able to assign the necessary resources and start the job. Because this takes some time (we're talking minutes here), extra jobs for the same message in the queue start to spawn in a pending state; resulting in another scale up of nodes. This goes on until the maximum amount of pods is reached.

Expected Behavior

I expected KEDA to trigger only once per message added to the queue, resulting in 1 pod per message.

Actual Behavior

Currently, KEDA spawns a pod each time the pollingInterval has passed. This results in pods being spawned every minute until the maxReplicaCount has been reached. Next to that, due to Azure's Autoscaling feature, additional nodes will be provisioned to handle the extra pods.

Steps to Reproduce the Problem

Create the following ScaledObject that requires many resources and contains a job that runs for hours/days:

apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: consumer
  namespace: default
spec:
  scaleType: job
  pollingInterval: 60
  maxReplicaCount: 100
  cooldownPeriod: 30
  jobTargetRef:
    parallelism: 1
    completions: 1
    backoffLimit: 1
    template:
      spec:
        containers:
        - name: consumer
          image: registry.gitlab.xxx.com/consumer:latest
          resources:
            requests:
              memory: "1024Mi"
              cpu: "1000m"
          env:
            - name: AzureWebJobsStorage
              valueFrom:
                secretKeyRef:
                  name: az-queue-secret
                  key: AzureWebJobsStorage
            - name: QUEUE_NAME
              value: consumer-test
  triggers:
  - type: azure-queue
    metadata:
      queueName: consumer-test
      queueLength: '1'
      connection: AzureWebJobsStorage # Based on secret

Add a message to the queue
Notice that multiple pods get spawned on only one message when insufficient resources are available and Azure needs to autoscale the node-pool.

Specifications

Version: v1.0.0
Platform: Azure Kubernetes Service 1.14.8
Scaler(s): Azure Storage Queue (see yaml above)

How would I be able to force KEDA to only trigger once per message added to the queue?

azure bug

Source

JoooostB

Most helpful comment

I am trying to work on this one. Should we just set maxScale to be maxScale - runningJobs. This should limit the scaling.
If we want to have one message per job, this would mean that the queueLength must be 1.

balchua on 29 Dec 2019

❤1 👍1

All 12 comments

Looking at the code, I don't think there is a way. Perhaps the job scaler needs to first check how many Job are not completed/failed before starting a new Job.
This will mean that the job must dequeue the message when it is done and not before.

balchua on 27 Dec 2019

👍1

Will this be a potential feature or should I look into alternatives?

JoooostB on 27 Dec 2019

balchua on 29 Dec 2019

❤1 👍1

Many thanks @balchua, your explanation to resolve this issue sounds completely valid to me. Love to see that you've already submitted a pull request. Looking forward to it being merged :). If I can do anything in terms of documentation or something else; please let me know.

JoooostB on 29 Dec 2019

@JoooostB the pr has been merged. Can help test it? It might be in the master tag.
docker pull kedacore/keda:master

balchua on 16 Jan 2020

@balchua I've been running it for a few days based on a custom build from your repo. The main issue seems to be resolved, no extra pods are spawned on the same message. However, when more messages enter the queue, the amount of pods/jobs never exceeds 1.

The code seems to look at the queueLength and use this as the maximum scale target. But the issue with setting queueLength to a higher value and a low pollingInterval is that whenever the polling occurs, a new job gets deployed until the message has been dequeued. My application takes a rough 30 seconds to start before it dequeues the message, resulting in at least 6 pods (30/5s). An alternative would be setting the pollingInterval to 45-60s, but that results in some massive latency from creating the message until executing. I think we should reconsider the following comment from @Cottonglow https://github.com/kedacore/keda/pull/528#issuecomment-571197975

apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: queue-job
  namespace: default
spec:
  scaleType: job
  pollingInterval: 5
  maxReplicaCount: 10
  cooldownPeriod: 30
  jobTargetRef:
    parallelism: 10
    completions: 1
    backoffLimit: 1
    template:
      spec:
        containers:
        - name: queue
          image: registry.gitlab.xxx.nl/k8s/queue:slim
          imagePullPolicy: Always
          resources:
            requests:
              memory: "512Mi"
              cpu: "1000m"
          env:
            - name: AzureWebJobsStorage
              valueFrom:
                secretKeyRef:
                  name: az-queue-secret
                  key: AzureWebJobsStorage
            - name: QUEUE_NAME
              value: queue-test
        imagePullSecrets:
        - name: gitlab-registry
  triggers:
  - type: azure-queue
    metadata:
      queueName: queue-test
      queueLength: '1'
      connection: AzureWebJobsStorage # Based on secret

JoooostB on 17 Jan 2020

Apologies @JoooostB. Let me visit this again. Maybe playing with the maxScale alone isnt enough.

balchua on 17 Jan 2020

To prevent any misunderstanding, I'll try to explain it as thorough as possible below 😃

Strictly speaking, I never want more than one job or pod for a single message, I just want it to spawn a new one whenever a new message comes in.
All jobs are long running (multiple days), let me try to explain using an example:

If message A enters the queue, the queueLength becomes 1 and a new pod should be spawned which will work on that message until it completes. To prevent confusion on if the message has already been ‘used’ or not, we instantly destroy the message from the queue. The queueLength becomes 0 again.

Running pods:

longrunning-job-856b464c88-hpnqj (message A)

One minute later another message (message B) enters the queue, so the queueLength becomes 1 again. I now expect it to just spawn a new pod next to the already running pod:

Running pods:

longrunning-job-856b464c88-hpnqj (message A)
longrunning-job-442a356c55-dojaq (message B)

But in the current situation, KEDA will always spawn the amount of pods as defined in the queueLength or it will wait for the previous job to be finished. This is not a wanted solution, I want the separate jobs to be running at the same time but never want more than one pod spawned per job/message in the queue. Would this be possible to implement?

JoooostB on 20 Jan 2020

To add one more scenario, if the job creation is slow (for example it takes minutes), and the polling time kicks in, no new pods must be created.
Prior to my pr, jobs are scaled from 0 up to the queueLength. Perhaps i need to bring this change back. @jeffhollan ?

balchua on 20 Jan 2020

Hi i am going to send a PR to address this one.
The approach i took on this is to check the pod status.
How the number of Jobs to scale is calculated is <number of msgs in queue> - <jobs where all pods are not in "Running" or "Completed"> states. Still limited by the threshold.

balchua on 7 Feb 2020

Thank you @balchua for your PR, it will help when pulling big images or scaling the cluster to handle the Job requirements.
This will also work for @JoooostB usecase where he directly deletes the message at pod startup.

But his solution is not taking advantage of the visibilityTimeout of the Azure Storage Queue.
If you want to take a look at my issue #636 where I suggest another approach for the remaining bug.
Thanks 😄

thomas-lamure on 20 Feb 2020

Thanks @thomas-lamure the PR is to tackle more general scaling situations like you mentioned. I will be happy if someone else can test the PR.
I also saw your issue you created, and i think it is best that it is handled at the scaler itself.

balchua on 20 Feb 2020

Was this page helpful?

0 / 5 - 0 ratings