Keda: [Epic] Enable KEDA to be reliably scaled out

Created on 22 May 2019 · 18 Comments · Source: kedacore/keda

Today KEDA runs as a single replica in the cluster. Ideally, before we go 1.0, we would have a way for KEDA to scale out and partition work across the available ScaledObjects for better reliability. (What would happen if KEDA crashed? Does it gracefully recover?) Also, if I had hundreds of ScaledObjects each polling, I might need to scale out my KEDA controller. Ideally there would be some logic that lets multiple replicas each poll event sources without needless duplication.

Epic

All 18 comments

That's a fair point actually, we should see if we need to:

  • Make sure our instances are spread across the nodes
  • Partition the ScaledObjects across instances

"What would happen if KEDA crashed? Does it gracefully recover?"

The pod will always get restarted on a crash. KEDA itself is stateless, so it always recovers.

Given the Kubernetes controller pattern of listing objects by type or by selectors on the objects themselves, what are the options for multi-instance controllers?

I was thinking we can let KEDA listen on all namespaces by default, but allow configuration to set the namespaces it listens to. That way you could partition your cluster however you like by namespace.

Another way to partition within a single namespace is to allow custom taints/tolerations on the ScaledObject, which lets users decide which ScaledObject goes to which KEDA controller instance.

Letting the scaled objects dictate partitions, though, might push more onus on the cloud operator when they don't know the scale we can support in KEDA per scale controller. I like the partitioning on namespaces, though that might be too coarse-grained. Or can we auto-assign the labels that @yaron2 is mentioning, so we keep the partitioning logic internal? When a scale controller sees a new scaled object, it atomically assigns a label that dictates which scale controller is in charge of that ScaledObject.
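One way to picture the auto-assigned label is a stable hash over the ScaledObject's namespace and name. The sketch below is only an illustration of that idea, not anything KEDA implements: the label key `keda.example/owner`, the `controller-N` naming, and both function names are hypothetical.

```python
import hashlib


def owner_index(namespace: str, name: str, replicas: int) -> int:
    """Map a ScaledObject to one of `replicas` controller instances.

    A cryptographic hash is used only because it is stable across
    processes (Python's built-in hash() is randomized per process).
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    key = f"{namespace}/{name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % replicas


def owner_label(namespace: str, name: str, replicas: int) -> dict:
    """Build the (hypothetical) label a controller would attach."""
    index = owner_index(namespace, name, replicas)
    return {"keda.example/owner": f"controller-{index}"}
```

Every controller computes the same owner for the same object, so no coordination is needed for the assignment itself. The trade-off is that changing the replica count reshuffles most assignments under plain modulo hashing.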

Another possible option is a leader lock. The idea is that we set replicas to > 1 on the KEDA deployment. The first controller to come up sets two annotations on the KEDA deployment. The first annotation is the unique GUID of the controller itself, e.g. leader: AAAA-AAA-AAAA-AA. The second annotation is the time last updated. If the annotations are already set, the other pods go into a sleep loop. The leader keeps refreshing the time-last-updated annotation on the deployment (a heartbeat). Every 10 seconds the other pods check whether the two annotations are set and whether the time difference is greater than 10 seconds, i.e. has the leader checked in? If the leader hasn't, the first controller to detect that takes over the annotations.
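The heartbeat logic described in this comment can be sketched with an in-memory stand-in for the Deployment's annotations. This is a simplified model under stated assumptions: `AnnotationStore`, `try_acquire_or_renew`, and the `lastUpdated` annotation key are hypothetical names, and no Kubernetes API is called.

```python
from dataclasses import dataclass, field
from typing import Dict

# From the comment above: the leader is considered gone once its
# heartbeat is more than 10 seconds old.
LEASE_SECONDS = 10.0


@dataclass
class AnnotationStore:
    """In-memory stand-in for the annotations on the KEDA Deployment."""
    annotations: Dict[str, str] = field(default_factory=dict)


def try_acquire_or_renew(store: AnnotationStore, controller_id: str, now: float) -> bool:
    """Return True if `controller_id` is the leader after this call.

    The caller becomes (or stays) leader when no leader is recorded,
    when it already holds the lock, or when the heartbeat has expired.
    """
    leader = store.annotations.get("leader")
    last = store.annotations.get("lastUpdated")  # hypothetical key name
    expired = last is None or now - float(last) > LEASE_SECONDS
    if leader is None or leader == controller_id or expired:
        store.annotations["leader"] = controller_id
        store.annotations["lastUpdated"] = repr(now)
        return True
    return False
```

Each replica would call this on its 10-second tick and only poll event sources while it returns True. Against a real cluster, the read and write would need to be a single conditional update (for example, guarded by the object's resourceVersion) so that two controllers cannot both take over an expired lock.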

We're already seeing this as an issue during testing in our staging environments.

For reference we have about 200 queues across two different namespaces / rabbitmq clusters.

Using the KEDA deployment works, but it's not keeping up with low-interval patterns.

Reliably scaling out the pieces needed to ensure the intervals are kept would be great.

Also note that we are seeing the time to list the queuelength metrics grow linearly with the number of queues we have. Right now it's pushing 23s and climbing.

I was able to fix most of my performance issues by changing the default replica count from 1 to 8. Just in case anyone else runs into the same issue.

Also note that the deployment from the Helm chart seems to be an older version than what is deployed via the KedaScaleController.yaml.

Thanks @sc-chad - I think there is a valid work item as well to do some load testing with KEDA, to see how many queues a single replica can be expected to handle on something like a standard AKS / GKE node.

@zroubalik - @anirudhgarg was going to look at some other ways we may be able to scale. Would be good if you could coordinate if you were looking at the namespace 'short term' fix

An operator built with operator-sdk (I am currently working on this) listens to a single namespace by default; this can easily be changed (it is just a configuration change) to listen to all namespaces.

There might also be an option to scale the number of goroutines handling reconciliations in the operator via MaxConcurrentReconciles.

@zroubalik just confirming whether WATCH_NAMESPACE will work with KEDA now, given that we are running on the operator SDK?

@jeffhollan yes, setting this env variable should do the job, and the operator will serve just that namespace. The variable appears in two places in the Deployment, once for the operator container and once for the metrics adapter container; you need to modify both.
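Because the variable has to be changed in both containers, it can help to patch the manifest programmatically rather than by hand. The helper below is a rough sketch that works on a Deployment manifest already loaded into a Python dict; the function name `set_watch_namespace` is hypothetical, and it assumes every container in the pod template should receive the same value.

```python
import copy


def set_watch_namespace(deployment: dict, namespace: str) -> dict:
    """Return a copy of `deployment` with WATCH_NAMESPACE set on every container.

    Assumes the standard Deployment layout
    (spec.template.spec.containers[].env[]).
    """
    patched = copy.deepcopy(deployment)
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        env = container.setdefault("env", [])
        for entry in env:
            if entry.get("name") == "WATCH_NAMESPACE":
                # Replace any existing value or valueFrom reference.
                entry.pop("valueFrom", None)
                entry["value"] = namespace
                break
        else:
            # No existing entry was found, so add one.
            env.append({"name": "WATCH_NAMESPACE", "value": namespace})
    return patched
```

The input manifest is left untouched, so the patched copy can be diffed against the original before it is applied.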

You can run multiple instances of KEDA on the cluster with this setting, but you need to modify the other resources (i.e. ClusterRoles, APIService, ...) to avoid conflicts. In particular, change the namespace of the operator deployment and rename conflicting ClusterRoles (or convert them to namespaced Roles). It is a very simple change.

https://github.com/operator-framework/operator-sdk/blob/master/doc/operator-scope.md

Thanks @zroubalik - I'm going to keep open until we get something like a helm chart or some deployment yamls that make what you described a bit easier to pull off.

This might be relevant (in case the issue is confirmed): #470

If we want to run multiple KEDA controllers in the cluster, we will have to redesign the metrics adapter and decouple it from the KEDA operator,
i.e. we will have one metrics adapter in the cluster and multiple KEDA controllers pointing to that one metrics adapter.
I am curious whether the performance problems are still relevant, since the operator was rewritten with the operator-sdk framework and a lot of the code (locks, ...) was removed during the refactoring.

@sc-chad could you please confirm that you are still hitting the performance issues with v1.0?

Absolutely. Was waiting on 1.0 to retest my original setup. Thanks guys.

Thanks for re-testing @sc-chad!

Ok to close?
