Hi everyone,
I started using KEDA with the Kafka scaler and defined it quite simply, following the example. After deploying it to production, I noticed that roughly every 3 days the pod reaches its Kubernetes memory limit and gets OOM-killed. Memory increases constantly and I'm not really sure why.
This is the deployment description (I added a few parameters such as limits, priorityClass and others):
Name:                   keda-operator
Namespace:              keda
CreationTimestamp:      Mon, 27 Apr 2020 12:56:42 +0300
Labels:                 app=keda-operator
                        app.kubernetes.io/component=operator
                        app.kubernetes.io/name=keda-operator
                        app.kubernetes.io/part-of=keda-operator
                        app.kubernetes.io/version=1.4.1
Annotations:            deployment.kubernetes.io/revision: 2
Selector:               app=keda-operator
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=keda-operator
  Service Account:  keda-operator
  Containers:
   keda-operator:
    Image:      docker.io/kedacore/keda:1.4.1
    Port:       <none>
    Host Port:  <none>
    Command:
      keda
    Args:
      --zap-level=info
    Limits:
      cpu:     100m
      memory:  200Mi
    Requests:
      cpu:     100m
      memory:  200Mi
    Environment:
      WATCH_NAMESPACE:
      POD_NAME:         (v1:metadata.name)
      OPERATOR_NAME:    keda-operator
    Mounts:             <none>
  Volumes:              <none>
  Priority Class Name:  line-of-business-service
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   keda-operator-fd678455c (1/1 replicas created)
Events:          <none>
The heap starts at around 30-40M and rises to almost 200M, then jumps to 240 and above, at which point the pod is OOM-killed and restarted by Kubernetes.
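For a rough sense of scale, the numbers reported above imply an average growth rate that can be estimated with simple arithmetic. The sketch below assumes the growth is linear, takes 35 MiB as the midpoint of the reported 30-40M starting heap, and treats "every 3 days" as 72 hours; the function names are hypothetical and not part of KEDA.

```python
def leak_rate_mib_per_hour(start_mib, limit_mib, hours_to_oom):
    """Average growth rate implied by a linear climb from start to limit."""
    if hours_to_oom <= 0:
        raise ValueError("hours_to_oom must be positive")
    return (limit_mib - start_mib) / hours_to_oom


def hours_until_limit(current_mib, limit_mib, rate_mib_per_hour):
    """Remaining time before the limit is hit at a constant growth rate."""
    if rate_mib_per_hour <= 0:
        return float("inf")  # no growth means the limit is never reached
    return max(limit_mib - current_mib, 0) / rate_mib_per_hour


# Assumed inputs: 35 MiB start, 200Mi limit, OOM after roughly 72 hours.
rate = leak_rate_mib_per_hour(35, 200, 72)
print(f"{rate:.2f} MiB/h")  # prints "2.29 MiB/h"
```

Under these assumptions the operator leaks a little over 2 MiB per hour, so raising the memory limit would only delay the restart rather than prevent it.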
@jeli8-cor thanks for submitting the issue. Are you willing to help a little bit with tracking down the bug? Could you please ping me on slack and we can sync?
Sure, I would love to help with that. How can I find you in slack?
@jeli8-cor great, you can find me on kubernetes slack, #keda channel
After some testing and debugging with @zroubalik, here is an update on the issue:
The issue exists in versions 1.3.0, 1.4.0 and 1.4.1 (these are the ones I checked).
The issue appears to be resolved in v2, which is still in alpha: the pod has now been running for over 3 hours without even a minor jump in memory.
The scaler I used was Kafka, and I tested it by increasing the lag, increasing the throughput, and triggering scale-up and scale-down.
I only tested the Kafka and Prometheus scalers. I'm not 100% sure, but I think the issue affects the Prometheus scaler too.
@jeli8-cor thanks a lot for the testing!
We should speed up development (and release) of v2, in case we are not able to find the cause and provide a fix for this issue in v1.
Adding a note that this should be included in the changelog, so we don't forget.
I also experienced the same memory leak using KEDA v1.4.1 with the Redis list scaler, but I upgraded to v1.5.0 and it looks like that resolved it.
The memory leak still exists in v2.0.0-beta, but it seems to be better. Memory and CPU usage rise slowly.

@lallinger-arbeit would you mind sharing, which scalers (and how many of them) are you using? Thanks
@zroubalik Yeah, we have 23 ScaledObjects, of which 17 use a Kafka trigger and 6 a Redis trigger.
+1 to CPU and memory usage rising over time. 1000 ScaledObjects exclusively using the Kafka scaler:

also experienced the same memory leak issues using keda v1.4.1 with the redis list scaler, but I upgraded to v1.5.0 and looks like that resolved it
I'm running only Redis scalers on 2.0 and can confirm the leak still exists for me.

I will try 2.1 and, if it is still a problem, open a separate issue, as I don't see a direct reference to Redis in this one.