Node-rdkafka: Memory leak on disconnect() method usage with producer

Created on 17 Dec 2019 · 18 Comments · Source: Blizzard/node-rdkafka

Hi,

One of our applications is a Node.js-based microservice that uses the node-rdkafka module to send messages through a Kafka producer. A call scheduled at a fixed interval performs the following operations at runtime:

Invoking kafka producer connect method
Calling producer using kafka produce method to send messages
Invoking kafka producer disconnect method

Disconnect method usage:

const prod = new Kafka.Producer({
  // config properties
});
prod.disconnect();

However, we have been noticing a memory leak when calling the node-rdkafka producer, and we suspect that the disconnect method is causing it.

Thanks!


Rajat-Sharma-2710
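For readers who want to reproduce the cycle described in the report, the sequence (connect, produce, disconnect) might look roughly like the sketch below. The helper name `sendAndDisconnect`, the topic name, and the 5000 ms flush timeout are assumptions made for illustration; `FakeProducer` is a hypothetical in-memory stand-in so the sketch runs without a broker. A real node-rdkafka `Producer` exposes the same `connect`, `produce`, `flush`, and `disconnect` calls and the `ready` event. This only shows the call order; it does not avoid the leak on affected versions.

```javascript
const { EventEmitter } = require('events');

// Sends every payload, waits for the queue to drain, then disconnects.
// `producer` is expected to behave like a node-rdkafka Producer.
function sendAndDisconnect(producer, topic, payloads, done) {
  producer.once('ready', () => {
    for (const payload of payloads) {
      // A null partition lets the client choose the partition.
      producer.produce(topic, null, Buffer.from(payload));
    }
    producer.flush(5000, (flushErr) => {
      producer.disconnect((disconnectErr) => {
        done(flushErr || disconnectErr || null);
      });
    });
  });
  producer.connect();
}

// Hypothetical in-memory stand-in so the sketch runs without a broker.
class FakeProducer extends EventEmitter {
  constructor() {
    super();
    this.sent = [];
    this.connected = false;
  }
  connect() {
    this.connected = true;
    this.emit('ready');
  }
  produce(topic, partition, message) {
    this.sent.push([topic, message.toString()]);
  }
  flush(timeoutMs, cb) {
    cb(null);
  }
  disconnect(cb) {
    this.connected = false;
    cb(null);
  }
}

const fake = new FakeProducer();
let result;
sendAndDisconnect(fake, 'demo-topic', ['a', 'b'], (err) => { result = err; });
console.log(fake.sent.length, fake.connected, result); // 2 false null
```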

Most helpful comment

I've pushed a preliminary fix to the fix-producer-leaks branch. I still need help with testing. It would be great if some of you could try it out, see whether it prevents the Producer memory leaks, and check that events are still flowing in as expected.

Also, it seems that the Consumer leaks when rebalance_cb or offset_commit_cb is set, and this needs to be addressed as well.

iradul on 20 May 2020

👍2

All 18 comments

What version of Node and node-rdkafka are you using? For clarification, are you creating multiple Producer instances and then disconnecting them? Or are there just a few Producer instances instantiated at startup? A simple gist that can isolate and reproduce the problem would be very helpful.

codeburke on 17 Dec 2019

@codeburke Yes, we create a new producer instance for each runtime call (scheduled at a specific time interval), and every call invokes connect() and disconnect().

Node version - 10
node-rdkafka - 2.7.4

Rajat-Sharma-2710 on 20 Dec 2019

👍1

Hi, I think I'm encountering the same issue with KafkaConsumer in my use of node-rdkafka in KafkaSSE. I also connect and then disconnect many KafkaConsumer instances.

I believe the issue has something to do with event_cb. If I set this to false, the memory leak goes away. From what I can tell, the native code is holding on to a reference to the _client in its event handler.

In client.js:

  if (!no_event_cb) {
    this._client.onEvent(function eventHandler(eventType, eventData) {
      ...
    });
  }

In connection.cc's NodeOnEvent:

  Connection* obj = ObjectWrap::Unwrap<Connection>(info.This());

  v8::Local<v8::Function> cb = info[0].As<v8::Function>();
  obj->m_event_cb.dispatcher.AddCallback(cb);

I found this after debugging a memory leak in what I thought was my code. KafkaConsumer references were piling up, but I didn't have any JS references to them. I found the reference to 'self' (The KafkaConsumer instance) in the eventHandler function in client.js:

[Image: Screen Shot 2020-03-18 at 19 56 24]

I'm not sure how to fix this, but likely something needs to remove the dispatcher callback in the C++ code after the client finishes disconnecting.

ottomata on 19 Mar 2020
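The retention pattern ottomata describes can be illustrated in plain JavaScript. The sketch below is not node-rdkafka code: `FakeDispatcher` and `FakeClient` are hypothetical stand-ins, and the dispatcher simply represents any long-lived holder of the callback (in the real library, the native side). It assumes the fix direction suggested above, namely dropping the callback once the client has disconnected.

```javascript
// Hypothetical stand-in for a long-lived holder of callbacks.
class FakeDispatcher {
  constructor() {
    this.callbacks = new Set();
  }
  addCallback(cb) {
    this.callbacks.add(cb);
  }
  removeCallback(cb) {
    this.callbacks.delete(cb);
  }
}

class FakeClient {
  constructor(dispatcher) {
    const self = this;
    this.dispatcher = dispatcher;
    this.events = [];
    // The closure captures `self`, so whoever holds the closure
    // also keeps the whole client reachable.
    this.eventHandler = function eventHandler(eventType, eventData) {
      self.events.push([eventType, eventData]);
    };
    dispatcher.addCallback(this.eventHandler);
  }

  // Leaky variant: the dispatcher still holds the closure afterwards.
  disconnectLeaky() {}

  // Suggested direction: drop the callback once disconnect completes.
  disconnectAndRelease() {
    this.dispatcher.removeCallback(this.eventHandler);
  }
}

const dispatcher = new FakeDispatcher();
const leaky = new FakeClient(dispatcher);
const released = new FakeClient(dispatcher);
leaky.disconnectLeaky();
released.disconnectAndRelease();
console.log(dispatcher.callbacks.size); // 1: only the leaky client's handler remains
```

As long as the handler stays registered, the garbage collector cannot reclaim the client it closes over, even when application code holds no reference to it.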

This doesn't help with the leak, but have you considered not doing that? I've just reviewed, by coincidence, a list of recommendations on mistakes to avoid when using Kafka, and one of the things on the list was continuously reconnecting. But perhaps this isn't possible in your use case.

sam-github on 19 Mar 2020

I'm not really re-connecting that often. KafkaSSE is a Kafka -> HTTP bridge. Each new HTTP connection results in a new KafkaConsumer. The HTTP connections themselves are long lived, but over time as the service operates the memory leak eventually shows up in a big way. In Wikimedia's usage in EventStreams, it currently takes about 8 hours before a process reaches its memory limits and is killed and restarted.

[Image: Screen Shot 2020-03-19 at 14 01 11]

ottomata on 19 Mar 2020

> Each new HTTP connection results in a new KafkaConsumer.

I think that's pretty much the definition of "often"!

Maybe consumers can be cached? Though if each consumer has params specific to the connection, you are out of luck.

sam-github on 19 Mar 2020
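The caching idea sam-github raises might be sketched as follows. `createClientCache` and the stand-in factory are hypothetical names introduced here, not part of node-rdkafka; in a real service the factory would presumably build and connect a `Kafka.Producer` or `Kafka.KafkaConsumer` for the given key. This fits the original report (a producer reused across scheduled calls) better than the per-connection consumer case, where each client has its own subscription parameters.

```javascript
// Hypothetical cache: creates a client once per key and reuses it afterwards.
function createClientCache(factory) {
  const clients = new Map();
  return {
    get(key) {
      if (!clients.has(key)) {
        clients.set(key, factory(key));
      }
      return clients.get(key);
    },
    size() {
      return clients.size;
    },
  };
}

// Stand-in factory so the sketch runs without a broker.
let created = 0;
const cache = createClientCache((key) => ({ id: ++created, key }));

const first = cache.get('orders');
const second = cache.get('orders');
const other = cache.get('payments');
console.log(first === second, created); // true 2
```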

> I think that's pretty much the definition of "often"!

Each connection is long lived and unending. The HTTP response body is streamed to the client in chunked-transfer encoding. We have about 60-80 concurrent connections, with what currently looks like about 6-10 reconnects per minute (which still seems like a lot to me, likely some remote client is not doing it right :p).

Each consumer has specific subscription params (topics, offsets, etc.)

ottomata on 19 Mar 2020

👍1
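To get a feel for the scale, the figures quoted in this thread (6-10 reconnects per minute, roughly 8 hours until the memory limit is hit) can be combined into a rough estimate. This is back-of-the-envelope arithmetic only; it assumes every reconnect leaves exactly one KafkaConsumer behind, which the thread does not measure directly. `leakedConsumers` is a hypothetical helper name.

```javascript
// Rough estimate; assumes one leaked KafkaConsumer per reconnect (an assumption).
function leakedConsumers(reconnectsPerMinute, hours) {
  return reconnectsPerMinute * 60 * hours;
}

console.log(leakedConsumers(6, 8), leakedConsumers(10, 8)); // 2880 4800
```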

BTW all our grafana dashboards are public :D
https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s

ottomata on 19 Mar 2020

Setting event_cb = false does indeed fix the memory leak.

[Image: Screen Shot 2020-03-23 at 11 08 07]

ottomata on 23 Mar 2020

👍1
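A minimal sketch of the event_cb workaround ottomata confirms, assuming the flag is passed in the client's global config (which is what the client.js excerpt earlier in the thread checks). The helper name `withoutEventCallback` and the broker and group values are placeholders introduced for illustration.

```javascript
// Hypothetical helper: returns a copy of a client config with event_cb disabled.
function withoutEventCallback(globalConf) {
  return { ...globalConf, event_cb: false };
}

const consumerConf = withoutEventCallback({
  'group.id': 'example-group',              // placeholder value
  'metadata.broker.list': 'localhost:9092', // placeholder value
});

// With node-rdkafka installed, the config would be passed as usual:
//   const Kafka = require('node-rdkafka');
//   const consumer = new Kafka.KafkaConsumer(consumerConf, {});
console.log(consumerConf.event_cb); // false
```

The likely trade-off is that a client created this way no longer registers the native event handler, so notifications delivered through it would presumably not be emitted.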

Hi, I think there is a second memory leak, affecting the HighLevelProducer, that is pretty similar to the one related to event_cb. This time it's related to dr_cb and dr_msg_cb. As with the event_cb bug, the native code is holding a reference to self in its event handler.

https://github.com/Blizzard/node-rdkafka/blob/583d24dc0769011eaf5d2e6d853c4fd0c17783ac/lib/producer.js#L89-L98

Unfortunately, just applying the event_cb fix by setting dr_cb and dr_msg_cb to false won't work, because the HighLevelProducer will automatically set dr_cb to true again.

https://github.com/Blizzard/node-rdkafka/blob/583d24dc0769011eaf5d2e6d853c4fd0c17783ac/lib/producer/high-level-producer.js#L89

ArneSchulze on 5 May 2020

We have seen this issue on our producer as well. As a workaround, we set dr_cb = false when creating the producer. Is there any plan to address this type of mem leak due to callback?

spicy-taco on 19 May 2020
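The dr_cb workaround, and the reason it does not carry over to HighLevelProducer, might be sketched like this. Both helpers are hypothetical: `withoutDeliveryReports` only builds a config object, and `highLevelStyleConf` is a simplified model of the override ArneSchulze describes, not the library's actual code.

```javascript
// Hypothetical helper: disables delivery-report callbacks in a producer config.
function withoutDeliveryReports(producerConf) {
  return { ...producerConf, dr_cb: false, dr_msg_cb: false };
}

// Simplified model of a wrapper that forces dr_cb back on.
function highLevelStyleConf(userConf) {
  const conf = { ...userConf };
  conf.dr_cb = true; // the user's `false` is overwritten here
  return conf;
}

const plainConf = withoutDeliveryReports({
  'metadata.broker.list': 'localhost:9092', // placeholder value
});
const overridden = highLevelStyleConf(plainConf);

console.log(plainConf.dr_cb, overridden.dr_cb); // false true
```

With delivery reports disabled, a plain Producer does not emit delivery-report events, so an application that relies on them for acknowledgements would need another way to track delivery.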

> Is there any plan to address this type of mem leak due to callback?

Yes. I'm working on it.

iradul on 20 May 2020

Also, it seems that the Consumer leaks when rebalance_cb or offset_commit_cb is set, and this needs to be addressed as well.

iradul on 20 May 2020

👍2

Hi @iradul,
I just tried out your fix and it looks like the memory leak in HighLevelProducer is gone now! The memory allocation timeline of our service running with your fix shows a much cleaner heap.

In comparison, this is the memory allocation timeline of our service running on the latest master.

[Image: memory-allocation-master]

The active objects allocated in the middle of the master graph are HighLevelProducer instances, which should already have been garbage collected by the end of the recording.

ArneSchulze on 26 May 2020

👍1

@ArneSchulze thanks for testing this!

iradul on 26 May 2020

This is fixed in the latest version, 2.9.0.

iradul on 14 Jun 2020

I think the same issue is present in the consumer as well. Even after the consumer disconnects, queued messages are kept in memory. The following code snippet leaks memory.

Library version: 2.9.1

```
this.consumer = new Kafka.KafkaConsumer({
  'group.id': Math.random().toString(),
  'metadata.broker.list': this.kafkaUrl,
  'enable.auto.commit': false,
  'queued.max.messages.kbytes': 10240, // 10 MB queue size
}, { 'auto.offset.reset': 'earliest' });

this.consumer.assign([{
  topic: this.topic,
  partition: 0,
  offset: this.from_offset,
}]);

this.consumer.unassign();
this.consumer.disconnect();
this.consumer = null;
```