Kafka-node: Keep on getting FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]

Created on 25 Aug 2014 · 25 Comments · Source: SOHU-Co/kafka-node

Hi Team.

I am trying to produce and consume Kafka messages using the node library (kafka-node) with the HighLevelConsumer API, but I keep getting this exception at random times, and the node.js server stops.

FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
at new FailedToRebalanceConsumerError (/home/strg/project/kafkaBroker/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
at /home/strg/project/kafkaBroker/node_modules/kafka-node/lib/highLevelConsumer.js:141:71

I am not sure what the issue is here.

I have set the zookeeper timeout to 50000.

This is my High level consumer code:

var kafka = require('kafka-node');
var Consumer = kafka.HighLevelConsumer;

var consumer = new Consumer(
    client,
    [
        { topic: consumeTopic } // consumeTopic is the topic which the user provided
    ],
    {
        autoCommit: false
    }
);

consumer.on('message', function (message) {
    console.log(message);
});

If I restart the server it works fine, but after a while I get this exception again. Can anyone please guide me on this? I do not understand what this exception means. I tried restarting the zookeeper server and the kafka server, but I am still facing this exception. Any help would be much appreciated, as I am very new to Kafka.


pradeepsimha143

All 25 comments

Under what circumstances is the rebalance happening - are you stopping the consumer using a CTRL-C?

jezzalaycock on 27 Aug 2014

I also get that exception occasionally. Restarting my consumer fixes it, but obviously that is not the way to fix it. I'm using a HighLevelConsumer as well, with only one zookeeper and one broker. I'm using the latest 0.8.2.1 version of Kafka.

2015-03-17T17:20:37.379Z - error: error FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
at new FailedToRebalanceConsumerError (/Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:170:51
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:419:17
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:240:13
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:144:21
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:237:17
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:600:34
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:399:29
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:144:21
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:389:41

jcastill0 on 17 Mar 2015

Anyone else having this problem? It happens periodically for me when I start the consumer (HighLevel).
As I said above, my kafka setup is very simple, just one broker and one zookeeper. Usually after the second attempt it will not throw that exception. I'm using the latest version (0.2.24).
Any insight would be appreciated.

thanks

** julio

jcastill0 on 23 Mar 2015

I'm facing the same problem too, however, I'm generating the clientId in a random fashion to avoid creating two consumers (on the same topic) with the same clientId using:

clientId = "worker-" + Math.floor(Math.random() * 10000)

AhmedSoliman on 5 Apr 2015
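Math.floor(Math.random() * 10000) can only produce 10,000 distinct values, so two workers can occasionally draw the same one. A minimal sketch of the same idea with a larger ID space is below. makeClientId is a hypothetical helper name, and nothing in this thread confirms that ID collisions are the cause of the error.

```javascript
var crypto = require('crypto');

// Sketch only: `makeClientId` is a hypothetical helper, not part of kafka-node.
// It swaps Math.random() for 8 random bytes rendered as 16 hex characters.
function makeClientId(prefix) {
    return prefix + '-' + crypto.randomBytes(8).toString('hex');
}

// Possible use:
// var clientId = makeClientId('worker');
```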

Is there any progress being made on this? I am deploying an application across several nodes elastically and am getting this error about every other time I start up an instance. I tried upping the retry attempts to 30 (in the HighLevelConsumer's rebalance() function) and it would get as high as 24 before finally succeeding. I am nervous about just picking a big number and expecting that to work, though.

@jcastill0 In what way are you able to restart your high-level consumer? I am trying to use consumer.on('error', ...) as a way to catch and restart, but can I reuse my consumer? My client? I would appreciate a pointer :)

ericdolson on 24 Apr 2015

Same problem here, and @AhmedSoliman's solution does not seem to help... Any news?

syymza on 25 Jun 2015

Randomizing the group ID worked for my tests. I don't understand enough of kafka to know if that will mess up production if production uses a fixed group ID? Or maybe production can use a random client ID and all will be well?

CWSpear on 30 Jun 2015

Is this a node issue or a kafka issue? I am having the same problem.

Ovidiu-S on 16 Jul 2015

Node Exists is normally due to the zookeeper timeout. Under certain circumstances (CTRL-C for instance) the ephemeral nodes don't get removed. If you're not bothered about balancing a number of consumers on a topic, then I suggest you try the normal KafkaConsumer and not the HighLevel one.

jezzalaycock on 16 Jul 2015

@CWSpear How are you dealing with offset commits while using a random consumer group ID?

I was sure that was what was used to keep track of consumed offset, and we use a fixed group ID in production for that reason...

felipesabino on 21 Jul 2015

@felipesabino no idea. I'm actually using a company-specific library wrapped around HighLevelConsumer, and I have dug through the code some, but I haven't been able to get very deep, so many of the inner-workings are over my head.

I'm pretty sure something's going on not in my code specifically, but either in the company's lib, or in kafka and just trying to get to the bottom of it. It's been rather bothersome, and I'm not the only one experiencing issues similar to this, but for now, randomizing the IDs works for tests. It's QA's problem now, right? ;-)

CWSpear on 22 Jul 2015

We managed to easily reproduce this error in our environment by killing the consumer process and starting it again quickly.

We noticed that whenever our server restarted and we tried to reconnect before zookeeper had killed the connection (session) on its side, this exception would be thrown.

To know that zookeeper killed the connection, look for a message that looks like the following:

INFO  [SessionTracker:ZooKeeperServer@347] - Expiring session 0x14eb7c676540001, timeout of 30000ms exceeded

So far we have managed to avoid any FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] errors on server restarts just by delaying any reconnection by at least this session timeout. You can find this value for your server in your zookeeper config file, under the maxSessionTimeout param - http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html

Also, this behavior is consistent with what @CWSpear reported, as randomizing the clientId will force zookeeper to create a new session for every new connection attempt and the exception would not be thrown. But that is far from ideal, as the clientId is what will be used to keep track of your committed offsets...

We are still watching to see whether it occurs randomly; if it does, maybe a similar approach should be taken with the rebalance logic. Anyway, we will keep you posted.

felipesabino on 23 Jul 2015

👍5
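For anyone who wants to detect the expiry message programmatically rather than by eye, a small sketch follows. parseSessionExpiry is a hypothetical helper, and its pattern is based only on the single log line quoted in the comment above, so it may not match other zookeeper versions or log layouts.

```javascript
// Sketch only: `parseSessionExpiry` is a hypothetical helper. The pattern is
// derived from the one log line quoted in this thread and nothing else.
function parseSessionExpiry(line) {
    var match = /Expiring session (0x[0-9a-fA-F]+), timeout of (\d+)ms exceeded/.exec(line);
    if (!match) {
        return null; // not a session-expiry line
    }
    return { sessionId: match[1], timeoutMs: parseInt(match[2], 10) };
}
```

The timeoutMs value it extracts is the figure to exceed when delaying a reconnection.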

I managed to get rid of the issue by setting

maxSessionTimeout=5000

and delaying the client connection by 6 seconds. Just use setTimeout() for that, and it will work just fine.

Ovidiu-S on 12 Aug 2015

👍1
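The delay described in the comment above could be sketched roughly as follows. startAfterSessionExpiry and its parameters are hypothetical names introduced here, and the 1000 ms margin is simply the difference between the 5000 ms timeout and the 6 second delay mentioned above.

```javascript
// Sketch only: `startAfterSessionExpiry` is a hypothetical helper, not a
// kafka-node API. `schedule` defaults to setTimeout and is a parameter only
// so the behaviour can be checked without real waiting.
function startAfterSessionExpiry(sessionTimeoutMs, marginMs, start, schedule) {
    schedule = schedule || setTimeout;
    // Wait somewhat longer than the session timeout so the previous
    // session should have expired before a new consumer registers.
    var delayMs = sessionTimeoutMs + marginMs;
    schedule(start, delayMs);
    return delayMs;
}

// Possible wiring with the numbers from the comment above
// (maxSessionTimeout=5000 in zoo.cfg, connect about 6 seconds later):
// startAfterSessionExpiry(5000, 1000, function () {
//     // create the Client and HighLevelConsumer here
// });
```

The cost of this approach is that every restart is slowed by the full delay, even when no stale session exists.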

@Ovidiu-S I could not find any maxSessionTimeout in the docs or code... do you mean sessionTimeout from node-zookeeper-client?

felipesabino on 12 Aug 2015

@felipesabino I am referring to the zookeeper server config, not the client. It is the maxSessionTimeout in the zoo.cfg file

Ovidiu-S on 13 Aug 2015

bendpx on 8 Sep 2015

👍1

Any update? I'm facing the same problem.

barockok on 4 Feb 2016

@barock19 I recommend switching to Kafka 0.9 and the new (zookeeper-free) client once it is released: https://github.com/oleksiyk/kafka

Until then ... just bypass the re-balancing issue with the above fix.

Ovidiu-S on 4 Feb 2016

This problem went away when I added a handler for the CTRL+C case. This ensures the consumer/client is cleaned up; otherwise you are at the mercy of whenever the zookeeper node times out.

process.on('SIGINT', function() {
    highLevelConsumer.close(true, function(){
        process.exit();
    })
});

hyperlink on 4 Feb 2016

👍12

As suggested by @hyperlink, the problem is down to the fact that the ephemeral nodes are not relinquished in zk when issuing a CTRL-C (SIGINT). Under normal failure cases the nodes are released as expected.

Moving to kafka 0.9 will require wholesale changes to the node client. However, I believe the kafka guys are creating a node client, so it might be that we can simply switch to using that when available.

jezzalaycock on 4 Feb 2016

I'm having the same issue. I tried changing the zoo.cfg maxSessionTimeout and also closing the high level consumer on SIGINT. I also tried closing the client itself. Same result.

mllanes on 25 Mar 2016

Using the suggested handler with a small modification fixed the issue for me:

Add a connection.close() call in the consumer close callback.
Put the process.exit() call in the connection.close() callback.

rtorrero on 29 Apr 2016
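The modification @rtorrero describes might look like the sketch below. It assumes that connection refers to the kafka-node Client the consumer was created with and that its close() accepts a completion callback. The shutdown helper and its exit parameter are hypothetical names introduced here.

```javascript
// Sketch only: `shutdown` and its parameters are illustrative names, not part
// of kafka-node. `consumer` is expected to behave like a HighLevelConsumer and
// `connection` like the Client it was built on.
function shutdown(consumer, connection, exit) {
    // Close the consumer first (the `true` flag is kept from the original handler)...
    consumer.close(true, function () {
        // ...then close the underlying connection...
        connection.close(function () {
            // ...and exit only once both callbacks have fired.
            exit();
        });
    });
}

// Possible wiring (commented out so the sketch has no side effects):
// process.on('SIGINT', function () {
//     shutdown(highLevelConsumer, client, function () { process.exit(); });
// });
```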

@hyperlink's method works fine for me.

UPDATE: It still happens, and I have found out the real problem; please refer to #369.

springuper on 6 May 2016

Same issue on my side. And I can't catch the SIGINT because AWS Elastic Beanstalk somehow does not send one. I'm pretty sure @springuper's PR might fix this.

pcothenet on 12 Jul 2016
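Where a platform stops the process with a signal other than SIGINT, the same cleanup can be registered for several signals. The sketch below uses a hypothetical onSignals helper. Whether Elastic Beanstalk delivers SIGTERM (or any catchable signal) is an assumption that this thread does not confirm, and SIGKILL can never be caught.

```javascript
// Sketch only: `onSignals` is a hypothetical helper. `proc` is any object
// with an `on(event, handler)` method, such as Node's `process`.
function onSignals(proc, signals, cleanup) {
    var done = false;
    signals.forEach(function (signal) {
        proc.on(signal, function () {
            if (done) {
                return; // run the cleanup at most once
            }
            done = true;
            cleanup(signal);
        });
    });
}

// Possible wiring:
// onSignals(process, ['SIGINT', 'SIGTERM'], function () {
//     highLevelConsumer.close(true, function () { process.exit(); });
// });
```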

I have seen this issue occur when the client (zookeeper) loses its connection and regains it soon after.
The rebalance logic should account for the zookeeper session timeout, as already specified here: https://github.com/SOHU-Co/kafka-node/issues/90#issuecomment-123893422
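
One way to read this suggestion is to space retries so that their total span exceeds the session timeout. The sketch below is only an illustration of that idea and is not how kafka-node implements rebalancing. retryWithDelay and all of its parameters are hypothetical names. With the 30000 ms timeout from the log line quoted earlier, four retries 10000 ms apart would span 40 seconds.

```javascript
// Sketch only: `retryWithDelay` is a hypothetical helper. `task` takes a
// Node-style callback (err, result). `schedule` defaults to setTimeout and
// is a parameter only so the behaviour can be checked without real waiting.
function retryWithDelay(task, retries, delayMs, done, schedule) {
    schedule = schedule || setTimeout;
    task(function (err, result) {
        if (!err) {
            return done(null, result);
        }
        if (retries <= 0) {
            return done(err); // out of retries: report the last error
        }
        schedule(function () {
            retryWithDelay(task, retries - 1, delayMs, done, schedule);
        }, delayMs);
    });
}

// Possible use (attemptRebalance is a placeholder for whatever performs the
// registration that fails with NODE_EXISTS):
// retryWithDelay(attemptRebalance, 4, 10000, function (err) {
//     if (err) { console.error('still failing after retries', err); }
// });
```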