Kafka-node: Keep on getting FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]

Created on 25 Aug 2014  ·  25 Comments  ·  Source: SOHU-Co/kafka-node

Hi Team.

I am trying to produce and consume Kafka messages using the Node library (kafka-node) with the HighLevelConsumer API, but I keep getting this exception at random times, and the Node.js server stops.

FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
at new FailedToRebalanceConsumerError (/home/strg/project/kafkaBroker/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
at /home/strg/project/kafkaBroker/node_modules/kafka-node/lib/highLevelConsumer.js:141:71

I am not sure what the issue is.

I have set the ZooKeeper timeout to 50000.

This is my High level consumer code:

consumer = new Consumer(
    client,
    [
        { topic: consumeTopic } // consumeTopic is the topic the user provided
    ],
    {
        autoCommit: false
    }
);

consumer.on('message', function (message) {
    console.log(message);
});

If I restart the server it works fine, but after a while I get this exception again. Can anyone please guide me on this? I don't understand what this exception means. I tried restarting the ZooKeeper server and the Kafka server, but I still get this exception. Any help would be appreciated, as I am very new to Kafka.

All 25 comments

Under what circumstances is the rebalance happening - are you stopping the consumer using a CTRL-C?

I also get that exception occasionally. Restarting my consumer fixes it, but obviously that is not the way to fix it. I'm using a HighLevelConsumer as well, with only one zookeeper and one broker. I'm using the latest 0.8.2.1 version of Kafka.

2015-03-17T17:20:37.379Z - error: error FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110]
at new FailedToRebalanceConsumerError (/Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/errors/FailedToRebalanceConsumerError.js:11:11)
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:170:51
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:419:17
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:240:13
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:144:21
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:237:17
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:600:34
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:399:29
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/node_modules/async/lib/async.js:144:21
at /Users/jcastillo/dev/svcbus/ConsumerSFSvc/node_modules/kafka-node/lib/highLevelConsumer.js:389:41

Anyone else having this problem? It happens periodically for me when I start the consumer (HighLevel).
As I said above, my kafka setup is very simple, just one broker and one zookeeper. Usually after the second attempt it will not throw that exception. I'm using the latest version (0.2.24).
Any insight would be appreciated.

thanks

** julio

I'm facing the same problem too, however, I'm generating the clientId in a random fashion to avoid creating two consumers (on the same topic) with the same clientId using:

clientId = "worker-" + Math.floor(Math.random() * 10000)

Is there any progress being made on this? I am deploying an application across several nodes elastically and am getting this error about every other time I start up an instance. I tried upping the retry attempts to 30 (in the HighLevelConsumer's rebalance() function) and it would get as high as 24 before finally succeeding. I am nervous about just picking a big number and expecting that to work, though.

@jcastill0 In what way are you able to restart your high-level consumer? I am trying to use consumer.on('error', ...) as a way to catch and restart, but can I reuse my consumer? My client? I would appreciate a pointer :)

Same problem here and @AhmedSoliman 's solution does not seem to help... Any news?

Randomizing the group ID worked for my tests. I don't understand enough of kafka to know if that will mess up production if production uses a fixed group ID? Or maybe production can use a random client ID and all will be well?

Is this a Node issue or a Kafka one? I am having the same problem.

Node Exists is normally due to the zookeeper timeout. The ephemeral nodes under certain circumstances (CTRL-C for instance) don't get removed. If you're not bothered about balancing a number of consumers on a topic then I suggest you try the normal KafkaConsumer and not the HighLevel one

@CWSpear How are you dealing with offsets commits while using random consumer group ID?

I was sure that was what was used to keep track of consumed offset, and we use a fixed group ID in production for that reason...

@felipesabino no idea. I'm actually using a company-specific library wrapped around HighLevelConsumer, and I have dug through the code some, but I haven't been able to get very deep, so many of the inner-workings are over my head.

I'm pretty sure something's going on not in my code specifically, but either in the company's lib, or in kafka and just trying to get to the bottom of it. It's been rather bothersome, and I'm not the only one experiencing issues similar to this, but for now, randomizing the IDs works for tests. It's QA's problem now, right? ;-)

We managed to easily reproduce this error in our environment by killing the consumer process and starting it again quickly.

We noticed that whenever our server restarted and we tried reconnecting before ZooKeeper killed the connection (session) on its side, this exception would be thrown.

To know that zookeeper killed the connection, look for a message that looks like the following:

INFO  [SessionTracker:ZooKeeperServer@347] - Expiring session 0x14eb7c676540001, timeout of 30000ms exceeded

So far we have managed to avoid any FailedToRebalanceConsumerError: Exception: NODE_EXISTS[-110] error on server restarts just by delaying any reconnection by at least this session timeout. You can find this value for your server in your ZooKeeper config file, under the maxSessionTimeout param - http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html

Also, this behavior is consistent with what @CWSpear reported, as randomizing the clientId will force zookeeper to create a new session for every new connection attempt and the exception would not be thrown. But that is far from ideal, as the clientId is what will be used to keep track of your committed offsets...

We are still observing whether it will occur randomly; if it does, maybe a similar approach should be taken with the rebalance logic. Anyway, will keep you posted.

I managed to get rid of the issue by setting

maxSessionTimeout=5000

and delaying the client connection by 6 seconds. Just use setTimeout() for that, and it will work just fine.
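The delay described in the last two comments could look roughly like the sketch below. It is only an illustration: `startConsumer` is a hypothetical function that creates the client and consumer, and the timeout values are assumptions that should be replaced with whatever your ZooKeeper server actually logs when it expires a session.

```javascript
// Sketch: wait out the ZooKeeper session timeout before (re)creating the consumer.
// The numbers passed in are assumptions; take the real timeout from your own
// "Expiring session ... timeout of N ms exceeded" log line or zoo.cfg.

// Delay to wait before reconnecting: the session timeout plus a safety margin.
function reconnectDelayMs(sessionTimeoutMs, marginMs) {
  return sessionTimeoutMs + marginMs;
}

// `startConsumer` is a hypothetical function that builds the client and consumer.
// `schedule` defaults to setTimeout and can be swapped out for testing.
function startAfterSessionExpiry(startConsumer, sessionTimeoutMs, marginMs, schedule = setTimeout) {
  const delay = reconnectDelayMs(sessionTimeoutMs, marginMs);
  return schedule(startConsumer, delay);
}
```

With maxSessionTimeout=5000 and a margin of 1000 ms, `startAfterSessionExpiry(startConsumer, 5000, 1000)` waits the 6 seconds mentioned above before connecting.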

@Ovidiu-S I could not find any maxSessionTimeout in the docs or code... do you mean sessionTimeout from node-zookeeper-client?

@felipesabino I am referring to the zookeeper server config, not the client. It is the maxSessionTimeout in the zoo.cfg file

+1

Any update? I'm facing the same problem :(

@barock19 I recommend switching to Kafka 0.9 and the new (zookeeper free) client, when it releases : https://github.com/oleksiyk/kafka

Until then ... just bypass the re-balancing issue with the above fix.

This problem went away when I added a handler for the CTRL+C case. This ensures the consumer/client is cleaned up; otherwise you are at the mercy of whenever the ZooKeeper node times out.

process.on('SIGINT', function() {
    highLevelConsumer.close(true, function(){
        process.exit();
    })
});

As suggested by @hyperlink, the problem is down to the fact that the ephemeral nodes are not relinquished in ZooKeeper when issuing a Ctrl-C (SIGINT). Under normal failure cases the nodes are released as expected.

Moving to Kafka 0.9 will require wholesale changes to the node client. However, I believe the Kafka guys are creating a Node client, so it might be that we can simply switch to using that when available.

I'm having the same issue. I tried changing the zoo.cfg maxSessionTimeout and also closing the high-level consumer before SIGINT. I also tried closing the client itself. Same result.

Using the suggested handler with a small modification fixed the issue for me:

  • Add a connection.close() in the callback
  • Put the process.exit() in the connection.close() callback.
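A minimal sketch of that modified handler follows. It assumes that `connection` in the two steps above refers to the kafka-node client object, which the comment does not actually say; `shutdown` and its `exit` parameter are names made up for this illustration.

```javascript
// Sketch: close the consumer, then the client, and exit only once both are done.
// Assumption: `consumer` is the HighLevelConsumer and `client` is the kafka-node
// client it was created with. `exit` is injectable so the function can be tested.
function shutdown(consumer, client, exit = () => process.exit()) {
  // close(true, cb) is the same call used in the SIGINT handler earlier in the thread.
  consumer.close(true, function () {
    client.close(function () {
      exit();
    });
  });
}

// Usage (hypothetical variable names):
// process.on('SIGINT', function () {
//   shutdown(highLevelConsumer, client);
// });
```

Exiting from the innermost callback means the process does not terminate until both close calls have reported back.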

@hyperlink's method works fine for me.

UPDATE: It still happens, and I have found out the real problem; please refer to #369.

Same issue on my side. And I can't catch the SIGINT because AWS Elastic Beanstalk somehow does not send one. I'm pretty sure @springuper's PR might fix this.

I have seen this issue occur when the client (ZooKeeper) loses its connection and regains it soon after.
Rebalance logic should account for zookeeper session time out as already specified here https://github.com/SOHU-Co/kafka-node/issues/90#issuecomment-123893422
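One way to read "account for the session timeout" is to keep retrying for at least as long as the stale session can survive. The sketch below is not kafka-node's rebalance code; `operation` is a hypothetical stand-in for whatever registers the consumer in ZooKeeper, and both helper names are invented for illustration.

```javascript
// Number of attempts needed so that the time between the first and the last
// attempt is at least the session timeout.
function attemptsToCover(sessionTimeoutMs, retryDelayMs) {
  return Math.ceil(sessionTimeoutMs / retryDelayMs) + 1;
}

// Generic callback-style retry. `operation(cb)` is a hypothetical stand-in for
// the registration step that fails with NODE_EXISTS; `schedule` defaults to
// setTimeout and can be replaced in tests.
function retry(operation, attempts, retryDelayMs, done, schedule = setTimeout) {
  operation(function (err, result) {
    if (!err) return done(null, result);
    if (attempts <= 1) return done(err); // out of attempts, report the last error
    schedule(function () {
      retry(operation, attempts - 1, retryDelayMs, done, schedule);
    }, retryDelayMs);
  });
}
```

For a 30000 ms session timeout and a 1000 ms retry delay, `attemptsToCover(30000, 1000)` gives 31 attempts, which is in the same range as the 24 retries reported earlier in the thread.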
