Node-rdkafka: Consumer stops consuming after broker transport failure

Created on 21 Dec 2018 · 14 Comments · Source: Blizzard/node-rdkafka

Hi,

We are encountering a problem where consumers stop delivering new messages to the 'data' listener.
This seems to happen after a broker becomes temporarily unavailable (broker transport failure), but only rarely. We have observed it on several different consumers, on different topics with similar configurations, seemingly at random (most of the time the consumers resume operation after a broken broker connection).

The consumer is still synchronized with its consumer group (which consists of a single consumer for one topic with 5 partitions). The high offsets increase as new messages arrive on the partitions, but the consumer lag keeps increasing and the messages are seemingly never consumed.
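For readers who want to track the symptom described above, here is a minimal sketch of the lag arithmetic: per-partition lag is the high offset minus the consumer's next offset to read. The helper name `computeLag` and its two plain-object inputs are assumptions made for illustration; how the offsets are obtained from the client is left out on purpose.

```javascript
// Hypothetical helper, not part of node-rdkafka.
// highOffsets: { [partition]: high offset reported for that partition }
// positions:   { [partition]: next offset the consumer would read }
function computeLag(highOffsets, positions) {
  const lag = {};
  for (const partition of Object.keys(highOffsets)) {
    const position = positions[partition];
    // A partition without a known position is reported as null rather than guessed.
    lag[partition] = Number.isInteger(position)
      ? Math.max(0, highOffsets[partition] - position)
      : null;
  }
  return lag;
}
```

A lag that grows on every partition while the group membership stays healthy would match the behaviour reported here.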

We observed this sequence of events, where all partitions of a topic stopped consuming:

  • This 'event.error' seems to indicate the beginning of the problem: Error: broker transport failure

  • After this, no stats are logged again, although they were being logged every second before that.

  • 10 seconds after the error, the consumer stops fetching every partition of the topic, with these two event logs happening for each partition:

{ severity: 7, fac: 'FETCH' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Topic TOPIC_NAME [3] in state active at offset 39611 (10/10 msgs, 0/40960 kb queued, opv 6) is not fetchable: queued.min.messages exceeded

{ severity: 7, fac: 'FETCHADD' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Removed TOPIC_NAME [3] from fetch list (0 entries, opv 6)

  • This happens at a time when no new messages are available (partitions with infrequent messages that appear at set times in this test environment), and the 'data' listener function does not receive any message, so it is not clear to us why the queue would be full.
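Since the stats stop being logged once the problem starts, one possible way to detect the stalled state is to watch for silence on the 'event.stats' and 'data' events. The sketch below is only an illustration under that assumption: `createStallDetector` is a hypothetical helper, not a node-rdkafka API, and it only relies on the consumer being an event emitter with 'statistics.interval.ms' enabled. The `now` parameter exists so the clock can be replaced in tests.

```javascript
// Hypothetical helper, not part of node-rdkafka.
// Records the time of the last 'event.stats' or 'data' event and reports
// a stall when neither has been seen for longer than maxSilenceMs.
function createStallDetector(consumer, maxSilenceMs, now = Date.now) {
  let lastActivity = now();
  const touch = () => {
    lastActivity = now();
  };

  consumer.on('event.stats', touch);
  consumer.on('data', touch);

  return {
    // True when the consumer has been silent for too long.
    isStalled() {
      return now() - lastActivity > maxSilenceMs;
    },
    // Detach the listeners, e.g. before disconnecting the consumer.
    stop() {
      consumer.removeListener('event.stats', touch);
      consumer.removeListener('data', touch);
    },
  };
}
```

With stats emitted every second as in this report, a caller could check `isStalled()` on a timer with a threshold of a few seconds and then restart the consumer or the process. Whether a restart is the right recovery is a judgment call, not something established in this thread.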

Probably linked to #182.

Environment Information

  • OS: Debian Stretch
  • Node Version: 8.11.0
  • node-rdkafka version: 2.4.2

Consumer configuration
'api.version.request': true,
'message.max.bytes': 150 * 1024 * 1024, // 150 MB
'receive.message.max.bytes': messageMaxBytes * 1.3,
// Logging
'log.connection.close': true,
'statistics.interval.ms': 1000,
// Consumer-specific rdkafka settings
'group.id': group_id,
'auto.commit.interval.ms': 2000,
'enable.auto.commit': true,
'enable.auto.offset.store': true,
'enable.partition.eof': false,
'fetch.wait.max.ms': 100,
'fetch.min.bytes': 1,
'fetch.message.max.bytes': 20 * 1024 * 1024, // 20 MB
'fetch.error.backoff.ms': 0,
'heartbeat.interval.ms': 1000,
'queued.min.messages': 10,
'queued.max.messages.kbytes': Math.floor(40 * 1024), // 40 MB
'session.timeout.ms': 7000,

stale

All 14 comments

Same behaviour and same "broker transport failure" error. The consumer stops, and we can see the resulting lag on the topic. We have to restart the whole thing.

@webmakersteve just pinging here too, since this problem is tracked in multiple issues and, in my opinion, it's pretty critical: recovering from it in prod environments is not easy.

@webmakersteve +1
This issue has been popping up in our prod environment since we started using this connector.
Most of the time the connector recovers, but every once in a while it becomes unresponsive.
So each day, we have at least one consumer stopping at a random time of day.

@carlessistare IMHO there is a bug in librdkafka. My observations suggest that the thread stops consuming inside the library. An indirect sign of this is the "solution" to issue #222.

Same issue on our side. Has anybody got a working solution for this? This is now extremely critical for our project.

We are also facing the same issue. Is there any fix for it?

I'm also facing the same issue. This is a critical issue which has to be fixed. Is there a working solution?

Hello, is there any update on this issue, or a possible workaround?

We are also facing the same issue. Should we use non-flowing mode for the time being, until a fix is available?
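For context on the question above, this is a minimal sketch of one poll cycle in non-flowing mode, where the application asks for a batch explicitly through `consumer.consume(size, callback)` instead of relying on the continuous 'data' stream. The helper name `pollOnce` and its callback parameters are assumptions made for illustration, and nothing in this thread confirms that this mode avoids the stall.

```javascript
// Hypothetical helper, not part of node-rdkafka.
// Requests up to batchSize messages and hands each one to onMessage.
// Errors are passed to onError instead of being thrown.
function pollOnce(consumer, batchSize, onMessage, onError) {
  consumer.consume(batchSize, (err, messages) => {
    if (err) {
      onError(err);
      return;
    }
    messages.forEach(onMessage);
  });
}
```

A caller would typically invoke `pollOnce` on a timer once the consumer is ready and subscribed. An empty batch or an error then becomes visible to the application on every cycle, which at least makes a stall easier to notice.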

Is there any progress on this?

Check the librdkafka release notes; it might be time to upgrade the librdkafka provided by node-rdkafka.
https://github.com/edenhill/librdkafka/releases

I had the same issue. I first added a restart every N minutes to my app, then switched to another library, which is quite good for consuming messages but slow for producing. Here I compared them

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

We are noticing a similar issue. It seems like an update to the version of librdkafka that is used by this module might be worth a try. Is there anything the community can do to help move that along?
