Node-rdkafka: Consumer stops consuming after broker transport failure

Created on 21 Dec 2018 · 14 Comments · Source: Blizzard/node-rdkafka

Hi,

We are encountering a problem where consumers stop delivering new messages to the 'data' listener.
This seems to happen after a broker becomes temporarily unavailable (broker transport failure), but only rarely. We have observed it on several different consumers on different topics with similar configurations, seemingly at random (most of the time the consumers resume normal operation after a broken broker connection).

The consumer is still synchronized with its consumer group (which consists of a single consumer for one topic with 5 partitions). The high offsets increase as new messages arrive on the partitions, but the consumer lag keeps growing and the messages never seem to reach the consumer.

We observed this sequence of events, where all partitions of a topic stopped consuming:

This 'event.error' seems to indicate the beginning of the problem: Error: broker transport failure
After this, no stats are logged again, although they were being logged every second before that.
10 seconds after the error, the consumer stops fetching every partition of the topic, with these two event logs happening for each partition:

{ severity: 7, fac: 'FETCH' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Topic TOPIC_NAME [3] in state active at offset 39611 (10/10 msgs, 0/40960 kb queued, opv 6) is not fetchable: queued.min.messages exceeded

{ severity: 7, fac: 'FETCHADD' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Removed TOPIC_NAME [3] from fetch list (0 entries, opv 6)

This happens at a time when no new messages are available (partitions with infrequent messages that appear at set times in this test environment), and the 'data' listener function does not receive any message, so it is not clear to us why the queue would be full.
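
Because the stats stop being logged once the consumer stalls, the absence of 'event.stats' events could serve as a liveness signal. The following is a minimal sketch of that idea, not a fix for the underlying problem. StallWatchdog and attachWatchdog are hypothetical helpers that are not part of node-rdkafka, and the sketch assumes 'statistics.interval.ms' is set (as in the configuration below) so that 'event.stats' fires regularly on a healthy consumer.

```javascript
// Hypothetical helper (not part of node-rdkafka): reports a stall when no
// sign of life has been recorded for longer than maxSilenceMs.
class StallWatchdog {
  constructor({ maxSilenceMs, onStall, now = Date.now }) {
    this.maxSilenceMs = maxSilenceMs;
    this.onStall = onStall;
    this.now = now; // injectable clock, mainly for testing
    this.lastSeen = now();
    this.timer = null;
  }

  // Record a sign of life from the consumer.
  touch() {
    this.lastSeen = this.now();
  }

  // Returns true (and calls onStall) if the silence exceeded the threshold.
  check() {
    const silentFor = this.now() - this.lastSeen;
    if (silentFor <= this.maxSilenceMs) return false;
    // Reset so onStall fires at most once per silence window.
    this.lastSeen = this.now();
    this.onStall(silentFor);
    return true;
  }

  // Run check() periodically without keeping the process alive.
  start(intervalMs) {
    this.stop();
    this.timer = setInterval(() => this.check(), intervalMs);
    this.timer.unref();
  }

  stop() {
    if (this.timer) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}

// Assumption: `consumer` is a node-rdkafka KafkaConsumer (or any event
// emitter) that emits 'event.stats' while it is healthy.
function attachWatchdog(consumer, watchdog) {
  consumer.on('event.stats', () => watchdog.touch());
}
```

With stats emitted every second, a threshold of several seconds (for example maxSilenceMs: 10000) would leave some margin. What onStall should do is a separate decision: logging an alert, or exiting the process so that a supervisor restarts it, are two options. Neither has been verified here as a reliable recovery.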

Probably linked to #182.

Environment Information

OS: Debian Stretch
Node Version: 8.11.0
node-rdkafka version: 2.4.2

Consumer configuration
'api.version.request': true,
'message.max.bytes': 150 * 1024 * 1024, // 150 MB
'receive.message.max.bytes': messageMaxBytes * 1.3,
// Logging
'log.connection.close': true,
'statistics.interval.ms': 1000,
// Consumer-specific rdkafka settings
'group.id': group_id,
'auto.commit.interval.ms': 2000,
'enable.auto.commit': true,
'enable.auto.offset.store': true,
'enable.partition.eof': false,
'fetch.wait.max.ms': 100,
'fetch.min.bytes': 1,
'fetch.message.max.bytes': 20 * 1024 * 1024, // 20 MB
'fetch.error.backoff.ms': 0,
'heartbeat.interval.ms': 1000,
'queued.min.messages': 10,
'queued.max.messages.kbytes': Math.floor(40 * 1024), // 40 MB
'session.timeout.ms': 7000,

stale

Giska

👍24

Most helpful comment

@webmakersteve just pinging here too, since this problem is tracked in multiple issues, and in my opinion it's pretty critical, since recovering from it in prod environments is not easy.

carlessistare on 11 Apr 2019

👍9

All 14 comments

Same behaviour and the same "broker transport failure" error. The consumer stops and we can see the lag build up on the topic as a result. We have to restart the whole thing.

bobzsj87 on 20 Mar 2019

👍7

@webmakersteve +1
This issue has been popping up in our prod environment since we started using this connector.
Most of the time the connector recovers, but every once in a while it becomes unresponsive.
So each day, we have at least one consumer stopping at a random time of day.

ivan83 on 12 Apr 2019

👍3

@carlessistare IMHO there is a bug in librdkafka. My observations suggest that the thread stops consuming inside the library. An indirect sign of this is the "solution" in issue #222.

mvtm-dn on 15 Apr 2019

👍4

Same issue at our side. Has anybody got a working solution for this? This is extremely critical now for our project.

smaheshw on 24 Apr 2019

👍3

We are also facing the same issue. Is there any fix for it?

aakashkharche04 on 9 May 2019

I'm also facing the same issue. This is a critical issue which has to be fixed. Is there a working solution?

RaajBadra on 11 May 2019

Hello, is there any update on this issue, or a possible workaround?

danielAnguloG on 16 Jul 2019

We are also facing the same issue. Should we switch to non-flowing mode for the time being, until a fix is available?

cravi24 on 5 Aug 2019
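
For readers unfamiliar with the non-flowing mode mentioned above: instead of relying on the 'data' event, the application requests messages itself with consumer.consume(count, callback). The following is a minimal sketch of such a polling loop. pollOnce and startPolling are hypothetical helper names, the batch size and delay are arbitrary example values, and nothing in this thread establishes that non-flowing mode avoids the stall.

```javascript
// Hypothetical helper: request up to `batchSize` messages once.
// Assumption: `consumer` is a connected, subscribed node-rdkafka
// KafkaConsumer; consume(count, cb) calls cb(err, messages).
function pollOnce(consumer, batchSize, handleMessage, done) {
  consumer.consume(batchSize, (err, messages) => {
    if (err) return done(err, 0);
    for (const message of messages) handleMessage(message);
    done(null, messages.length);
  });
}

// Hypothetical helper: keep polling until the returned function is called.
function startPolling(consumer, { batchSize = 10, idleDelayMs = 100, handleMessage, onError }) {
  let running = true;
  const tick = () => {
    if (!running) return;
    pollOnce(consumer, batchSize, handleMessage, (err, count) => {
      if (err) onError(err);
      // Poll again right away after a non-empty batch; otherwise wait briefly.
      const delay = !err && count > 0 ? 0 : idleDelayMs;
      setTimeout(tick, delay);
    });
  };
  tick();
  return () => { running = false; };
}
```

A caller would invoke startPolling from the consumer's 'ready' handler after subscribing, in place of calling consumer.consume() with no arguments, and keep the returned function to stop the loop on shutdown.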

Is there any progress on this?

NeoyeElf on 22 Aug 2019

Check the librdkafka release notes, might be time to upgrade the librdkafka provided by node-rdkafka.
https://github.com/edenhill/librdkafka/releases

edenhill on 22 Aug 2019

Had the same issue. I first added a restart every N minutes to my app, then switched to another lib, which is quite good for consuming messages but slow for producing. Here I compared them

funduck on 28 Aug 2019

👍2

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] on 26 Nov 2019

We are noticing a similar issue. It seems like an update to the version of librdkafka that is used by this module might be worth a try. Is there anything the community can do to help move that along?