Describe the bug
ClickHouse Kafka consumer stops consuming from Kafka and then resumes by itself after some time (anywhere from 0 up to 2 hours)... Restarting the ClickHouse docker container helps immediately. The problem started on 19.9.2.4; 19.8.3.8 works fine. The configuration (.conf and tables) is the same for both versions.
How to reproduce
The problem appears with the official docker image yandex/clickhouse-server:19.9.2.4,
while the official docker image yandex/clickhouse-server:19.8.3.8 works fine.
Running on a single server (no sharding/replication). There are 20 Kafka topic -> MV -> table definitions in total, pulling from 2 Kafka servers with at most 2 partitions per topic. The Kafka engine is used with kafka_num_consumers = 1:
ENGINE = Kafka SETTINGS kafka_broker_list = '', kafka_topic_list = 'eventdata', kafka_group_name = '', kafka_format = 'CSV', kafka_num_consumers = 1, kafka_row_delimiter = '\n', kafka_skip_broken_messages = 1;
The Kafka broker list is defined in kafka.conf (see below).
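For reference, each of the 20 pipelines looks roughly like this - a sketch only, with illustrative table and column names rather than the real schema; broker list and group name are left empty because they come from kafka.conf:

```
-- Kafka source table (one per topic; illustrative columns)
CREATE TABLE eventdata_queueraw
(
    event_time DateTime,
    payload    String
) ENGINE = Kafka
SETTINGS kafka_broker_list = '',          -- taken from kafka.conf (see below)
         kafka_topic_list = 'eventdata',
         kafka_group_name = '',           -- taken from kafka.conf (see below)
         kafka_format = 'CSV',
         kafka_num_consumers = 1,
         kafka_row_delimiter = '\n',
         kafka_skip_broken_messages = 1;

-- Target storage table
CREATE TABLE eventdata_raw
(
    event_time DateTime,
    payload    String
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY event_time;

-- Materialized view that moves rows from the Kafka table into storage
CREATE MATERIALIZED VIEW eventdata_mv TO eventdata_raw AS
SELECT event_time, payload
FROM eventdata_queueraw;
```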
Expected behavior
Same behavior after CH version change.
Error message and/or stacktrace
clickhouse-server.err.log
2019.07.03 07:11:12.581555 [ 1 ] {} <Error> Application: Listen [::]:8123 failed: 0: DNS error: EAI: -9 If it is an IPv6 or IPv4 address and your host has disabled IPv6 or IPv4, then consider to specify not disabled IPv4 or IPv6 address to listen in <listen_host> element of configuration file. Example for disabled IPv6: <listen_host>0.0.0.0</listen_host> . Example for disabled IPv4: <listen_host>::</listen_host>
2019.07.03 07:26:32.664575 [ 39 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.9.2.4 (official build)
2019.07.03 07:27:23.068116 [ 43 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.9.2.4 (official build)
2019.07.03 07:33:12.821698 [ 39 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.9.2.4 (official build)
Additional context
clickhouse-server.log
2019.07.03 07:26:32.309536 [ 40 ] {} <Trace> StorageKafka (imps_raw_queueraw): Polled batch of 165 messages
2019.07.03 07:26:32.318413 [ 43 ] {} <Trace> StorageKafka (dsprtbbids_raw_queueraw): Polled batch of 92 messages
2019.07.03 07:26:32.318422 [ 41 ] {} <Trace> StorageKafka (prebiddetaileventdata_queueraw): Polled batch of 4821 messages
2019.07.03 07:26:32.407451 [ 32 ] {} <Debug> StorageKafka (dspimpsvisibility_raw_queueraw): Started streaming to 1 attached views
2019.07.03 07:26:32.407450 [ 34 ] {} <Debug> StorageKafka (dspsegments_raw_queueraw): Started streaming to 1 attached views
2019.07.03 07:26:32.408022 [ 35 ] {} <Trace> StorageKafka (dspconversions_attributed_raw_queueraw): Polled batch of 0 messages
2019.07.03 07:26:32.408072 [ 35 ] {} <Trace> BlockInputStreamFromRowInputStream: Skipped 0 rows with errors while reading the input stream
2019.07.03 07:26:32.448213 [ 31 ] {} <Trace> StorageKafka (ssprtbbids_raw_queueraw): Polled batch of 198 messages
2019.07.03 07:26:32.469842 [ 35 ] {} <Trace> StorageKafka (dspconversions_attributed_raw_queueraw): Committed offset -1001 (topic: dspconversions_attributed_raw, partition: 0)
2019.07.03 07:26:32.469886 [ 35 ] {} <Trace> StorageKafka (dspconversions_attributed_raw_queueraw): Committed offset -1001 (topic: dspconversions_attributed_raw, partition: 1)
2019.07.03 07:26:32.470060 [ 35 ] {} <Trace> StorageKafka (dspconversions_attributed_raw_queueraw): Execution took 562 ms.
2019.07.03 07:26:32.470127 [ 35 ] {} <Debug> StorageKafka (dsprtbcampaigneligibility_raw_queueraw): Started streaming to 1 attached views
2019.07.03 07:26:32.483302 [ 33 ] {} <Trace> StorageKafka (ssprtbrequests_raw_queueraw): Polled batch of 469 messages
2019.07.03 07:26:32.518467 [ 38 ] {} <Trace> StorageKafka (imps_ml_raw_queueraw): Polled batch of 392 messages
2019.07.03 07:26:32.564322 [ 30 ] {} <Trace> StorageKafka (sspimps_raw_queueraw): Polled batch of 267 messages
2019.07.03 07:26:32.620490 [ 42 ] {} <Trace> StorageKafka (ssprtbadirequests_raw_queueraw): Polled batch of 126 messages
2019.07.03 07:26:32.664157 [ 39 ] {} <Trace> StorageKafka (clks_raw_queueraw): Re-joining claimed consumer after failure
2019.07.03 07:26:32.664575 [ 39 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.9.2.4 (official build)
2019.07.03 07:26:32.664677 [ 39 ] {} <Trace> StorageKafka (clks_raw_queueraw): Execution took 6011 ms.
2019.07.03 07:26:32.734845 [ 36 ] {} <Trace> StorageKafka (dspshop_events_raw_queueraw): Committed offset 4563474446 (topic: dspshop_events_raw, partition: 0)
2019.07.03 07:26:32.734883 [ 36 ] {} <Trace> StorageKafka (dspshop_events_raw_queueraw): Committed offset 4479299920 (topic: dspshop_events_raw, partition: 1)
2019.07.03 07:26:32.735971 [ 36 ] {} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2019.07.03 07:26:32.742137 [ 36 ] {} <Trace> test5dsp.dspshop_events_raw_raw: Renaming temporary part tmp_insert_20190703_285_285_0 to 20190703_382482_382482_0.
2019.07.03 07:26:32.743435 [ 36 ] {} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
2019.07.03 07:26:32.743566 [ 36 ] {} <Trace> Aggregator: Aggregating
2019.07.03 07:26:32.743632 [ 36 ] {} <Trace> Aggregator: Aggregation method: serialized
2019.07.03 07:26:32.743669 [ 36 ] {} <Trace> Aggregator: Aggregated. 10 to 7 rows (from 0.001 MiB) in 0.000 sec. (129720.194 rows/sec., 18.705 MiB/sec.)
docker log with Kafka debug all:
{"log":"%7|1562138792.662|RECV|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Received FetchResponse (v4, 73 bytes, CorrId 8515, rtt 202.16ms)\n","stream":"stderr","t
ime":"2019-07-03T07:26:32.663047993Z"}
{"log":"%7|1562138792.662|FETCH|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Topic dsprtbcampaigneligibility_raw [1] MessageSet size 0, error \"Success\", MaxOffset
21257146, Ver 6/6\n","stream":"stderr","time":"2019-07-03T07:26:32.663054736Z"}
{"log":"%7|1562138792.662|FETCH|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Fetch topic dsprtbcampaigneligibility_raw [1] at offset 21257146 (v6)\n","stream":"stde
rr","time":"2019-07-03T07:26:32.663061956Z"}
{"log":"%7|1562138792.662|FETCH|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Fetch 1/1/1 toppar(s)\n","stream":"stderr","time":"2019-07-03T07:26:32.663069107Z"}
{"log":"%7|1562138792.663|SEND|ClickHouse 19.9.2.4#consumer-6| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Sent FetchRequest (v4, 105 bytes @ 0, CorrId 8516)\n","stream":"stderr","time":"2019-07-
03T07:26:32.663103313Z"}
{"log":"%7|1562138792.663|RECV|ClickHouse 19.9.2.4#consumer-14| [thrd:GroupCoordinator]: GroupCoordinator/2: Received HeartbeatResponse (v0, 2 bytes, CorrId 539, rtt 100.45ms)\n","stream":"stderr","time":
"2019-07-03T07:26:32.663254983Z"}
{"log":"%7|1562138792.663|RECV|ClickHouse 19.9.2.4#consumer-18| [thrd:GroupCoordinator]: GroupCoordinator/2: Received HeartbeatResponse (v0, 2 bytes, CorrId 535, rtt 92.35ms)\n","stream":"stderr","time":"
2019-07-03T07:26:32.663423782Z"}
{"log":"%7|1562138792.664|CGRPOP|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\" received op SUBSCRIBE (v0) in state up (join state started, v9 vs 0)\n","stream":"stderr","time":"2019-07-03T0
7:26:32.664361626Z"}
{"log":"%7|1562138792.664|SUBSCRIBE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": subscribe to new unset subscription of 0 topics (join state started)\n","stream":"stderr","time":"2019-07-
03T07:26:32.664374526Z"}
{"log":"%7|1562138792.664|UNSUBSCRIBE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": unsubscribe from current subscription of 1 topics (leave group=yes, join state started, v9)\n","stream":
"stderr","time":"2019-07-03T07:26:32.664378946Z"}
{"log":"%7|1562138792.664|SUBSCRIPTION|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": clearing subscribed topics list (1)\n","stream":"stderr","time":"2019-07-03T07:26:32.66438313Z"}
{"log":"%7|1562138792.664|SUBSCRIPTION|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": effective subscription list changed from 1 to 0 topic(s):\n","stream":"stderr","time":"2019-07-03T07:26
:32.664387171Z"}
{"log":"%7|1562138792.664|GRPLEADER|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": resetting group leader info: unsubscribe\n","stream":"stderr","time":"2019-07-03T07:26:32.664390882Z"}
{"log":"%7|1562138792.664|REBALANCE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\" is rebalancing in state up (join-state started) with assignment: unsubscribe\n","stream":"stderr","time":"2
019-07-03T07:26:32.664394693Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Library pausing 2 partition(s)\n","stream":"stderr","time":"2019-07-03T07:26:32.664398486Z"}
{"log":"%7|1562138792.664|BARRIER|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: clks_raw [0]: rd_kafka_toppar_op_pause_resume:2311: new version barrier v7\n","stream":"stderr","time":"2019-07-03T07:26:32.6
64401915Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Pause clks_raw [0] (v7)\n","stream":"stderr","time":"2019-07-03T07:26:32.664405503Z"}
{"log":"%7|1562138792.664|BARRIER|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: clks_raw [1]: rd_kafka_toppar_op_pause_resume:2311: new version barrier v7\n","stream":"stderr","time":"2019-07-03T07:26:32.6
64409703Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Pause clks_raw [1] (v7)\n","stream":"stderr","time":"2019-07-03T07:26:32.664418313Z"}
{"log":"%7|1562138792.664|ASSIGN|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\": delegating revoke of 2 partition(s) to application rebalance callback on queue rd_kafka_cgrp_new: unsubscribe
\n","stream":"stderr","time":"2019-07-03T07:26:32.664429672Z"}
{"log":"%7|1562138792.664|CGRPJOINSTATE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Group \"chraw\" changed join state started -\u003e wait-revoke-rebalance_cb (v9, state up)\n","stream":"stderr","time":
"2019-07-03T07:26:32.664436841Z"}
{"log":"%7|1562138792.664|OP|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: clks_raw [0] received op PAUSE (v7) in fetch-state active (opv6)\n","stream":"stderr","time":"2019-07-03T07:26:32.664446775Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Pause clks_raw [0]: at offset 32000421 (state active, v7)\n","stream":"stderr","time":"2019-07-03T07:26:32.664481843Z"}
{"log":"%7|1562138792.664|OP|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: clks_raw [1] received op PAUSE (v7) in fetch-state active (opv6)\n","stream":"stderr","time":"2019-07-03T07:26:32.664492452Z"}
{"log":"%7|1562138792.664|PAUSE|ClickHouse 19.9.2.4#consumer-4| [thrd:main]: Pause clks_raw [1]: at offset 31809479 (state active, v7)\n","stream":"stderr","time":"2019-07-03T07:26:32.66449908Z"}
{"log":"%7|1562138792.666|RECV|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Received FetchResponse (v4, 3160 bytes, CorrId 22694, rtt 146.84ms)\n","stream":"stderr"
,"time":"2019-07-03T07:26:32.666724972Z"}
{"log":"%7|1562138792.666|FETCH|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Topic dsprtbrequests_raw [1] MessageSet size 3098, error \"Success\", MaxOffset 1188248
3842, Ver 6/6\n","stream":"stderr","time":"2019-07-03T07:26:32.666747395Z"}
{"log":"%7|1562138792.666|CONSUME|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Enqueue 114 message(s) (18571 bytes, 114 ops) on dsprtbrequests_raw [1] fetch queue (
qlen 114, v6, last_offset 11882483841, 0 ctrl msgs)\n","stream":"stderr","time":"2019-07-03T07:26:32.666837521Z"}
{"log":"%7|1562138792.666|FETCH|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Fetch topic dsprtbrequests_raw [1] at offset 11882483842 (v6)\n","stream":"stderr","tim
e":"2019-07-03T07:26:32.666870301Z"}
{"log":"%7|1562138792.666|FETCH|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Fetch 1/1/1 toppar(s)\n","stream":"stderr","time":"2019-07-03T07:26:32.666878934Z"}
{"log":"%7|1562138792.666|SEND|ClickHouse 19.9.2.4#consumer-8| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Sent FetchRequest (v4, 94 bytes @ 0, CorrId 22695)\n","stream":"stderr","time":"2019-07-
03T07:26:32.666887044Z"}
{"log":"%7|1562138792.666|RECV|ClickHouse 19.9.2.4#consumer-7| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Received FetchResponse (v4, 10871 bytes, CorrId 11928, rtt 197.33ms)\n","stream":"stderr
","time":"2019-07-03T07:26:32.667089128Z"}
{"log":"%7|1562138792.667|FETCH|ClickHouse 19.9.2.4#consumer-7| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Topic dsprtbnotify_raw [0] MessageSet size 10811, error \"Success\", MaxOffset 59520521
68, Ver 6/6\n","stream":"stderr","time":"2019-07-03T07:26:32.667107892Z"}
{"log":"%7|1562138792.667|CONSUME|ClickHouse 19.9.2.4#consumer-7| [thrd:xxx.xxx.xxx.36:9092/2]: xxx.xxx.xxx.36:9092/2: Enqueue 294 message(s) (36662 bytes, 294 ops) on dsprtbnotify_raw [0] fetch queue (ql
en 294, v6, last_offset 5952052167, 0 ctrl msgs)\n","stream":"stderr","time":"2019-07-03T07:26:32.667421414Z"}
Current kafka.xml:
<?xml version="1.0"?>
<yandex>
    <kafka>
        <auto_offset_reset>earliest</auto_offset_reset>
        <bootstrap_servers>ct1.static.xxx.xx:9092,ct2.static.xxxxx.xx:9092</bootstrap_servers>
        <group_id>chraw</group_id>
        <debug>all</debug>
    </kafka>
</yandex>
It looks suspicious that you have kafka_num_consumers = 1 while multiple consumers are mentioned in the Kafka log.
Do you consume all 20 topics with a single consumer, or do you have 20 Kafka tables with 1 consumer each?
Do you have more logs for this particular one, StorageKafka (clks_raw_queueraw)?
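For the record, the Kafka table layout can be listed directly from system.tables - just a sketch using standard columns:

```
-- Enumerate all Kafka engine tables on the server
SELECT database, name, engine
FROM system.tables
WHERE engine = 'Kafka';
```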
Looks very similar to #4631 and should be fixed in v19.11+ - on v19.11 I can't reproduce the issue locally.
Also found and fixed in v19.11 the scenario when Kafka queue becomes empty for some time, but afterwards the CH doesn't read new messages.
We have 20 Kafka topics with 1 consumer each. The issue happens on all topics. The topic clks_raw_queueraw is a "slow" one... but we also have topics with 400000 events per minute, and the same thing happens there.
I added more logs (for the same minute, for StorageKafka (clks_raw_queueraw)) in case they are still needed - we have moved back to the old version, and I see you just committed a fix for the issue.
clickhouse-kafka-issue-5846.log
Is there any semi-public docker repository for non-stable versions, so that we could test 19.11-unstable?
@abyss7 Is there an ETA as to when v19.11+ will hit stable? The current solution is to restart the server when consumption stops, which is not ideal
The 19.8.* versions don't have this issue - maybe you can downgrade.
It goes to pre-stable this Monday, and to stable two weeks after that.
@hardik137 Downgrading is an option, but that requires a full cluster stop due to issues we've encountered with 19.8 and 19.9.
But that looks like the best solution right now.
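Until a fixed release lands, the restart workaround can at least be automated: a periodic staleness check against the target tables shows when a pipeline has stalled. A minimal sketch, assuming the target table has a DateTime column named event_time (adjust to the real schema):

```
-- Returns a row only when nothing has been inserted for more than 10 minutes
SELECT
    max(event_time) AS last_event,
    toUInt32(now()) - toUInt32(max(event_time)) AS lag_seconds
FROM eventdata_raw
HAVING lag_seconds > 600;
```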
We tried the newest versions containing the fix from #5852 to see if the problem is solved.
Unfortunately the PROBLEM REMAINS OPEN (19.8 works, but the newer 19.9+ versions DON'T).
We tried with both:
- 19.10.1.5 (official build) - from the official docker image
- 19.11.0.830 (official build) - from the prestable debian repo, then packaged into our own docker build
We also tried with
Kafka server logs:

KAFKA SERVER 1:

```
{"log":"[2019-07-14 10:16:07,498] DEBUG [Controller id=1] Topics not in preferred replica for broker 2 Map() (kafka.controller.KafkaController)\n","stream":"stdout","time":"2019-07-14T10:16:07.498691052Z"
}
{"log":"[2019-07-14 10:16:07,498] TRACE [Controller id=1] Leader imbalance ratio for broker 2 is 0.0 (kafka.controller.KafkaController)\n","stream":"stdout","time":"2019-07-14T10:16:07.498816431Z"}
{"log":"[2019-07-14 10:16:07,498] DEBUG [Controller id=1] Topics not in preferred replica for broker 1 Map() (kafka.controller.KafkaController)\n","stream":"stdout","time":"2019-07-14T10:16:07.499040584Z"
}
{"log":"[2019-07-14 10:16:07,498] TRACE [Controller id=1] Leader imbalance ratio for broker 1 is 0.0 (kafka.controller.KafkaController)\n","stream":"stdout","time":"2019-07-14T10:16:07.499059567Z"}
{"log":"[2019-07-14 10:21:07,499] TRACE [Controller id=1] Checking need to trigger auto leader balancing (kafka.controller.KafkaController)\n","stream":"stdout","time":"2019-07-14T10:21:07.500513602Z"}
{"log":"[2019-07-14 10:21:07,501] DEBUG [Controller id=1] Preferred replicas by broker Map(2 -\u003e Map(__consumer_offsets-21 -\u003e Vector(2, 1), __consumer_offsets-27 -\u003e Vector(2, 1), __consumer_
offsets-7 -\u003e Vector(2, 1), ssprtbwins_raw-0 -\u003e Vector(2, 1), __consumer_offsets-9 -\u003e Vector(2, 1), __consumer_offsets-25 -\u003e Vector(2, 1), __consumer_offsets-35 -\u003e Vector(2, 1), __
consumer_offsets-41 -\u003e Vector(2, 1), sspimps_raw-1 -\u003e Vector(2, 1), __consumer_offsets-33 -\u003e Vector(2, 1), __consumer_offsets-23 -\u003e Vector(2, 1), __consumer_offsets-49 -\u003e Vector(2
```

KAFKA SERVER 2:

```
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:15:48.827937959Z"}
{"log":"[2019-07-14 10:15:48,890] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-c9fa103b-c690-4551-8e04-3590eb78a6d2 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:15:48.89080723Z"}
{"log":"[2019-07-14 10:15:51,268] INFO [GroupCoordinator 2]: Stabilized group chraw generation 12950 (__consumer_offsets-23) (kafka.coordinator.group.GroupCoordinator)\n","stream":"stdout","time":"2019-07
-14T10:15:51.269145626Z"}
{"log":"[2019-07-14 10:15:51,273] INFO [GroupCoordinator 2]: Assignment received from leader for group chraw for generation 12950 (kafka.coordinator.group.GroupCoordinator)\n","stream":"stdout","time":"20
19-07-14T10:15:51.273415831Z"}
{"log":"[2019-07-14 10:15:51,603] INFO [GroupCoordinator 2]: Preparing to rebalance group chraw in state PreparingRebalance with old generation 12950 (__consumer_offsets-23) (reason: Adding new member Cli
ckHouse 19.11.0.830-c7f0a98a-0127-4fe9-9857-c27bfc40cf0b) (kafka.coordinator.group.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:15:51.603531286Z"}
{"log":"[2019-07-14 10:15:54,771] INFO [GroupCoordinator 2]: Stabilized group chraw generation 12951 (__consumer_offsets-23) (kafka.coordinator.group.GroupCoordinator)\n","stream":"stdout","time":"2019-07
-14T10:15:54.774319507Z"}
{"log":"[2019-07-14 10:15:54,795] INFO [GroupCoordinator 2]: Assignment received from leader for group chraw for generation 12951 (kafka.coordinator.group.GroupCoordinator)\n","stream":"stdout","time":"20
19-07-14T10:15:54.79722858Z"}
{"log":"[2019-07-14 10:19:08,923] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-c7f0a98a-0127-4fe9-9857-c27bfc40cf0b in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.924863614Z"}
{"log":"[2019-07-14 10:19:08,925] INFO [GroupCoordinator 2]: Preparing to rebalance group chraw in state PreparingRebalance with old generation 12951 (__consumer_offsets-23) (reason: removing member Click
House 19.11.0.830-c7f0a98a-0127-4fe9-9857-c27bfc40cf0b on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.927258648Z"}
{"log":"[2019-07-14 10:19:08,927] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-bdd24c2b-3124-4a45-983e-7d1bdb26c577 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.927593808Z"}
{"log":"[2019-07-14 10:19:08,932] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-7813b472-72b0-4090-a03b-9d9cfa4b36eb in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.93303024Z"}
{"log":"[2019-07-14 10:19:08,933] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-c8db1eff-e7e7-4408-b5c5-1b89d9426284 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.933507937Z"}
{"log":"[2019-07-14 10:19:08,949] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-7bf76679-754f-4654-98b0-f7a1f912f0e2 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.950108899Z"}
{"log":"[2019-07-14 10:19:08,950] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-6e8416f7-b290-4630-95c1-2802db462153 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.950515354Z"}
{"log":"[2019-07-14 10:19:08,952] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-9ea9dab5-b866-403f-8ba2-38df5dbf2700 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.95231104Z"}
{"log":"[2019-07-14 10:19:08,964] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-071d6255-474b-4701-9852-8c208345dba9 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.965185578Z"}
{"log":"[2019-07-14 10:19:08,965] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-83c6184b-e49d-4212-98ed-2229e95d4fc7 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:08.965458778Z"}
{"log":"[2019-07-14 10:19:09,026] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-37220998-9383-4756-ba77-a861a79a4009 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:09.02631275Z"}
{"log":"[2019-07-14 10:19:09,044] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-9194a5d2-0cc9-4488-b11a-fe4f4149bc5c in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:09.045187181Z"}
{"log":"[2019-07-14 10:19:09,084] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-982d84e2-9387-4ce5-a1c9-eae8b0095513 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:09.085424418Z"}
{"log":"[2019-07-14 10:19:09,191] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-a58ccd96-470d-4a30-bb6e-7582fc0694e5 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:09.192010341Z"}
{"log":"[2019-07-14 10:19:09,249] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-1df4db0e-266d-49e0-b8a7-cdba0e3bf152 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:09.250066465Z"}
{"log":"[2019-07-14 10:19:09,334] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-3450bc59-701a-475f-b4f4-f1cee6da749d in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:09.334891365Z"}
{"log":"[2019-07-14 10:19:09,350] INFO [GroupCoordinator 2]: Member ClickHouse 19.11.0.830-53051242-9c8b-43f9-a77e-a1785f650045 in group chraw has left, removing it from the group (kafka.coordinator.group
.GroupCoordinator)\n","stream":"stdout","time":"2019-07-14T10:19:09.352165002Z"}
{"log":"[2019-07-14 10:19:14,239] INFO [GroupCoordinator 2]: Stabilized group chraw generation 12952 (__consumer_offsets-23) (kafka.coordinator.group.GroupCoordinator)\n","stream":"stdout","time":"2019-07
-14T10:19:14.239825762Z"}
{"log":"[2019-07-14 10:19:14,243] INFO [GroupCoordinator 2]: Assignment received from leader for group chraw for generation 12952 (kafka.coordinator.group.GroupCoordinator)\n","stream":"stdout","time":"20
19-07-14T10:19:14.243329756Z"}
{"log":"[2019-07-14 10:23:15,818] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.group.GroupMetadataManager)\n","stream":"stdout","time":"2019-07-14
T10:23:15.818756143Z"}
```
clickhouse-server.err.log:
```
19.07.14 10:15:47.582133 [ 31 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:15:47.762684 [ 36 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:15:47.858116 [ 37 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:15:47.990137 [ 28 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:15:48.012767 [ 27 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:15:48.266206 [ 34 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:15:48.327135 [ 23 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:15:48.390321 [ 32 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.111417 [ 26 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.236088 [ 24 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.244023 [ 31 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.255541 [ 21 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.260048 [ 22 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.266288 [ 30 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.371950 [ 36 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.387675 [ 35 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.393853 [ 27 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.525112 [ 23 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.544411 [ 37 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.583748 [ 33 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.690600 [ 32 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.748842 [ 29 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.831557 [ 34 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:19:08.849200 [ 28 ] {} <Error> void DB::StorageKafka::streamThread(): std::exception. Code: 1001, type: cppkafka::HandleException, e.what() = Local: Timed out, version = 19.11.0.830 (officia
2019.07.14 10:47:07.235572 [ 33 ] {} <Error> StorageKafka (dsprtbcampaigneligibility_raw_queueraw): Consumer error: Local: Broker transport failure
```

and some bigger logs:
zcat clickhouse-server.log.gz | grep -F "14 10:19:" >clickhouse-server-log-1minute-1019.log
zcat clickhouse-server.log.gz | grep -F "14 10:18:" >clickhouse-server-log-1minute-1018.log
zcat *-json.log.gz | grep -F "2019-07-14T10:18:" >clickhouse-server-dockerlog-timefiltered-1018.log
zcat *-json.log.gz | grep -F "2019-07-14T10:19:" >clickhouse-server-dockerlog-timefiltered-1019.log
clickhouse-server-log-1minute-1019.log
clickhouse-server-log-1minute-1018.log
clickhouse-server-dockerlog-timefiltered-1019.log
clickhouse-server-dockerlog-timefiltered-1018.log
@josepowera thanks for the test, we'll look into the problem further.
Good day, colleagues!
Is there any news about the problem?
Got this bug on 19.11.3.11; nothing helps except recreating the table and engine.
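Since the already-consumed data lives in the target MergeTree tables, recreating only the Kafka table and its materialized view is enough; a detach/re-attach of the Kafka table is a lighter variant that also tears down and recreates the consumer and may be worth trying first. A sketch with the illustrative names from above:

```
-- Lighter variant: detach and re-attach the Kafka table to recreate its consumer
DETACH TABLE eventdata_queueraw;
ATTACH TABLE eventdata_queueraw;

-- If that is not enough, drop and recreate the Kafka table and the materialized
-- view (the MergeTree target table and its data stay untouched)
DROP TABLE eventdata_mv;
DROP TABLE eventdata_queueraw;
-- ... then re-run the CREATE TABLE / CREATE MATERIALIZED VIEW statements above
```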
@Rostish we are waiting for https://github.com/yandex/ClickHouse/pull/6075 to be released so we can give it a try.
19.11.4.24 is out on https://hub.docker.com/r/yandex/clickhouse-server/tags
We updated to this version half an hour ago, and we are currently experiencing issues with consuming messages; everything is stuck.
We have updated our ClickHouse too.
We are experiencing the following issue:
I have about 150 records in my Kafka topic, but ClickHouse keeps consuming them over and over; I already have about 500,000 rows in my table.
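A quick way to quantify the duplication is to compare total rows with distinct rows in the target table - just a sketch, assuming some column (here payload) identifies a source message:

```
-- Rough duplication factor: total rows vs. distinct source messages
SELECT
    count() AS total_rows,
    uniqExact(payload) AS distinct_rows,
    total_rows / distinct_rows AS dup_factor
FROM eventdata_raw;
```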
@abyss7
Hi!
Could you comment on the new bug?
I rolled back to 19.11.3.11.
The new bug was introduced by #6075 - it showed the breakage in tests, but it was merged nevertheless. I will do a proper fix ASAP.
Got this bug on 19.10.1.5 as well
Running 19.13.2.19 looks OK.