How to reproduce
Expected behavior
"select count()" should gives 3000000 rows in table with MergeTree, but it gives less even after stream_flush_interval_ms passes (2999889, for example)
Additional context
@alexm93 try experimenting with kafka_max_block_size = 1. We ran into the same issue and temporarily resolved it by doing so. Also commented on https://github.com/yandex/ClickHouse/issues/4736
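For reference, a sketch of where that setting goes; the table definition is hypothetical and mirrors the one above, only the last setting is the workaround:

```sql
-- Kafka engine table with the workaround applied (names and schema are assumed)
CREATE TABLE kafka_events
(
    id UInt64,
    value String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse_events',
    kafka_format = 'JSONEachRow',
    kafka_max_block_size = 1;  -- flush after every message; much slower, but reduced the losses for us
```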
I observe similar issues.
Pushing a 1M-event burst is lossless.
Pushing a 10M-event burst results in partial loss, with rows missing from the materialized view.
kafka_max_block_size = 1 significantly slows down the population rate but still loses tens of thousands of rows.
I am detecting missing rows by comparing the row count on the Kafka producer against the materialized view, as in the sketch below. E.g. 11000000 events go in, 10982699 rows come out in the MV. At smaller bursts the tally is exact.
I have tried MergeTree and Memory engines for the MV.
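A sketch of that comparison; it assumes the events carry a monotonically increasing id, which is an assumption and not stated above:

```sql
-- Compare what landed in the target table against what was produced.
-- The produced count (e.g. 11000000) is known on the producer side.
SELECT
    count() AS rows_in_mv,
    max(id) - min(id) + 1 AS expected_if_contiguous,
    (max(id) - min(id) + 1) - count() AS missing_rows
FROM events;
-- e.g. rows_in_mv = 10982699 for 11000000 produced events
```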
Confirmed the problem. Trying to fix.
@abyss7 thanks! While you're dealing with that, do you mind explaining or defining CH behaviour if it can't process kafka events as quickly as they are being published?
With a large rate mismatch (continuous 10M/s publish, CH processing only about 1M/s), I find that roughly 90% of the data seems to get black-holed, never to be seen again either in CH or on the queue. CH should not consume the queue faster than it can actually process it, and that should be part of the guarantee that every single message is eventually processed.
In theory it's a plausible scenario: there may be a situation where a single Kafka message contains 1M rows, and once CH has read it and started inserting rows via the MV, CH has already marked it as committed. So if CH crashes halfway through, the rest of the rows will never be read again. If that's not the case, I suggest discussing it in another issue, or in the Telegram chat if convenient.
Looks like a duplicate of #4736