Is each batch insert an atomic operation when we use the batch insert operation in clickhouse-jdbc? For example, we run the "execute" function for a batch insert of size 1000 but get some exceptions while doing this. In clickhouse-jdbc terms, a batch is sent to the server in a single request. But in ClickHouse terms, do these 1000 records all fail, or do some succeed and some fail?
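For context, a minimal sketch of the kind of batch insert being asked about, assuming the clickhouse-jdbc driver is on the classpath and a hypothetical table t(id Int64, s String); the driver buffers the rows on the client and ships the whole batch to the server as a single INSERT request:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL and table; adjust to your environment.
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO t (id, s) VALUES (?, ?)")) {
            for (int i = 0; i < 1000; i++) {
                ps.setLong(1, i);
                ps.setString(2, "row-" + i);
                ps.addBatch();          // buffered on the client side
            }
            ps.executeBatch();          // the whole batch goes out as one INSERT request
        }
    }
}
```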
My table engine is ReplicatedMergeTree, and the records are inserted into the ReplicatedMergeTree table directly rather than through a distributed table. How does it work without transaction guarantees when those exceptions happen? Do these 1000 records all fail, or do some succeed and some fail?
It could be atomic if this insert creates only one part.
My table engine is ReplicatedMergeTree, and the records are inserted into the ReplicatedMergeTree table directly rather than through a distributed table. How does it work without transaction guarantees when those exceptions happen? Do these 1000 records all fail, or do some succeed and some fail?
https://clickhouse.tech/docs/en/operations/table_engines/replication/
_Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. The reason for this is in case of network failures when the client application doesn't know if the data was written to the DB, so the INSERT query can simply be repeated. It doesn't matter which replica INSERTs were sent to with identical data. INSERTs are idempotent. Deduplication parameters are controlled by merge_tree server settings._
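In other words, if the client does not know whether an INSERT reached the server, it can resend exactly the same rows and rely on this block deduplication. A rough sketch of that retry pattern, reusing the hypothetical table and URL from the earlier sketch:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class RetrySameBatch {
    // Re-sends exactly the same rows on failure. Because ReplicatedMergeTree deduplicates
    // an identical block (same rows in the same order), an insert that actually succeeded
    // but was not acknowledged is not duplicated by the retry.
    static void insertWithRetry(String url, long[] ids, int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Connection conn = DriverManager.getConnection(url);
                 PreparedStatement ps = conn.prepareStatement("INSERT INTO t (id) VALUES (?)")) {
                for (long id : ids) {
                    ps.setLong(1, id);
                    ps.addBatch();
                }
                ps.executeBatch();
                return; // success
            } catch (SQLException e) {
                if (attempt == maxAttempts) throw e;
                // outcome unknown: safe to retry with the identical batch
            }
        }
    }
}
```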
Thanks for your reply. I really appreciate it. As you mentioned that data blocks are deduplicated, I am just curious:
How does ClickHouse deduplicate the same rows without a so-called "primary key"? Is there an md5 operation for each row or something like that? Does it happen in the merge-tree merging stage?
Is the minimum deduplication unit a block or a row? In clickhouse-jdbc terms, does a batch equal a block when the insertion happens for the very first time within a batch insert?
Does the deduplication only happen on the same node, or also on replica nodes? What if I repeatedly insert the same data block into nodes that are picked in a random way?
The last one: what does MV mean? ^_^
Thanks for your time again. I really really appreciate it!! It will help us a lot to understand clickhouse better.
How does ClickHouse deduplicate the same rows without a so-called "primary key"? Is there an md5 operation for each row or something like that? Does it happen in the merge-tree merging stage?
You do an insert: insert into someREPLICATEDtable values (1),(2),(3);
CH calculates a checksum from someREPLICATEDtable|columns|types|1,2,3 and puts it into ZooKeeper. Every next insert is checked against it: same checksum or not. If it's a retried insert, it is simply skipped. Only the last 100 checksums (inserts) are stored.
Is the minimum deduplication unit a block or a row? In clickhouse-jdbc terms, does a batch equal a block when the insertion happens for the very first time within a batch insert?
The whole insert (and its blocks, in the sense that if an insert is separated into 3 blocks and you retry the insert, it will be separated into the same 3 blocks and they will be deduplicated (skipped) -> the whole insert will be skipped). An insert can be separated into blocks if it's bigger than max_insert_block_size (has more than 1 mil. rows) or if it inserts data into several partitions.
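In practice, if per-INSERT atomicity matters, one option is to cap each batch well below max_insert_block_size and make every batch target a single partition, so each INSERT maps to one block (and normally one part). A rough client-side sketch of that splitting, with a hypothetical row type and partition-key function:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchSplitter {
    // Groups rows by partition key and caps each chunk below max_insert_block_size,
    // so that each resulting INSERT produces a single block (and normally a single part).
    static <R> List<List<R>> splitForAtomicInserts(List<R> rows,
                                                   java.util.function.Function<R, Object> partitionKey,
                                                   int maxRowsPerInsert) {
        Map<Object, List<R>> byPartition = new HashMap<>();
        for (R row : rows) {
            byPartition.computeIfAbsent(partitionKey.apply(row), k -> new ArrayList<>()).add(row);
        }
        List<List<R>> chunks = new ArrayList<>();
        for (List<R> partitionRows : byPartition.values()) {
            for (int i = 0; i < partitionRows.size(); i += maxRowsPerInsert) {
                chunks.add(partitionRows.subList(i, Math.min(i + maxRowsPerInsert, partitionRows.size())));
            }
        }
        return chunks; // each chunk can be sent as one batch / one INSERT
    }
}
```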
Does the deduplication only happen on the same node, or also on replica nodes? What if I repeatedly insert the same data block into nodes that are picked in a random way?
On all replicas. If one replica inserted some block and placed its checksum into ZK, other replicas will skip inserts with the same checksum.
The last one: what does MV mean? ^_^
Materialized View ( https://www.altinity.com/blog/tag/Materialized+Views )
Haha^_^, thanks a lot!!! It's very nice of you.
"Only 100 last checksums (inserts) are stored." In other words, if I insert the same records twice at a long interval, CH could miss the checksum of my last insertion at the second time which could
cause duplicate results. And can the value of "100" be set larger in some way to avoid this error as much as possible?
The reason why I have these questions is that I use the Spark JDBC writer to insert data into CH, and clickhouse-jdbc uses an httpclient connection pool to manage connections. In some cases, because every batch in a Spark partition is inserted into CH serially, it sometimes gets an expired connection that has been disconnected by the server due to a timeout. Spark then catches this exception and restarts a new task to handle those records, which causes duplicate inserts for this part of the data (in the time it takes to restart and run the new task, ZooKeeper might already have stored 100 other new checksums).
Any solutions for this case? Many thanks~
There is an undocumented MergeTree engine setting, replicated_deduplication_window.
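Assuming a ClickHouse version that accepts MergeTree settings in the CREATE TABLE SETTINGS clause (it can also be set server-wide under merge_tree in the server config), a hedged sketch of enlarging the deduplication window for a hypothetical table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DedupWindowExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical table and ZooKeeper paths. replicated_deduplication_window enlarges
        // the number of recent insert checksums kept in ZooKeeper for deduplication (default 100).
        String ddl =
            "CREATE TABLE test.t_dedup (d Date, id Int64) " +
            "ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/t_dedup', '{replica}') " +
            "PARTITION BY toYYYYMM(d) ORDER BY id " +
            "SETTINGS replicated_deduplication_window = 10000";
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
             Statement st = conn.createStatement()) {
            st.execute(ddl);
        }
    }
}
```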
Just do not use Spark JDBC to insert into CH. Generate TSV files with Spark and upload them using custom code (& jdbc) that understands CH limits and constraints.
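For illustration, a rough sketch of that approach using the plain HTTP interface instead of the JDBC driver, so the uploader stays self-contained (hypothetical host, table, and file path); each file becomes one INSERT ... FORMAT TabSeparated request whose size, timeout, and retries you control yourself:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TsvUploader {
    public static void main(String[] args) throws Exception {
        Path tsv = Paths.get("/data/spark-output/part-00000.tsv"); // hypothetical Spark output file
        String query = URLEncoder.encode("INSERT INTO test.t FORMAT TabSeparated", "UTF-8");
        URL url = new URL("http://localhost:8123/?query=" + query);

        HttpURLConnection http = (HttpURLConnection) url.openConnection();
        http.setRequestMethod("POST");
        http.setDoOutput(true);
        http.setConnectTimeout(10_000);
        http.setReadTimeout(120_000);

        // Stream the whole file as the request body; one file == one INSERT.
        try (OutputStream out = http.getOutputStream();
             InputStream in = Files.newInputStream(tsv)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        if (http.getResponseCode() != 200) {
            throw new RuntimeException("Insert failed: HTTP " + http.getResponseCode());
        }
    }
}
```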
Thanks for your response. I will check out replicated_deduplication_window.
Got it. Decoupling the spark task and clickhouse writer module. Thanks~~~
I ran a test for the deduplication part. As a simple test, I inserted one record twice via the clickhouse-client command. It did deduplicate the same record. But I am not sure if I totally understood what you mentioned about deduplication.
PS: under a ClickHouse local data path like /data1/clickhouse/data/test/t_replicate_test/20190901_0_0_0
there is a checksums.txt (with values in it); is this used for deduplication?
(For each part dir like 20190901_0_0_0, does ClickHouse merge them into a new part besides storing the last 100 dirs?)
Also, under a ZooKeeper path like /clickhouse/tables/01-04/t_replicate_test/replicas/{name}_{ip}/parts/20190901_0_0_0/checksums
I got [], nothing.
According to what you mentioned, there should be some values under the ZooKeeper checksums path.
Is anything wrong here? THANKS~~~~^_^
The deduplication checksums are in ZK under {table_name}/blocks:
CREATE TABLE XXX (key Int64)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/{cluster}/tables/XXX', '{replica}')
order by key
insert into XXX values (1);
SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/stage1/tables/XXX/blocks'
┌─name──────────────────────────────────────────┐
│ all_12634146507524498305_16646457651298263211 │
└────────────────────────────────────────────────┘
insert into XXX values (2);
SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/stage1/tables/XXX/blocks'
┌─name──────────────────────────────────────────┐
│ all_7401560436380295256_13358390750771325174  │
│ all_12634146507524498305_16646457651298263211 │
└────────────────────────────────────────────────┘
optimize table XXX final
SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/stage1/tables/XXX/blocks'
┌─name──────────────────────────────────────────┐
│ all_7401560436380295256_13358390750771325174  │
│ all_12634146507524498305_16646457651298263211 │
└────────────────────────────────────────────────┘
@den-crane if I use insert into xx select, where xx is only a MergeTree table without replication, what does ClickHouse guarantee in case of a failure? Does the following still hold?
It could be atomic if this insert creates only one part.
- only if it inserts into only one partition
- only if the number of rows is less than max_insert_block_size (and also min_insert_block_size_rows, min_insert_block_size_bytes)
- only if the table does not have an MV (there is no atomicity Table <-> MV)
@baibaichen yes
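For anyone who wants to check these limits on their own server, a small sketch that reads the relevant settings over clickhouse-jdbc (hypothetical connection URL; system.settings exposes name and value columns):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowBlockSettings {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT name, value FROM system.settings " +
                     "WHERE name IN ('max_insert_block_size', " +
                     "'min_insert_block_size_rows', 'min_insert_block_size_bytes')";
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // Print each setting and its current value
                System.out.println(rs.getString("name") + " = " + rs.getString("value"));
            }
        }
    }
}
```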