ClickHouse: How does ClickHouse do deduplication internally?

Created on 15 Oct 2020 · 9 comments · Source: ClickHouse/ClickHouse

Hi,

Based on the documentation: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replication/
I know ClickHouse deduplicates identical blocks, which is great. My question is how ClickHouse does it internally and how reliable it is.

I think I read somewhere in this repository that it stores block hashes in ZooKeeper and uses these hashes to deduplicate. My question is, does it do this atomically with writing the block? What if a failure happens between loading the block and writing its hash to ZooKeeper? Could it cause any data duplication or data loss?

I think this should be the process (please correct me, if it is different):

  1. read block
  2. compute block hash
  3. write block hash to ZK
  4. if success -> write block itself -> return (success)
  5. if not -> duplicate -> return (success?) (I think)

My question is what happens if there is a failure after step 3. To avoid data loss/duplication, I think ClickHouse must guarantee atomicity for 1) writing the block and 2) writing its hash to ZooKeeper. Does it guarantee that?

Does ClickHouse use a WAL to guarantee atomicity when writing a block? Is writing to ZooKeeper also recorded in the WAL?

Thanks

question

All 9 comments

My question is, does it do this atomically with writing the block?

Not quite. All the logic is located in the ReplicatedMergeTreeBlockOutputStream::commitPart function.

I think this should be the process (please correct me, if it is different)

The process is roughly the following:

  1. Write the data to the filesystem as a temporary data part and calculate its hash.
    The temporary data part is not visible to SELECTs.

  2. Obtain a block number from ZooKeeper and check for duplicates (see the sketch just below).
    If a duplicate is found, the part will either be rejected, if it already exists locally, or (since version 20.10) considered successfully replicated, if it was written to another replica before but does not exist locally yet (so manual insertion of the same part into different replicas has the same effect as replication).
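
A minimal sketch of those two steps, with a plain Python dict standing in for the block registry in ZooKeeper and sha256 standing in for the hash the real implementation computes; the names (write_temporary_part, is_duplicate, zk_blocks) are illustrative, not ClickHouse APIs:

```python
import hashlib

# Stand-in for the block IDs already registered in ZooKeeper (illustrative only).
zk_blocks = {}

def write_temporary_part(rows):
    """Step 1: write the block into a temporary part on the local filesystem
    (not yet visible) and compute its hash. sha256 is a stand-in for the
    hash used by the real implementation."""
    data = "\n".join(map(str, rows)).encode()
    tmp_path = "tmp_insert_all_1_1_0"      # would be a real temporary part directory
    block_id = hashlib.sha256(data).hexdigest()
    return tmp_path, block_id

def is_duplicate(block_id):
    """Step 2: look the block ID up among the block IDs already registered
    in ZooKeeper; if it is there, this exact block was inserted before."""
    return block_id in zk_blocks

tmp_path, block_id = write_temporary_part([1, 2, 3])
print(is_duplicate(block_id))        # False on the first insert
zk_blocks[block_id] = "all_1_1_0"    # registered when the insert is committed
print(is_duplicate(block_id))        # True when the same block is retried
```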

What follows differs depending on whether the part is a duplicate or not; let's assume it's not a duplicate:

  1. Rename the part in the filesystem to its final destination and make it visible in the in-memory parts set. Now the data part is visible to SELECTs.

  2. Commit a transaction to ZooKeeper that adds metadata about this data part to the replication log and to the parts set.

  3. There are three possible outcomes: the transaction succeeds, the transaction fails, or the result is unknown (see the sketch below).
    If it fails, we roll back the changes in memory and in the filesystem, and the part is not inserted. If the result is unknown, we keep the data part and schedule a check of the ZooKeeper state for later, when the ZooKeeper session is reinitialized. In both cases we try to return an exception to the user; the user may or may not receive it (e.g. in case of a connection failure). In both cases, the user must repeat the insertion.
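
A rough sketch of how those three outcomes are handled; the ZooKeeperTimeout split and the helper names are assumptions for illustration, not the actual C++ code in commitPart:

```python
class ZooKeeperTimeout(Exception):
    """Stand-in for 'the session was lost and we do not know whether
    the transaction was applied' (illustrative)."""

def commit_part(zk_commit, rollback_local_part, schedule_part_check):
    """Sketch of the three possible outcomes of the ZooKeeper commit."""
    try:
        zk_commit()              # atomic multi-op: replication log + parts set
        return "ok"
    except ZooKeeperTimeout:
        # Unknown result: keep the local data part, re-check the ZooKeeper
        # state once the session is re-established, and report an error so
        # the client retries the INSERT.
        schedule_part_check()
        raise
    except Exception:
        # Definite failure: roll back the in-memory and on-disk changes;
        # the part is not inserted.
        rollback_local_part()
        raise

def failing_commit():
    raise Exception("ZNODEEXISTS")

try:
    commit_part(failing_commit,
                rollback_local_part=lambda: print("rolled back"),
                schedule_part_check=lambda: print("scheduled check"))
except Exception as e:
    print("insert failed:", e)
```

In both failure paths the client is expected to retry the INSERT; if the first attempt did in fact reach ZooKeeper, the retry is recognized as a duplicate block and skipped.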

Read the full code with comments here: https://clickhouse.tech/codebrowser/html_report/ClickHouse/src/Storages/MergeTree/ReplicatedMergeTreeBlockOutputStream.cpp.html#_ZN2DB36ReplicatedMergeTreeBlockOutputStream10commitPartERNSt3__110shared_ptrIN6zkutil9ZooKeeperEEERNS2_INS_18IMergeTreeDataPartEEERKNS1_12basic_strin5740540

What will happen if the ClickHouse server crashes after step 3 and before step 4, i.e. before committing the transaction to ZooKeeper?

On the next startup, it will compare the state in ZooKeeper (as the ground truth) with the local state, consider the written part "unexpected", and move it to the detached directory.
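
Roughly, with illustrative names (not the actual ClickHouse internals):

```python
def reconcile_on_startup(local_parts, zk_parts, detach):
    """Parts that exist locally but were never committed to ZooKeeper are
    treated as 'unexpected' and moved aside into detached/ instead of
    being served to SELECTs."""
    for part in local_parts:
        if part not in zk_parts:
            detach(part)

reconcile_on_startup(
    local_parts={"all_1_1_0", "all_2_2_0"},
    zk_parts={"all_1_1_0"},        # the crash happened before committing all_2_2_0
    detach=lambda p: print(f"moving {p} to detached/"),
)
```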

So, the order is:

  1. Write to a temp file and calculate the hash.
  2. Check the hash for duplicates.
  3. Make the file visible.
  4. Write the block hash and metadata to ZooKeeper.

But what if someone on some other replica adds a block with the same hash after we checked the hash, i.e. after step 2?

Shouldn't the order be like this:

  1. Write the block to the temp file.
  2. Atomically {check the existence of the hash AND write the block metadata (including its hash) to ZooKeeper}.
  3. Make the file visible.

But what if someone on some other replica adds a block with the same hash after we checked the hash, i.e. after step 2?

At step 4 it is also checked, atomically.

But in step 3, we made the block visible to SELECTs. So we make a block visible before the final duplicate check? Does that mean at least for some time we might see duplicates?

Does that mean at least for some time we might see duplicates?

No, there is a lock (an ephemeral node that prevents insertion of duplicates and prevents merging with the new part) that is acquired at step 2 and released at step 4.
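
A sketch of how that closes the window, with a plain set and explicit acquire/release standing in for the ephemeral ZooKeeper node; the names and layout are illustrative, not the real ZooKeeper paths:

```python
ephemeral_locks = set()   # stand-in for the ephemeral lock nodes in ZooKeeper
zk_blocks = set()         # stand-in for the committed block IDs in ZooKeeper

def try_lock_block(block_id):
    """Step 2: acquiring the ephemeral node fails if another replica is
    currently inserting (or has already inserted) the same block."""
    if block_id in ephemeral_locks or block_id in zk_blocks:
        return False
    ephemeral_locks.add(block_id)
    return True

def commit_and_unlock(block_id):
    """Step 4: the block ID is registered in ZooKeeper and the lock is
    released, so a later insert of the same block is rejected by the
    block ID itself rather than by the lock."""
    zk_blocks.add(block_id)
    ephemeral_locks.discard(block_id)

assert try_lock_block("block_A")        # first replica takes the lock
assert not try_lock_block("block_A")    # a concurrent duplicate is held off
commit_and_unlock("block_A")
assert not try_lock_block("block_A")    # after commit it is a known duplicate
```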

So the logical order is as follows?

  1. Write to a temp file and calculate the hash.
  2. Check the hash for duplicates & take the lock.
  3. Make the file visible.
  4. Do the duplicate check again & write the block hash and metadata to ZooKeeper.

I don't understand why we need to check the hash again at step 4 if we lock at step 2; no one should be able to add the same part while it is locked.

Another question: if we crash after step 3, then as far as I understand, on the next startup ClickHouse will detach the part because it is not in ZooKeeper. So we remove parts that we have already made visible. Why not first write to ZooKeeper and then make the part visible? If the server crashes after writing to ZooKeeper and before making the part visible, we could make the part visible on the next startup.
