At first, I thought ReplacingMergeTree could do this, but after trying several times (inserting a set of data from a file with version 1, then inserting the same data set with version 2), I found that this method can't achieve data deduplication, even if I create a materialized view selecting with the FINAL keyword, or group by max(ver).
I also read in the documentation (https://clickhouse.yandex/docs/en/table_engines/replacingmergetree.html) that ReplacingMergeTree does not guarantee data deduplication.
So how do I guarantee data deduplication in cases where I have to insert the same set of data many times (e.g. loading data from a file into a table was terminated by an unexpected exception, so I retry the load)?
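For reference, the setup described above can be sketched roughly as follows (table and column names are hypothetical; `ver` is the version column assumed in the question):

```sql
-- Hypothetical table illustrating the attempt described above.
-- ReplacingMergeTree collapses rows with the same sorting key only
-- during background merges, which run at an unspecified time, so
-- duplicates can persist indefinitely.
CREATE TABLE events
(
    id UInt64,
    payload String,
    ver UInt32
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY id;

-- Loading the same file twice leaves duplicate rows until a merge
-- happens; SELECT ... FINAL merges at read time, but is expensive
-- and still does not prevent the duplicates from being stored.
SELECT * FROM events FINAL WHERE id = 42;
```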
If you use Replicated tables, they will deduplicate inserted blocks of data:
-- if an exactly identical block of data (the same rows in the same order) is inserted twice, it is effectively inserted only once.
See:
https://clickhouse.yandex/docs/en/table_engines/replication.html?highlight=deduplicated
(after "Blocks of data are deduplicated.")
Non-replicated tables don't have this feature.
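A minimal sketch of this behaviour (the ZooKeeper path, macros, and table name are hypothetical):

```sql
-- Hypothetical replicated table; inserted blocks are deduplicated by
-- checksum recorded in ZooKeeper, so a retried batch is not stored twice.
CREATE TABLE events_replicated
(
    id UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_replicated', '{replica}')
ORDER BY id;

-- If this exact INSERT (same rows, same order) is retried after a
-- failed load, the server recognizes the block checksum and skips it.
INSERT INTO events_replicated VALUES (1, 'a'), (2, 'b');
INSERT INTO events_replicated VALUES (1, 'a'), (2, 'b');  -- deduplicated
```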
I can't remember which version introduced DEDUPLICATE for the OPTIMIZE query, but it was effective for me in removing dupes. It does appear to rewrite the whole table. Also, I can't remember whether I needed to use FINAL.
OPTIMIZE TABLE tablename [FINAL] DEDUPLICATE;
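A concrete invocation, with a hypothetical table name and a follow-up query to check whether any duplicates remain:

```sql
-- Force a full merge and drop fully identical rows (rewrites the table).
OPTIMIZE TABLE events FINAL DEDUPLICATE;

-- Verify: this should return no rows once deduplication has completed.
SELECT id, count() AS c
FROM events
GROUP BY id
HAVING c > 1;
```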
@lttPo do you have any further questions?