At first, I thought ReplacingMergeTree could do this, but after trying several times (inserting a set of data from a file with version 1, then inserting the same data set with version 2), I found that this method can't achieve data deduplication, even if I create a materialized view selecting with the FINAL keyword, or group by max(ver).
I also read in the documentation (https://clickhouse.yandex/docs/en/table_engines/replacingmergetree.html) that ReplacingMergeTree does not guarantee data deduplication.
So how do I guarantee data deduplication in cases where I have to insert the same set of data many times (e.g. loading data from a file into a table was terminated by an unexpected exception, so I retry the load)?
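For reference, the setup described above can be sketched roughly as follows (table and column names are hypothetical; `ver` is the version column assumed in the question):

```sql
-- Hypothetical table illustrating the attempt described above.
-- ReplacingMergeTree collapses rows with the same sorting key only
-- during background merges, which run at an unspecified time, so
-- duplicates can persist indefinitely.
CREATE TABLE events
(
    id UInt64,
    payload String,
    ver UInt32
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY id;

-- Loading the same file twice leaves duplicate rows until a merge
-- happens; SELECT ... FINAL merges at read time, but is expensive
-- and still does not prevent the duplicates from being stored.
SELECT * FROM events FINAL WHERE id = 42;
```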
If you use Replicated tables, they will deduplicate inserted blocks of data:
-- if an exactly identical block of data (the same rows in the same order) is inserted twice, it is effectively inserted only once.
See:
https://clickhouse.yandex/docs/en/table_engines/replication.html?highlight=deduplicated
(after "Blocks of data are deduplicated.")
Non-replicated tables don't have this feature.
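A minimal sketch of this behaviour (the ZooKeeper path, macros, and table name are hypothetical):

```sql
-- Hypothetical replicated table; inserted blocks are deduplicated by
-- checksum recorded in ZooKeeper, so a retried batch is not stored twice.
CREATE TABLE events_replicated
(
    id UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_replicated', '{replica}')
ORDER BY id;

-- If this exact INSERT (same rows, same order) is retried after a
-- failed load, the server recognizes the block checksum and skips it.
INSERT INTO events_replicated VALUES (1, 'a'), (2, 'b');
INSERT INTO events_replicated VALUES (1, 'a'), (2, 'b');  -- deduplicated
```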
I can't remember which version introduced DEDUPLICATE for the OPTIMIZE query, but it was effective for me in removing dupes. It does appear to rewrite the whole table. Also, I can't remember whether I needed to use FINAL.
OPTIMIZE TABLE tablename [FINAL] DEDUPLICATE;
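A concrete invocation, with a hypothetical table name and a follow-up query to check whether any duplicates remain:

```sql
-- Force a full merge and drop fully identical rows (rewrites the table).
OPTIMIZE TABLE events FINAL DEDUPLICATE;

-- Verify: this should return no rows once deduplication has completed.
SELECT id, count() AS c
FROM events
GROUP BY id
HAVING c > 1;
```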
@lttPo do you have any further questions?