Clickhouse: Merge engine in ReplacingMergeTree is not working as expected

Created on 20 Dec 2019 · 4 Comments · Source: ClickHouse/ClickHouse

Hi,

I have created a table using ReplacingMergeTree
    CREATE TABLE products_meta
    (
        account_id UUID,
        product_external_id String,
        key String,
        value String,
        created_at DateTime DEFAULT now()
    )
    ENGINE = ReplacingMergeTree()
    PARTITION BY account_id
    ORDER BY (account_id, key, value, product_external_id)

According to the documentation, I assume the merge process runs in the background at some point.
My question is: why, after more than a month, has the merge that removes duplicates still not been executed?
Am I missing something, or some configuration setting?

Thank you very much :)

question

All 4 comments

https://clickhouse.yandex/docs/en/operations/table_engines/replacingmergetree/

_Data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can't plan for it. Some of the data may remain unprocessed. Although you can run an unscheduled merge using the OPTIMIZE query, don't count on using it, because the OPTIMIZE query will read and write a large amount of data._

There are many reasons for that. The ClickHouse merge scheduler can decide that further merges are not worthwhile, for example because the number of parts is small (say ~4) while merging them would require huge I/O, or because the resulting part would exceed the maximum part size (150 GB by default).
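As a minimal sketch (the table name comes from the question above), you can see how many active parts each partition currently has by querying the system.parts table:

```sql
-- Count active data parts per partition for products_meta.
-- A partition that keeps sitting on one small part will not be merged again on its own.
SELECT
    partition,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'products_meta'
  AND active
GROUP BY partition
ORDER BY active_parts DESC
```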

Another reason: if the duplicates were inserted with a single INSERT, they all end up in the same part. That part will never be merged automatically, because there is no reason to merge a single part. OPTIMIZE ... FINAL initiates a re-merge (a full rewrite) even of a single part.
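A minimal sketch of that unscheduled merge for the table from the question (keeping in mind the documentation's warning that it reads and rewrites a large amount of data):

```sql
-- Rewrites every part (even a lone part) so ReplacingMergeTree deduplication
-- is applied now instead of at some unknown background time.
-- OPTIMIZE TABLE products_meta PARTITION ID '<partition_id>' FINAL
-- limits the rewrite to a single partition.
OPTIMIZE TABLE products_meta FINAL;
```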

That explains a lot.
It probably happened because the inserts were made one by one.

Thank you very much

Now that I understand this scenario, I would like to know what the best solution is.

  1. Add deduplication rules to every query so that only non-duplicated results are selected. That means each query carries an extra cost.
  2. Run a cron process that executes OPTIMIZE ... FINAL on each table.

If you have other solutions, please tell me.

Thank you very much

I use 5 different solutions in one project. Different solutions fit different cases (the query-level patterns are sketched after this list).

  1. Re-create the table by inserting into another table without duplicates.
  2. SELECT ... FROM T FINAL
  3. SELECT argMax(*, ...)
  4. cron / OPTIMIZE
  5. ORDER BY ... LIMIT BY
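
For the table from the question, minimal sketches of the query-level patterns (2, 3 and 5) could look like the following. Treating created_at as the "latest row wins" column is an assumption, since the table declares ReplacingMergeTree() without a version column:

```sql
-- Pattern 2: FINAL merges rows sharing the sorting key at read time (slower reads).
SELECT *
FROM products_meta FINAL;

-- Pattern 3: argMax returns the value from the row with the newest created_at.
-- Grouping by (account_id, product_external_id, key) is stricter than the table's
-- full sorting key; adjust the GROUP BY to whatever "duplicate" means in your case.
SELECT
    account_id,
    product_external_id,
    key,
    argMax(value, created_at) AS latest_value
FROM products_meta
GROUP BY account_id, product_external_id, key;

-- Pattern 5: LIMIT BY keeps one row (the newest) per sorting-key combination.
SELECT *
FROM products_meta
ORDER BY created_at DESC
LIMIT 1 BY account_id, key, value, product_external_id;
```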