Hi,
I have created a table using ReplacingMergeTree
CREATE TABLE products_meta
(
account_id UUID,
product_external_id String,
key String,
value String,
created_at DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree()
PARTITION BY account_id
ORDER BY (account_id, key, value, product_external_id)
I assume that the "merge process" occurs in the background at some time according to the documentation.
My question is the following: Why after more than a month, the "merge process" in order to skip duplicates has not been executed?
Do I miss something or some configuration?
Thank you very much :)
https://clickhouse.yandex/docs/en/operations/table_engines/replacingmergetree/
_Data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can't plan for it. Some of the data may remain unprocessed. Although you can run an unscheduled merge using the OPTIMIZE query, don't count on using it, because the OPTIMIZE query will read and write a large amount of data._
There are many reasons for that. CH merge scheduler can decide that further merges are unreasonable because a number of parts is small e.g. ~4 but a merge of them will require huge i/o. Or because resulting parts will be bigger than maximal parts size (150GB by default).
Another reason if one inserted duplicates by 1 insert and they are in the same single part. This part will be never merged automatically because no reason to merge 1 part. Optimize ... final initiate re-merge (full re-write) even of a singe part.
That explains a lot.
It happened because probably the inserts were made 1 by 1.
Thank you very much
Once I understood this scenario, I would like to know what is the best solution.
If you have other solutions please tell me
Thank you very much
I use 5 different solutions in one project. Different solutions fit for different cases.
Most helpful comment
I use 5 different solutions in one project. Different solutions fit for different cases.