Clickhouse: Merge engine in ReplacingMergeTree is not working as expected

Created on 20 Dec 2019 · 4 Comments · Source: ClickHouse/ClickHouse

Hi,

I have created a table using ReplacingMergeTree
    CREATE TABLE products_meta
    (
        account_id UUID,
        product_external_id String,
        key String,
        value String,
        created_at DateTime DEFAULT now()
    )
    ENGINE = ReplacingMergeTree()
    PARTITION BY account_id
    ORDER BY (account_id, key, value, product_external_id)

According to the documentation, I assume the merge process runs in the background at some point.
My question is: why, after more than a month, has the merge that removes duplicates still not been executed?
Am I missing something, or some configuration setting?

Thank you very much :)

question

All 4 comments

https://clickhouse.yandex/docs/en/operations/table_engines/replacingmergetree/

_Data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can't plan for it. Some of the data may remain unprocessed. Although you can run an unscheduled merge using the OPTIMIZE query, don't count on using it, because the OPTIMIZE query will read and write a large amount of data._

There are many reasons for that. The ClickHouse merge scheduler can decide that further merges are not worthwhile, for example because the number of parts is small (say ~4) while merging them would require huge I/O, or because the resulting part would exceed the maximum part size (150 GB by default).
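As a minimal sketch (the table name comes from the question above), you can see how many active parts each partition currently has by querying the system.parts table:

```sql
-- Count active data parts per partition for products_meta.
-- A partition that keeps sitting on one small part will not be merged again on its own.
SELECT
    partition,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'products_meta'
  AND active
GROUP BY partition
ORDER BY active_parts DESC
```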

Another reason: if the duplicates were inserted with a single INSERT, they all end up in the same part. That part will never be merged automatically, because there is no reason to merge a single part. OPTIMIZE ... FINAL initiates a re-merge (a full rewrite) even of a single part.
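A minimal sketch of that unscheduled merge for the table from the question (keeping in mind the documentation's warning that it reads and rewrites a large amount of data):

```sql
-- Rewrites every part (even a lone part) so ReplacingMergeTree deduplication
-- is applied now instead of at some unknown background time.
-- OPTIMIZE TABLE products_meta PARTITION ID '<partition_id>' FINAL
-- limits the rewrite to a single partition.
OPTIMIZE TABLE products_meta FINAL;
```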

That explains a lot.
It probably happened because the inserts were made one by one.

Thank you very much

Now that I understand this scenario, I would like to know what the best solution is.

  1. Add deduplication rules to every query so that only non-duplicated results are selected. That means each query carries an extra cost.
  2. Run a cron process that executes OPTIMIZE ... FINAL on each table.

If you have other solutions, please tell me.

Thank you very much

I use 5 different solutions in one project. Different solutions fit different cases (the query-level patterns are sketched after this list).

  1. Re-create the table by inserting into another table without duplicates.
  2. SELECT ... FROM T FINAL
  3. SELECT argMax(*, ...)
  4. cron / OPTIMIZE
  5. ORDER BY ... LIMIT BY
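
For the table from the question, minimal sketches of the query-level patterns (2, 3 and 5) could look like the following. Treating created_at as the "latest row wins" column is an assumption, since the table declares ReplacingMergeTree() without a version column:

```sql
-- Pattern 2: FINAL merges rows sharing the sorting key at read time (slower reads).
SELECT *
FROM products_meta FINAL;

-- Pattern 3: argMax returns the value from the row with the newest created_at.
-- Grouping by (account_id, product_external_id, key) is stricter than the table's
-- full sorting key; adjust the GROUP BY to whatever "duplicate" means in your case.
SELECT
    account_id,
    product_external_id,
    key,
    argMax(value, created_at) AS latest_value
FROM products_meta
GROUP BY account_id, product_external_id, key;

-- Pattern 5: LIMIT BY keeps one row (the newest) per sorting-key combination.
SELECT *
FROM products_meta
ORDER BY created_at DESC
LIMIT 1 BY account_id, key, value, product_external_id;
```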