Clickhouse: OPTIMIZE FINAL doesn't work correctly.

Created on 21 Mar 2018 · 8Comments · Source: ClickHouse/ClickHouse

CREATE TABLE test
(
    date Date,
    x Int32,
    ver UInt64
)
ENGINE = MergeTree(date, x, 4096);

INSERT INTO test VALUES ('2018-03-21', 1, 1), ('2018-03-21', 1, 2);

CREATE TABLE replacing ENGINE = ReplacingMergeTree(date, x, 4096, ver) AS SELECT * FROM test;

OPTIMIZE TABLE replacing PARTITION 201803 FINAL;

Actual results:

SELECT x FROM replacing 

┌─x─┐
│ 1 │
│ 1 │
└───┘

SELECT x FROM replacing FINAL 

┌─x─┐
│ 1 │
│ 1 │
└───┘

Expected in both cases:

┌─x─┐
│ 1 │
└───┘

enhancement

Source

StarWix

👍2

Most helpful comment

Recently FINAL (both SELECT with FINAL and OPTIMIZE with FINAL) wasn't doing any transformations while merging in the case of single data part.

This is considered Ok because it's expected that data modifications are inserted in separate INSERTions (so the data that should be replaced/collapsed/aggregated/etc is located in multiple different data parts).

But for convenience, I have modified this behaviour. Now data is merged in case of FINAL even if there is just one part.

Committed in master.

alexey-milovidov on 6 Apr 2018

👍5

All 8 comments

If i understand it correctly:
1) With one insert you create one part.
2) Duplicates are removed at the moment of MERGING different parts
3) OPTIMIZE forces merge process (you already have one part so it do nothing)
4) FINAL apply the same logic as merge process (so replace old version with newer from different parts)

So general workaround is: when you inserting one block of data - apply any logic of overwriting rows before insert. If you need to make some changes in already stored data - when do another insert with newer version and it will work. So your test case should work if you will split one insert to 2.

filimonov on 21 Mar 2018

❤2 👍2 🎉1

@filimonov

If you specify FINAL, optimization will be performed even when all the data is already in one part.

https://clickhouse.yandex/docs/en/single/#optimize

StarWix on 21 Mar 2018

👍2 🎉1

Recently FINAL (both SELECT with FINAL and OPTIMIZE with FINAL) wasn't doing any transformations while merging in the case of single data part.

But for convenience, I have modified this behaviour. Now data is merged in case of FINAL even if there is just one part.

Committed in master.

alexey-milovidov on 6 Apr 2018

👍5

Hmm... Wouldn't it lead to big performance downgrade in some scenarios? Nop operation will be changed to big remerge?

May be it will be safer to make separate syntax for that case? Like

OPTIMIZE ... FINAL REMERGE
-- or
OPTIMIZE ... FINAL SETTINGS optimize_single_part = 1
-- or something like that?

filimonov on 7 Apr 2018

OPTIMIZE FINAL is already doing re-merge even in case of single part (as it is documented).

Initially, it was intended for GraphiteMergeTree. But for other MergeTrees, it was read and rewrite all data on disk, doing no transforms. This case for OPTIMIZE FINAL was almost useless (actually it is useful to materialize columns with complex DEFAULT expressions on disk after adding them with ALTER).

We can introduce additional flag to disable re-merge in case of single part.

alexey-milovidov on 7 Apr 2018

I'm not sure if i understand and remember that properly, but i think on telegram chat somebody mentioned that they use ReplacingMergeTree and regularly fire OPTIMIZE FINAL (in cron or something like that), to have the state of table consistent. That's not too smart, but could work. After that change that will make huge load for useless single part 'remerging', and be a bad 'surprise' after update. So may be additional flag would be safer?

filimonov on 7 Apr 2018

OPTIMIZE FINAL is already doing full data rewrite at that moment.
Please check the commit. It only removes optimization of calculations inside merge.

alexey-milovidov on 8 Apr 2018

👍1

@filimonov what filimonov say is right.

:) drop table pk_local;

:) CREATE TABLE default.pk_local ( id Int64,  name String,  age Int64, created_at Date,  updated_at Date) ENGINE = ReplacingMergeTree(updated_at) partition by created_at order by id settings enable_vertical_merge_algorithm = 1, vertical_merge_algorithm_min_rows_to_activate = 1, vertical_merge_algorithm_min_columns_to_activate = 1;

:) insert into pk_local values(2, 'old', 12, '1998-03-11', '1998-03-11');
-- 关键所在：插入一条之后需要做一次优化表
-- the most important sql. do optimize after insert every time, then replacing will work
:) optimize table pk_local;
:) insert into pk_local values(2, 'new', 12, '1998-03-11', '1998-03-13');
:) select * from pk_local;
┌─id─┬─name─┬─age─┬─created_at─┬─updated_at─┐
│  2 │ old  │  12 │ 1998-03-11 │ 1998-03-11 │
└────┴──────┴─────┴────────────┴────────────┘
┌─id─┬─name─┬─age─┬─created_at─┬─updated_at─┐
│  2 │ new  │  12 │ 1998-03-11 │ 1998-03-13 │
└────┴──────┴─────┴────────────┴────────────┘
:) optimize table pk_local;
:) select * from pk_local;

┌─id─┬─name─┬─age─┬─created_at─┬─updated_at─┐
│  2 │ new  │  12 │ 1998-03-11 │ 1998-03-13 │
└────┴──────┴─────┴────────────┴────────────┘