Clickhouse: OPTIMIZE FINAL doesn't work correctly.

Created on 21 Mar 2018  ยท  8Comments  ยท  Source: ClickHouse/ClickHouse

CREATE TABLE test
(
    date Date,
    x Int32,
    ver UInt64
)
ENGINE = MergeTree(date, x, 4096);

INSERT INTO test VALUES ('2018-03-21', 1, 1), ('2018-03-21', 1, 2);

CREATE TABLE replacing ENGINE = ReplacingMergeTree(date, x, 4096, ver) AS SELECT * FROM test;

OPTIMIZE TABLE replacing PARTITION 201803 FINAL;

Actual results:

SELECT x FROM replacing 

โ”Œโ”€xโ”€โ”
โ”‚ 1 โ”‚
โ”‚ 1 โ”‚
โ””โ”€โ”€โ”€โ”˜

SELECT x FROM replacing FINAL 

โ”Œโ”€xโ”€โ”
โ”‚ 1 โ”‚
โ”‚ 1 โ”‚
โ””โ”€โ”€โ”€โ”˜

Expected in both cases:

โ”Œโ”€xโ”€โ”
โ”‚ 1 โ”‚
โ””โ”€โ”€โ”€โ”˜
enhancement

Most helpful comment

Recently FINAL (both SELECT with FINAL and OPTIMIZE with FINAL) wasn't doing any transformations while merging in the case of single data part.

This is considered Ok because it's expected that data modifications are inserted in separate INSERTions (so the data that should be replaced/collapsed/aggregated/etc is located in multiple different data parts).

But for convenience, I have modified this behaviour. Now data is merged in case of FINAL even if there is just one part.

Committed in master.

All 8 comments

If i understand it correctly:
1) With one insert you create one part.
2) Duplicates are removed at the moment of MERGING different parts
3) OPTIMIZE forces merge process (you already have one part so it do nothing)
4) FINAL apply the same logic as merge process (so replace old version with newer from different parts)

So general workaround is: when you inserting one block of data - apply any logic of overwriting rows before insert. If you need to make some changes in already stored data - when do another insert with newer version and it will work. So your test case should work if you will split one insert to 2.

@filimonov

If you specify FINAL, optimization will be performed even when all the data is already in one part.

https://clickhouse.yandex/docs/en/single/#optimize

Recently FINAL (both SELECT with FINAL and OPTIMIZE with FINAL) wasn't doing any transformations while merging in the case of single data part.

This is considered Ok because it's expected that data modifications are inserted in separate INSERTions (so the data that should be replaced/collapsed/aggregated/etc is located in multiple different data parts).

But for convenience, I have modified this behaviour. Now data is merged in case of FINAL even if there is just one part.

Committed in master.

Hmm... Wouldn't it lead to big performance downgrade in some scenarios? Nop operation will be changed to big remerge?

May be it will be safer to make separate syntax for that case? Like

OPTIMIZE ... FINAL REMERGE
-- or
OPTIMIZE ... FINAL SETTINGS optimize_single_part = 1
-- or something like that?

OPTIMIZE FINAL is already doing re-merge even in case of single part (as it is documented).

Initially, it was intended for GraphiteMergeTree. But for other MergeTrees, it was read and rewrite all data on disk, doing no transforms. This case for OPTIMIZE FINAL was almost useless (actually it is useful to materialize columns with complex DEFAULT expressions on disk after adding them with ALTER).

We can introduce additional flag to disable re-merge in case of single part.

I'm not sure if i understand and remember that properly, but i think on telegram chat somebody mentioned that they use ReplacingMergeTree and regularly fire OPTIMIZE FINAL (in cron or something like that), to have the state of table consistent. That's not too smart, but could work. After that change that will make huge load for useless single part 'remerging', and be a bad 'surprise' after update. So may be additional flag would be safer?

OPTIMIZE FINAL is already doing full data rewrite at that moment.
Please check the commit. It only removes optimization of calculations inside merge.

@filimonov what filimonov say is right.

:) drop table pk_local;

:) CREATE TABLE default.pk_local ( id Int64,  name String,  age Int64, created_at Date,  updated_at Date) ENGINE = ReplacingMergeTree(updated_at) partition by created_at order by id settings enable_vertical_merge_algorithm = 1, vertical_merge_algorithm_min_rows_to_activate = 1, vertical_merge_algorithm_min_columns_to_activate = 1;

:) insert into pk_local values(2, 'old', 12, '1998-03-11', '1998-03-11');
-- ๅ…ณ้”ฎๆ‰€ๅœจ๏ผšๆ’ๅ…ฅไธ€ๆกไน‹ๅŽ้œ€่ฆๅšไธ€ๆฌกไผ˜ๅŒ–่กจ
-- the most important sql. do optimize after insert every time, then replacing will work
:) optimize table pk_local;
:) insert into pk_local values(2, 'new', 12, '1998-03-11', '1998-03-13');
:) select * from pk_local;
โ”Œโ”€idโ”€โ”ฌโ”€nameโ”€โ”ฌโ”€ageโ”€โ”ฌโ”€created_atโ”€โ”ฌโ”€updated_atโ”€โ”
โ”‚  2 โ”‚ old  โ”‚  12 โ”‚ 1998-03-11 โ”‚ 1998-03-11 โ”‚
โ””โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€idโ”€โ”ฌโ”€nameโ”€โ”ฌโ”€ageโ”€โ”ฌโ”€created_atโ”€โ”ฌโ”€updated_atโ”€โ”
โ”‚  2 โ”‚ new  โ”‚  12 โ”‚ 1998-03-11 โ”‚ 1998-03-13 โ”‚
โ””โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
:) optimize table pk_local;
:) select * from pk_local;

โ”Œโ”€idโ”€โ”ฌโ”€nameโ”€โ”ฌโ”€ageโ”€โ”ฌโ”€created_atโ”€โ”ฌโ”€updated_atโ”€โ”
โ”‚  2 โ”‚ new  โ”‚  12 โ”‚ 1998-03-11 โ”‚ 1998-03-13 โ”‚
โ””โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

่”็ณปๆˆ‘

ๅฆ‚ๆžœๆ˜ฏไธญๅ›ฝ็š„ๅŒๅญฆ๏ผŒๅฏไปฅๅŠ ๅพฎไฟก่ฎจ่ฎบ wu-shuiyong๏ผˆ่ฏท้™„ไธŠๅๅญ—ๅ’Œๅ…ฌๅธ๏ผŒ่ฐข่ฐข๏ผ‰

Was this page helpful?
0 / 5 - 0 ratings

Related issues

igor-sh8 picture igor-sh8  ยท  3Comments

derekperkins picture derekperkins  ยท  3Comments

zhicwu picture zhicwu  ยท  3Comments

vvp83 picture vvp83  ยท  3Comments

innerr picture innerr  ยท  3Comments