CREATE TABLE test
(
date Date,
x Int32,
ver UInt64
)
ENGINE = MergeTree(date, x, 4096);
INSERT INTO test VALUES ('2018-03-21', 1, 1), ('2018-03-21', 1, 2);
CREATE TABLE replacing ENGINE = ReplacingMergeTree(date, x, 4096, ver) AS SELECT * FROM test;
OPTIMIZE TABLE replacing PARTITION 201803 FINAL;
Actual results:
SELECT x FROM replacing
โโxโโ
โ 1 โ
โ 1 โ
โโโโโ
SELECT x FROM replacing FINAL
โโxโโ
โ 1 โ
โ 1 โ
โโโโโ
Expected in both cases:
โโxโโ
โ 1 โ
โโโโโ
If i understand it correctly:
1) With one insert you create one part.
2) Duplicates are removed at the moment of MERGING different parts
3) OPTIMIZE forces merge process (you already have one part so it do nothing)
4) FINAL apply the same logic as merge process (so replace old version with newer from different parts)
So general workaround is: when you inserting one block of data - apply any logic of overwriting rows before insert. If you need to make some changes in already stored data - when do another insert with newer version and it will work. So your test case should work if you will split one insert to 2.
@filimonov
If you specify FINAL, optimization will be performed even when all the data is already in one part.
Recently FINAL (both SELECT with FINAL and OPTIMIZE with FINAL) wasn't doing any transformations while merging in the case of single data part.
This is considered Ok because it's expected that data modifications are inserted in separate INSERTions (so the data that should be replaced/collapsed/aggregated/etc is located in multiple different data parts).
But for convenience, I have modified this behaviour. Now data is merged in case of FINAL even if there is just one part.
Committed in master.
Hmm... Wouldn't it lead to big performance downgrade in some scenarios? Nop operation will be changed to big remerge?
May be it will be safer to make separate syntax for that case? Like
OPTIMIZE ... FINAL REMERGE
-- or
OPTIMIZE ... FINAL SETTINGS optimize_single_part = 1
-- or something like that?
OPTIMIZE FINAL is already doing re-merge even in case of single part (as it is documented).
Initially, it was intended for GraphiteMergeTree. But for other MergeTrees, it was read and rewrite all data on disk, doing no transforms. This case for OPTIMIZE FINAL was almost useless (actually it is useful to materialize columns with complex DEFAULT expressions on disk after adding them with ALTER).
We can introduce additional flag to disable re-merge in case of single part.
I'm not sure if i understand and remember that properly, but i think on telegram chat somebody mentioned that they use ReplacingMergeTree and regularly fire OPTIMIZE FINAL (in cron or something like that), to have the state of table consistent. That's not too smart, but could work. After that change that will make huge load for useless single part 'remerging', and be a bad 'surprise' after update. So may be additional flag would be safer?
OPTIMIZE FINAL is already doing full data rewrite at that moment.
Please check the commit. It only removes optimization of calculations inside merge.
@filimonov what filimonov say is right.
:) drop table pk_local;
:) CREATE TABLE default.pk_local ( id Int64, name String, age Int64, created_at Date, updated_at Date) ENGINE = ReplacingMergeTree(updated_at) partition by created_at order by id settings enable_vertical_merge_algorithm = 1, vertical_merge_algorithm_min_rows_to_activate = 1, vertical_merge_algorithm_min_columns_to_activate = 1;
:) insert into pk_local values(2, 'old', 12, '1998-03-11', '1998-03-11');
-- ๅ
ณ้ฎๆๅจ๏ผๆๅ
ฅไธๆกไนๅ้่ฆๅไธๆฌกไผๅ่กจ
-- the most important sql. do optimize after insert every time, then replacing will work
:) optimize table pk_local;
:) insert into pk_local values(2, 'new', 12, '1998-03-11', '1998-03-13');
:) select * from pk_local;
โโidโโฌโnameโโฌโageโโฌโcreated_atโโฌโupdated_atโโ
โ 2 โ old โ 12 โ 1998-03-11 โ 1998-03-11 โ
โโโโโโดโโโโโโโดโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโ
โโidโโฌโnameโโฌโageโโฌโcreated_atโโฌโupdated_atโโ
โ 2 โ new โ 12 โ 1998-03-11 โ 1998-03-13 โ
โโโโโโดโโโโโโโดโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโ
:) optimize table pk_local;
:) select * from pk_local;
โโidโโฌโnameโโฌโageโโฌโcreated_atโโฌโupdated_atโโ
โ 2 โ new โ 12 โ 1998-03-11 โ 1998-03-13 โ
โโโโโโดโโโโโโโดโโโโโโดโโโโโโโโโโโโโดโโโโโโโโโโโโโ
ๅฆๆๆฏไธญๅฝ็ๅๅญฆ๏ผๅฏไปฅๅ ๅพฎไฟก่ฎจ่ฎบ wu-shuiyong๏ผ่ฏท้ไธๅๅญๅๅ
ฌๅธ๏ผ่ฐข่ฐข๏ผ
Most helpful comment
Recently FINAL (both SELECT with FINAL and OPTIMIZE with FINAL) wasn't doing any transformations while merging in the case of single data part.
This is considered Ok because it's expected that data modifications are inserted in separate INSERTions (so the data that should be replaced/collapsed/aggregated/etc is located in multiple different data parts).
But for convenience, I have modified this behaviour. Now data is merged in case of FINAL even if there is just one part.
Committed in master.