Hello,
I am seeing some strange behaviour with a ReplicatedReplacingMergeTree. I need to store time-series data and cannot have duplicates. As a simple test, I loaded the exact same dataset of ~1.5 million rows twice to see how well duplicates are rejected.
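For reference, a minimal sketch of the kind of table involved (the table name, columns, and ZooKeeper path are illustrative assumptions, not my actual schema). With ReplacingMergeTree, rows count as duplicates when they share the full ORDER BY key:

```sql
-- Hypothetical schema for illustration only.
CREATE TABLE ts_data
(
    ts        DateTime,
    sensor_id UInt32,
    value     Float64
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/ts_data', '{replica}')
PARTITION BY toYYYYMM(ts)
ORDER BY (sensor_id, ts);  -- rows with the same (sensor_id, ts) are deduplicated on merge
```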
Result: `SELECT count() FROM table` returns ~2.3 million rows, so lots of duplicates. `SELECT count() FROM table FINAL` returns exactly the right number.
So duplicates are being correctly flagged, but how do I optimize the table so that they no longer exist, without using FINAL on every query? I have tried `OPTIMIZE TABLE`, but the duplicates remain.
I understand that optimize is an intensive process of re-merging parts to sort this out. I'm happy with that, as it could be run irregularly; duplicates won't occur in normal service.
How do I force the optimize?
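Concretely, the statement I'm running is along these lines (the table name is a placeholder):

```sql
-- A plain OPTIMIZE asks ClickHouse to schedule merges, but it is not
-- guaranteed to merge everything down to a single part per partition.
OPTIMIZE TABLE ts_data;
```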
Thanks for such an impressive system!
Further testing: using a plain ReplacingMergeTree on a single non-replicated machine, I found that `OPTIMIZE TABLE` does not fully deduplicate in one pass.
The logs don't suggest any reason for this. Each time I run it I see it do some tidying, so I'm not sure what is going on. Running it once and leaving it alone doesn't work; after 2 or 3 runs the count(*) is reducing but the table is still not fully deduplicated, while 10+ runs seem to deduplicate this dataset completely.
Further: trying again on a ReplicatedReplacingMergeTree across two servers, OPTIMIZE returns OK on one machine but doesn't do anything, and on the second machine it complains that the user password is not correct (both use the same users.xml, all DDL ON CLUSTER statements work fine, and the data has replicated, so I'm not sure why).
Either way, I can't optimize a ReplicatedReplacingMergeTree, and I can't get rid of the duplicates.
Temporary solution: find every partition that still has more than one active part:

```sql
SELECT
    partition,
    partCount
FROM
(
    SELECT
        partition,
        count(*) AS partCount
    FROM system.parts
    WHERE active AND (table = 'TABLE')
    GROUP BY partition
)
WHERE partCount > 1
```

and run `OPTIMIZE TABLE ... PARTITION <partition>` for each partition returned.
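For a table partitioned by month, the per-partition statement would look like this (the table name and partition value are placeholders):

```sql
-- Merging one partition at a time is much cheaper than
-- optimizing the whole table in a single pass.
OPTIMIZE TABLE ts_data PARTITION 201801;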
Most helpful comment
After a successful run of `OPTIMIZE TABLE ... FINAL` that was done after replication finished, I would expect no duplicates to be guaranteed.
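As a sketch, that forced merge would look like this (the table name is a placeholder). The `FINAL` modifier tells ClickHouse to perform an unconditional merge rather than leaving it to the usual merge heuristics:

```sql
-- Forces a full merge: afterwards each partition holds a single part,
-- so ReplacingMergeTree has dropped the duplicate rows.
OPTIMIZE TABLE ts_data FINAL;
```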