Clickhouse: `select countMerge from ReplicatedAggregatingMergeTree` get doubled results

Created on 21 Oct 2020  路  3Comments  路  Source: ClickHouse/ClickHouse

Hi all.
Recently I started using clickhouse and I have some troubles.
I create local MV on local table containing complete data (called it tbl), and create distributed MV based on local MV.
I use cluster with 3 shards and each shard has an extra replication, thus there are 6 servers in total.

Local table containing complete data, or tbl uses ReplicatedMergeTree as engine. Local MV uses ReplicatedAggregatingMergeTree as engine.
In addition, I recreated every MVs on test database with POPULATE. In this way, there is no possible of duplicated insertion.

The problem is that,
when I select countMerge from distributed MV, I got twice the correct answer (i,e., if the correct answer is 50, I got 100.),
while select uniqExactMerge from distributed MV will give a correct result.

Here is the script:

CREATE DATABASE IF NOT EXISTS test ON CLUSTER cc_cluster;
-- local MV
CREATE MATERIALIZED VIEW
IF NOT EXISTS
test.user_event_stat_scene_mv_local_test_v2
ON CLUSTER cc_cluster
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{layer}-{shard}/test.user_event_stat_scene_mv_local_test_v2', '{replica}')
PARTITION BY (dt)
ORDER BY (dt, scene)
POPULATE
AS select
    countState(1) as expos,
    countState(if(click>0, 1, null)) as clicks,
    sumState(if(click=1, watch, 0)) as dr,
    countState(if(click=1 and tbl.cost_cnt>0, 1, null)) as cost_cnt,
    sumState(if(click=1, tbl.cost_total, 0)) as cost_total,
    countState(if(click=1 and tbl.chat_cnt>0, 1, null)) as chat_cnt,

    uniqExactState(recom_token) as item_expos,
    uniqExactState(if(click=1, recom_token, null)) as item_clicks,
    uniqExactState(uid) as user_expos,
    uniqExactState(if(click=1, uid, null)) as user_clicks,
    uniqExactState(if(click=1 and tbl.cost_cnt>0, uid, null)) as user_costs,
    uniqExactState(if(click=1 and tbl.chat_cnt>0, uid, null)) as user_chats,
    scene,
    toFixedString(dt, 8) as dt
FROM recom_stats_dws.user_event_log_day_local as tbl
GROUP BY dt, scene;

-- distributed MV
CREATE TABLE IF NOT EXISTS
test.user_event_stat_scene_mv_all_test_v2
ON CLUSTER cc_cluster
AS test.user_event_stat_scene_mv_local_test_v2
ENGINE = Distributed(cc_cluster, test, user_event_stat_scene_mv_local_test_v2, rand());


-- query
select  \
    countMerge(expos) as expos,  \
    countMerge(clicks) as clicks, \
    sumMerge(dr) as dr, \
    countMerge(cost_cnt) as cost_cnt, \
    sumMerge(cost_total) as cost_total, \
    countMerge(chat_cnt) as chat_cnt, \
    uniqExactMerge(item_expos) as item_expos, \
    uniqExactMerge(item_clicks) as item_clicks, \
    uniqExactMerge(user_expos) as user_expos, \
    uniqExactMerge(user_clicks) as user_clicks, \
    uniqExactMerge(user_costs) as user_costs, \
    uniqExactMerge(user_chats) as user_chats, \
    scene, \
    dt \
FROM test.user_event_stat_scene_mv_all_test_v2 as tbl \
GROUP BY dt, scene \
order by dt, scene;

PS:
local MV = test.user_event_stat_scene_mv_local_test_v2
distributed MV = test.user_event_stat_scene_mv_all_test_v2
local table containing complete data = recom_stats_dws.user_event_log_day_local

Which part may I did something wrong?
hope to get your help XD

question question-answered

All 3 comments

You have executed POPULATE twice. At each replica. It doubled the data.

You need to run POPULATE only at one replica.

on_cluster -- is just a helper. It executes this CREATE MAT.... at all nodes.

then CREATE MAT.... POPULATE is executed at node N1 it does POPULATE over all data
then CREATE MAT.... POPULATE is executed at node N2 (which is replica for N1) and it does POPULATE over all the same data as N1
and you get all duplicated in MV

I never use POPULATE becase it is unusable. You can insert data into MV by yourself.

@DouMiaoO-Oo I think you can close issue ;)

Was this page helpful?
0 / 5 - 0 ratings