I have created a table:
CREATE TABLE results
(
id UUID,
date_time DateTime,
item_id UInt32,
value UInt16
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id);
And I want to create a materialized view to store hourly histogram data for value. For example;
I expect an output like this:
toStartOfHour item_id value count
2019-12-18 00:00:00 1 0 4 /* number of rows with value between 0 and 100 and date_time between 2019-12-18 00:00:00 and 2019-12-18 01:00:00 */
2019-12-18 00:00:00 1 100 7 /* number of rows with value between 100 and 200 and date_time between 2019-12-18 00:00:00 and 2019-12-18 01:00:00 */
Number of rows with value between 100 and 0, and date_time between 2019-12-18 00:00:00 and 2019-12-18 01:00:00. I have tried something like this:
CREATE MATERIALIZED VIEW results_histogram_by_hour
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id)
POPULATE
AS SELECT toStartOfHour(date_time) AS date_time,
item_id,
multiply(floor(value / 100), 100) AS value,
countState() AS count
FROM results
GROUP BY date_time,
item_id,
value;
This materialized view definition works when populated. But with time and new rows, it gets wrong. How wrong? I don't know. I couldn't find a pattern.
I am not sure if I found a bug on clickhouse or I am doing something wrong.
Is my materialized view definition correct?
ENGINE = AggregatingMergeTree() ... ORDER BY (date_time, item_id)
Should be ORDER BY (date_time, item_id, value)
In your case value have been corrupted during merge -- reduced by any function over (date_time, item_id)
You don't need AggregatingMergeTree + countState -- it's overkill
SummingMergeTree / sum(toUInt64(1)) as cnt is enough.
Or simpleAggregationFunction sum + AggregatingMergeTree
Though maybe you had another idea. Pivoted table with counts in array in a row.
It can be achieved using https://clickhouse.yandex/docs/en/operations/table_engines/summingmergetree/#nested-structures
ENGINE = AggregatingMergeTree() ... ORDER BY (date_time, item_id)
Should be ORDER BY (date_time, item_id, value)
In your case value have been corrupted during merge -- reduced by
anyfunction over (date_time, item_id)You don't need AggregatingMergeTree + countState -- it's overkill
SummingMergeTree / sum(toUInt64(1)) as cnt is enough.
Or simpleAggregationFunction sum + AggregatingMergeTree
Thanks for the reply @den-crane! I've recreated the materialized view with the updated ORDER BY part. It seems to be correctly populated from the existing data. Hope it will work with the newly generated data, too 馃
Can you please share CREATE TABLE query example for SummingMergeTree / sum(toUInt64(1)) as count and "simpleAggregationFunction sum + AggregatingMergeTree"?
CREATE MATERIALIZED VIEW results_histogram_by_hour
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id, value)
POPULATE
AS SELECT toStartOfHour(date_time) AS date_time,
item_id,
intDiv(value,100) * 100 AS value,
sum(toUInt64(1)) AS count
FROM results
GROUP BY date_time,
item_id,
value;
CREATE MATERIALIZED VIEW results_histogram_by_hour
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id, value)
POPULATE
AS SELECT toStartOfHour(date_time) AS date_time,
item_id,
intDiv(value,100) * 100 AS value,
cast(sum(toUInt64(1)), 'SimpleAggregateFunction(SUM, Int64)') AS count
FROM results
GROUP BY date_time,
item_id,
value;
https://clickhouse.yandex/docs/en/operations/table_engines/summingmergetree/#nested-structures
SummingMergeTree able to sum values in K/V arrays (columns should be named ...Map -- valueMap)
https://stackoverflow.com/a/59402031/11644308
Most helpful comment
Should be ORDER BY (date_time, item_id, value)
In your case value have been corrupted during merge -- reduced by
anyfunction over (date_time, item_id)You don't need AggregatingMergeTree + countState -- it's overkill
SummingMergeTree / sum(toUInt64(1)) as cnt is enough.
Or simpleAggregationFunction sum + AggregatingMergeTree