Clickhouse: Creating a materialized view for calculating histogram data

Created on 18 Dec 2019 · 4Comments · Source: ClickHouse/ClickHouse

I have created a table:

CREATE TABLE results
(
    id UUID,
    date_time DateTime,
    item_id UInt32,
    value UInt16
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id);

And I want to create a materialized view to store hourly histogram data for value. For example;

I expect an output like this:

toStartOfHour          item_id    value    count
2019-12-18 00:00:00    1          0        4       /* number of rows with value between 0 and 100 and date_time between 2019-12-18 00:00:00 and 2019-12-18 01:00:00 */
2019-12-18 00:00:00    1          100      7       /* number of rows with value between 100 and 200 and date_time between 2019-12-18 00:00:00 and 2019-12-18 01:00:00 */

Number of rows with value between 100 and 0, and date_time between 2019-12-18 00:00:00 and 2019-12-18 01:00:00. I have tried something like this:

CREATE MATERIALIZED VIEW results_histogram_by_hour
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id)
POPULATE
AS SELECT toStartOfHour(date_time) AS date_time,
          item_id,
          multiply(floor(value / 100), 100) AS value,
          countState() AS count
FROM results
GROUP BY date_time,
         item_id,
         value;

This materialized view definition works when populated. But with time and new rows, it gets wrong. How wrong? I don't know. I couldn't find a pattern.

I am not sure if I found a bug on clickhouse or I am doing something wrong.

Is my materialized view definition correct?

question

Source

th0th

Most helpful comment

ENGINE = AggregatingMergeTree() ... ORDER BY (date_time, item_id)

Should be ORDER BY (date_time, item_id, value)

In your case value have been corrupted during merge -- reduced by any function over (date_time, item_id)

You don't need AggregatingMergeTree + countState -- it's overkill
SummingMergeTree / sum(toUInt64(1)) as cnt is enough.
Or simpleAggregationFunction sum + AggregatingMergeTree

den-crane on 18 Dec 2019

👍2

All 4 comments

ENGINE = AggregatingMergeTree() ... ORDER BY (date_time, item_id)

Should be ORDER BY (date_time, item_id, value)

In your case value have been corrupted during merge -- reduced by any function over (date_time, item_id)

You don't need AggregatingMergeTree + countState -- it's overkill
SummingMergeTree / sum(toUInt64(1)) as cnt is enough.
Or simpleAggregationFunction sum + AggregatingMergeTree

den-crane on 18 Dec 2019

👍2

Though maybe you had another idea. Pivoted table with counts in array in a row.
It can be achieved using https://clickhouse.yandex/docs/en/operations/table_engines/summingmergetree/#nested-structures

den-crane on 18 Dec 2019

👍1

ENGINE = AggregatingMergeTree() ... ORDER BY (date_time, item_id)

Should be ORDER BY (date_time, item_id, value)

In your case value have been corrupted during merge -- reduced by any function over (date_time, item_id)

You don't need AggregatingMergeTree + countState -- it's overkill
SummingMergeTree / sum(toUInt64(1)) as cnt is enough.
Or simpleAggregationFunction sum + AggregatingMergeTree

Thanks for the reply @den-crane! I've recreated the materialized view with the updated ORDER BY part. It seems to be correctly populated from the existing data. Hope it will work with the newly generated data, too 🤞

Can you please share CREATE TABLE query example for SummingMergeTree / sum(toUInt64(1)) as count and "simpleAggregationFunction sum + AggregatingMergeTree"?

th0th on 19 Dec 2019

SummingMergeTree

CREATE MATERIALIZED VIEW results_histogram_by_hour
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id, value)
POPULATE
AS SELECT toStartOfHour(date_time) AS date_time,
          item_id,
          intDiv(value,100) * 100 AS value,
          sum(toUInt64(1)) AS count
FROM results
GROUP BY date_time,
         item_id,
         value;

AggregatingMergeTree + simpleAggregationFunction

CREATE MATERIALIZED VIEW results_histogram_by_hour
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(date_time)
ORDER BY (date_time, item_id, value)
POPULATE
AS SELECT toStartOfHour(date_time) AS date_time,
          item_id,
          intDiv(value,100) * 100 AS value,
          cast(sum(toUInt64(1)), 'SimpleAggregateFunction(SUM, Int64)') AS count
FROM results
GROUP BY date_time,
         item_id,
         value;