Clickhouse: Accuracy of AggregatingMergeTree Materialized View outside of primary key

Created on 2 May 2017 · 10Comments · Source: ClickHouse/ClickHouse

Hello,

we have observed that the aggregate merge results (from the combination of -State and -Merge functions) in a materialized view are accurate only when grouping or filtering by fields that are contained within the primary key of the view. When filtering (either via -If or via where combination) the calculated numbers are vastly different (orders of magnitude) compared to the ones calculated on the original table.

Is this expected behavior? Unfortunately I do not have time at the moment to try and setup a toy example, but if someone is interested I would be able to show this on our data.

usability

Source

andrea-s

👍4

Most helpful comment

I think the docs should be updated to be explicit about requiring the all the key fields in the GROUP BY clause. I ran into this when I was first using it too.

abraithwaite on 28 Feb 2019

👍2

All 10 comments

It sounds like a problem that should be investigated.
But, It isn't easy to understand the problem without an example.

ludv1x on 2 May 2017

The AggregatingMergeTree allows you to create MV with more fields than are in the primary key.
Like:

CREATE MATERIALIZED VIEW ... AggregatingMergeTree((date), (date, page))
AS SELECT date, page, src, countState() AS queryCount
FROM source
GROUP BY date, page, src

It will surprisingly work fine until the first merge when it will somehow miss most of the countState,
without throwing any errors. Maybe you made the same mistake I did and primary key is inconsistent with the grouping?

vavrusa on 2 May 2017

Hi vavrusa,

yes, this is the case for me as well. I am wondering whether this is a bug of the engine or a limitation that should be documented (and perhaps, if this is actually the criterion behind, a query on a MV that is known to return wrong results should throw an error beforehand?).

edit: to expand on my comment - what I mean is that if the -State functions combine correctly only when grouped by fields that are also in the primary key, I believe the query should not be allowed at all

andrea-s on 3 May 2017

Looks like I have the same problem using SummingMergeTree.
@ludv1x, do you have any plans to change this behaviour or state it in the documentaion?

Klimmy on 2 Mar 2018

ClickHouse replaces all rows with the same primary key (or more accurately, with the same sorting key) with a single row (within a one data part) that stores a combination of states of aggregate functions.

Cf: https://clickhouse.yandex/docs/en/operations/table_engines/aggregatingmergetree/

For me, the documentation is clear enough when you know what you're looking for.

Do you propose to add a warning in the documentation to explain that if some keys are missing from the ORDER BY clause in the AggregatingMergeTree definition, they'll cause any subsequent issues when using those missing fields in GROUP BY clause later on the AggregatingMergeTree table?

Or is the documentation clear enough for everyone and we can close this issue?

PoulpiFr on 28 Jan 2019

I think the docs should be updated to be explicit about requiring the all the key fields in the GROUP BY clause. I ran into this when I was first using it too.

abraithwaite on 28 Feb 2019

👍2

Following up here, in addition to the data being inconsistent in the result, it's also completely nonsensical and should throw an error. Take below for example. We're filtering on hour, but incorrectly were grouping by more than what's in our sorting key: (sourceID, hour). The first query correctly returns no results, however the subsequent query returns pretty much random data from the table/materialized view.

clickhouse-staging-0f6a61f82b85b6edf :) ...

SELECT
    sourceID,
    type,
    name,
    hour,
    sumMerge(countState) AS seenCountByEvent,
    sumMerge(enforcedEventsCountState) AS enforcedEventsCountByEvent
FROM default.hourly_events
WHERE (hour < '2019-04-29 00:00:00') AND (sourceID = '0xdeadbeef')
GROUP BY (sourceID, hour, type, name)

Ok.

0 rows in set. Elapsed: 0.002 sec.

clickhouse-staging-0f6a61f82b85b6edf :) ...

SELECT
    sourceID,
    type,
    name,
    hour,
    sumMerge(countState) AS seenCountByEvent,
    sumMerge(enforcedEventsCountState) AS enforcedEventsCountByEvent
FROM default.hourly_events
WHERE (hour < '2019-04-29 00:00:00') AND sourceID = '0xdeadbeef'
GROUP BY (sourceID, hour, type, name)

┌─sourceID───┬─type──┬─name────────────────────────┬────────────────hour─┬─seenCountByEvent─┬─enforcedEventsCountByEvent──┐
│ 0xdeadbeef │ page  │ null                        │ 2019-05-01 02:00:00 │             5099 │                        5099 │
│ 0xdeadbeef │ page  │ Lorem Ipsum                 │ 2019-05-01 01:00:00 │             7160 │                        7160 │
│ 0xdeadbeef │ track │ Dolor Sit                   │ 2019-05-01 03:00:00 │                1 │                           1 │
│ 0xdeadbeef │ track │ Consectetur Adipiscing      │ 2019-05-01 03:00:00 │             1247 │                        1247 │
│ 0xdeadbeef │ track │ Eiusmod Tempor              │ 2019-05-01 00:00:00 │             1995 │                        1995 │
└────────────┴───────┴─────────────────────────────┴─────────────────────┴──────────────────┴─────────────────────────────┘

5 rows in set. Elapsed: 0.004 sec. Processed 4.33 thousand rows, 393.95 KB (1.10 million rows/s., 100.27 MB/s.)

abraithwaite on 1 May 2019

👍1

The code here suggests that it shouldn't be possible to GROUP BY in a select query where the fields in the group by aren't included in the original primary key for the materialized view:

https://github.com/yandex/ClickHouse/blob/83a35d7644cad823533c3a6459c8c777f9d10481/dbms/src/DataStreams/AggregatingSortedBlockInputStream.cpp#L31-L43

abraithwaite on 1 May 2019

@ludv1x Are you going to do something about it? It's actually a problem. At least, we have to notice about this in the documentation about AggregatingMergeTree.