Hello,
we have observed that the aggregate merge results (from the combination of -State and -Merge functions) in a materialized view are accurate only when grouping or filtering by fields that are contained within the primary key of the view. When filtering (either via -If or via where combination) the calculated numbers are vastly different (orders of magnitude) compared to the ones calculated on the original table.
Is this expected behavior? Unfortunately I do not have time at the moment to try and setup a toy example, but if someone is interested I would be able to show this on our data.
It sounds like a problem that should be investigated.
But, It isn't easy to understand the problem without an example.
The AggregatingMergeTree allows you to create MV with more fields than are in the primary key.
Like:
CREATE MATERIALIZED VIEW ... AggregatingMergeTree((date), (date, page))
AS SELECT date, page, src, countState() AS queryCount
FROM source
GROUP BY date, page, src
It will surprisingly work fine until the first merge when it will somehow miss most of the countState,
without throwing any errors. Maybe you made the same mistake I did and primary key is inconsistent with the grouping?
Hi vavrusa,
yes, this is the case for me as well. I am wondering whether this is a bug of the engine or a limitation that should be documented (and perhaps, if this is actually the criterion behind, a query on a MV that is known to return wrong results should throw an error beforehand?).
edit: to expand on my comment - what I mean is that if the -State functions combine correctly only when grouped by fields that are also in the primary key, I believe the query should not be allowed at all
Looks like I have the same problem using SummingMergeTree.
@ludv1x, do you have any plans to change this behaviour or state it in the documentaion?
ClickHouse replaces all rows with the same primary key (or more accurately, with the same sorting key) with a single row (within a one data part) that stores a combination of states of aggregate functions.
Cf: https://clickhouse.yandex/docs/en/operations/table_engines/aggregatingmergetree/
For me, the documentation is clear enough when you know what you're looking for.
Do you propose to add a warning in the documentation to explain that if some keys are missing from the ORDER BY clause in the AggregatingMergeTree definition, they'll cause any subsequent issues when using those missing fields in GROUP BY clause later on the AggregatingMergeTree table?
Or is the documentation clear enough for everyone and we can close this issue?
I think the docs should be updated to be explicit about requiring the all the key fields in the GROUP BY clause. I ran into this when I was first using it too.
Following up here, in addition to the data being inconsistent in the result, it's also completely nonsensical and should throw an error. Take below for example. We're filtering on hour, but incorrectly were grouping by more than what's in our sorting key: (sourceID, hour). The first query correctly returns no results, however the subsequent query returns pretty much random data from the table/materialized view.
clickhouse-staging-0f6a61f82b85b6edf :) ...
SELECT
sourceID,
type,
name,
hour,
sumMerge(countState) AS seenCountByEvent,
sumMerge(enforcedEventsCountState) AS enforcedEventsCountByEvent
FROM default.hourly_events
WHERE (hour < '2019-04-29 00:00:00') AND (sourceID = '0xdeadbeef')
GROUP BY (sourceID, hour, type, name)
Ok.
0 rows in set. Elapsed: 0.002 sec.
clickhouse-staging-0f6a61f82b85b6edf :) ...
SELECT
sourceID,
type,
name,
hour,
sumMerge(countState) AS seenCountByEvent,
sumMerge(enforcedEventsCountState) AS enforcedEventsCountByEvent
FROM default.hourly_events
WHERE (hour < '2019-04-29 00:00:00') AND sourceID = '0xdeadbeef'
GROUP BY (sourceID, hour, type, name)
ββsourceIDββββ¬βtypeβββ¬βnameβββββββββββββββββββββββββ¬ββββββββββββββββhourββ¬βseenCountByEventββ¬βenforcedEventsCountByEventβββ
β 0xdeadbeef β page β null β 2019-05-01 02:00:00 β 5099 β 5099 β
β 0xdeadbeef β page β Lorem Ipsum β 2019-05-01 01:00:00 β 7160 β 7160 β
β 0xdeadbeef β track β Dolor Sit β 2019-05-01 03:00:00 β 1 β 1 β
β 0xdeadbeef β track β Consectetur Adipiscing β 2019-05-01 03:00:00 β 1247 β 1247 β
β 0xdeadbeef β track β Eiusmod Tempor β 2019-05-01 00:00:00 β 1995 β 1995 β
ββββββββββββββ΄ββββββββ΄ββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββ΄βββββββββββββββββββ΄ββββββββββββββββββββββββββββββ
5 rows in set. Elapsed: 0.004 sec. Processed 4.33 thousand rows, 393.95 KB (1.10 million rows/s., 100.27 MB/s.)
The code here suggests that it shouldn't be possible to GROUP BY in a select query where the fields in the group by aren't included in the original primary key for the materialized view:
@ludv1x Are you going to do something about it? It's actually a problem. At least, we have to notice about this in the documentation about AggregatingMergeTree.
Same issue here. Resolved when GROUP BY clause contains same columns as view sorting key.
Most helpful comment
I think the docs should be updated to be explicit about requiring the all the key fields in the
GROUP BYclause. I ran into this when I was first using it too.