I'm trying to aggregate over the results of a query with LIMIT BY. Building upon the example from docs, I'm running the following query:
SELECT count() FROM limit_by LIMIT 1 BY id
This query results in the following error:
Code: 215. DB::Exception: Received from clickhouse:9000. DB::Exception: Column `id` is not under aggregate function and not in GROUP BY.
I'm getting the same result with any other aggregate function I tried using. Based on LIMIT BY documentation, I would expect this query to keep at most 1 row for each ID and then count the number of rows.
LIMIT BY has no sense without ORDER BYSELECT count(), id FROM limit_by , and it needs group by id section.Seems you need
SELECT uniqExact(id) FROM limit_by
ββuniqExact(id)ββ
β 2 β
βββββββββββββββββ
or the same in ANSI SQL
SELECT count(distinct id) FROM limit_by
ββuniqExact(id)ββ
β 2 β
βββββββββββββββββ
This query is ineffective , and needs select from select because does grouping by different columns (group by id) group by 1
SELECT count() from (select 1 FROM limit_by limit 1 by id)
ββcount()ββ
β 2 β
βββββββββββ
Thanks for the quick answer! Understanding that id is being added as a queried column helps, and the suggested query solves my question.
The underlying problem I'm trying to solve is slightly more complex β I'd like to perform uniqExact over one column, but aggregate over other columns:
SELECT countIf(val > 15), countIf(val2 > 20), ... FROM limit_by ORDER BY id LIMIT 1 BY id
Is there an easy way to do it?
Even more specifically (even though this should be a separate question), I'm trying to "undo" the effect of ARRAY JOIN. I have a table with an array and I use ARRAY JOIN for filtering, but then I'd like to aggregate over the original set of rows without the extra rows created by ARRAY JOIN:
SELECT countIf(cond1), countIf(cond2), ...
FROM tbl ARRAY JOIN arr WHERE arr IN (SELECT ...)
I was hoping to use LIMIT BY to remove the extra rows. This should be rather efficient, as I guess the table produced by ARRAY JOIN is already sorted appropriately for this. But is there perhaps a better solution?
Please provide the data sample and desired result.
I don't think you need limit by / ARRAY JOIN . From your description you need just count / group by
Details (I tried to simplify it and remove all unrelated stuff):
CREATE TABLE tbl (id UInt32, data UInt32, related Array(UInt32)) ENGINE MergeTree ORDER BY id;
INSERT INTO tbl VALUES (1, 1, []), (2, 2, [1]), (3, 3, [1]), (4, 4, [1, 2])
SELECT * FROM tbl
ββidββ¬βdataββ¬βrelatedββ
β 1 β 1 β [] β
β 2 β 2 β [1] β
β 3 β 3 β [1] β
β 4 β 4 β [1,2] β
ββββββ΄βββββββ΄ββββββββββ
Here, data field stores event data (in reality, I have many columns), and related field is a list of ids of other rows in this same table related to this row. The relation is directional. What I'm trying to accomplish with my query is to:
data < 3 in the example below).data > 2 and data > 3 in the example below β in reality, I have 10-30 sets of conditions).Here's my query:
SELECT countIf(data>2) AS c1, countIf(data>3) AS c2
FROM tbl ARRAY JOIN related
WHERE related IN (SELECT id FROM tbl WHERE data<3)
The output is:
ββc1ββ¬βc2ββ
β 3 β 2 β
ββββββ΄βββββ
What I'd like to get is this:
ββc1ββ¬βc2ββ
β 2 β 1 β
ββββββ΄βββββ
The problem is that the event with id=4 is counted twice because it's row is duplicated by the ARRAY JOIN operation.
What I tried
The following query produces the expected outcome, but is about 4x slower on my data that the query above:
SELECT countIf(data>2) AS c1, countIf(data>3) AS c2
FROM (
SELECT * FROM tbl ARRAY JOIN related WHERE related IN (SELECT id FROM tbl WHERE data<3) LIMIT 1 BY id
)
I also tried to avoid using ARRAY JOIN altogether and replace it with hasAny:
SELECT countIf(data>2) AS c1, countIf(data>3) AS c2
FROM tbl WHERE hasAny(related, (SELECT groupArray(id) FROM tbl WHERE data<3))
This works, but is VERY slow and actually runs our of memory on my dataset.
Probably there is no way to avoid arrayJoin
you can use uniqExactIf(id,data>2) (or uniqExactIf(data,data>2))
SELECT
uniqExactIf(id, data > 2) AS c1,
uniqExactIf(id, data > 3) AS c2
FROM tbl
WHERE arrayJoin(related) IN
(
SELECT id
FROM tbl
WHERE data < 3
)
ββc1ββ¬βc2ββ
β 2 β 1 β
ββββββ΄βββββ
this probably will be slower
SELECT
countIf(data > 2) AS c1,
countIf(data > 3) AS c2
FROM tbl
WHERE id IN
(
SELECT id
FROM tbl
WHERE arrayJoin(related) IN
(
SELECT id
FROM tbl
WHERE data < 3
)
)
ββc1ββ¬βc2ββ
β 2 β 1 β
ββββββ΄βββββ
Awesome, uniqExactIf is exactly what I needed! It made me appreciate the power of combinations, thanks.
In the meantime, I found another solution that almost works and is about 50% faster:
SELECT countIf(data>2) AS c1, countIf(data>3) AS c2
FROM tbl ARRAY JOIN related
WHERE related IN (SELECT id FROM tbl WHERE data<3)
AND runningDifferenceStartingWithFirstValue(id) != 0
On this dataset, it still produces a wrong result, and on my large dataset if produces the result somewhere in-between the correct one and the one with duplicates.
What's weird is this inconsistency. When I add runningDifferenceStartingWithFirstValue to select, it works exactly as I'd expect β now if I'd filter it by !=0, I'd get my result without the need to use uniqExact:
SELECT
id,
runningDifferenceStartingWithFirstValue(id) AS d
FROM tbl
ARRAY JOIN related
WHERE related IN
(
SELECT id
FROM tbl AS src
WHERE data < 3
)
ββidββ¬βdββ
β 2 β 2 β
β 3 β 1 β
β 4 β 1 β
β 4 β 0 β
ββββββ΄ββββ
However, the moment I add this column to WHERE (keeping the rest of the query the same), the value of the output changes!
SELECT
id,
runningDifferenceStartingWithFirstValue(id) AS d
FROM tbl
ARRAY JOIN related
WHERE (related IN
(
SELECT id
FROM tbl AS src
WHERE data < 3
)) AND (d != 0)
ββidββ¬βdββ
β 2 β 1 β
β 3 β 1 β
β 4 β 1 β
β 4 β 1 β
ββββββ΄ββββ
How could this be possible?
Actually even more generic solution could be neighbor(id, -1) != id, but it has the same problem: it works in SELECT but not in WHERE.
hmm,
SELECT id
FROM
(
SELECT
id,
(runningDifferenceStartingWithFirstValue(id)) AS d
FROM tbl
ARRAY JOIN related
WHERE related IN
(
SELECT id
FROM tbl AS src
WHERE data < 3
)
)
WHERE d != 0
ββidββ
β 2 β
β 3 β
β 4 β
ββββββ
That indeed works! In fact, even this works:
SELECT *
FROM
(
SELECT *
FROM tbl AS dst
ARRAY JOIN related
WHERE related IN
(
SELECT id
FROM tbl AS src
WHERE data < 3
)
)
WHERE runningDifferenceStartingWithFirstValue(id) != 0
ββidββ¬βdataββ¬βrelatedββ
β 2 β 2 β 1 β
β 3 β 3 β 1 β
β 4 β 4 β 1 β
ββββββ΄βββββββ΄ββββββββββ
I think I understand why: I guess clickhouse executes all branches of WHERE in parallel, so when runningDifferenceStartingWithFirstValue is executed in the same query as ARRAY JOIN, it ends up being executed on a table state before array join. A subquery forces the order of execution and solves this problem.
Subquery does add overhead, it's slower than my original (and incorrect) one by ~20%. The version with uniqExactIf is faster when the number of times I invoke it is low, but with a list of 10-20 invocations it gets slower. I guess I'll go with a subquery and runningDifferenceStartingWithFirstValue for now. Perhaps I'll look at the source code later and see if there are any obvious ways to "undo" the arrayJoin there.
Thanks a lot for your help!
Most helpful comment
LIMIT BYhas no sense withoutORDER BYSELECT count(), id FROM limit_by, and it needsgroup by idsection.Seems you need
or the same in ANSI SQL
This query is ineffective , and needs
select from selectbecause does grouping by different columns(group by id) group by 1