Clickhouse: LIMIT BY clause does not work with aggregations

Created on 14 Jul 2020  Β·  9Comments  Β·  Source: ClickHouse/ClickHouse

I'm trying to aggregate over the results of a query with LIMIT BY. Building upon the example from docs, I'm running the following query:

SELECT count() FROM limit_by LIMIT 1 BY id

This query results in the following error:

Code: 215. DB::Exception: Received from clickhouse:9000. DB::Exception: Column `id` is not under aggregate function and not in GROUP BY.

I'm getting the same result with any other aggregate function I tried using. Based on LIMIT BY documentation, I would expect this query to keep at most 1 row for each ID and then count the number of rows.

performance question question-answered

Most helpful comment

  1. LIMIT BY has no sense without ORDER BY
  2. CH SQL allows to omit grouped columns in select section but in reality you query does SELECT count(), id FROM limit_by , and it needs group by id section.

Seems you need

SELECT uniqExact(id) FROM limit_by
β”Œβ”€uniqExact(id)─┐
β”‚             2 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

or the same in ANSI SQL

SELECT count(distinct id) FROM limit_by
β”Œβ”€uniqExact(id)─┐
β”‚             2 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This query is ineffective , and needs select from select because does grouping by different columns (group by id) group by 1

SELECT count() from (select 1 FROM limit_by limit 1 by id)
β”Œβ”€count()─┐
β”‚       2 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

All 9 comments

  1. LIMIT BY has no sense without ORDER BY
  2. CH SQL allows to omit grouped columns in select section but in reality you query does SELECT count(), id FROM limit_by , and it needs group by id section.

Seems you need

SELECT uniqExact(id) FROM limit_by
β”Œβ”€uniqExact(id)─┐
β”‚             2 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

or the same in ANSI SQL

SELECT count(distinct id) FROM limit_by
β”Œβ”€uniqExact(id)─┐
β”‚             2 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This query is ineffective , and needs select from select because does grouping by different columns (group by id) group by 1

SELECT count() from (select 1 FROM limit_by limit 1 by id)
β”Œβ”€count()─┐
β”‚       2 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Thanks for the quick answer! Understanding that id is being added as a queried column helps, and the suggested query solves my question.

The underlying problem I'm trying to solve is slightly more complex – I'd like to perform uniqExact over one column, but aggregate over other columns:

SELECT countIf(val > 15), countIf(val2 > 20), ... FROM limit_by ORDER BY id LIMIT 1 BY id

Is there an easy way to do it?

Even more specifically (even though this should be a separate question), I'm trying to "undo" the effect of ARRAY JOIN. I have a table with an array and I use ARRAY JOIN for filtering, but then I'd like to aggregate over the original set of rows without the extra rows created by ARRAY JOIN:

SELECT countIf(cond1), countIf(cond2), ...
FROM tbl ARRAY JOIN arr WHERE arr IN (SELECT ...)

I was hoping to use LIMIT BY to remove the extra rows. This should be rather efficient, as I guess the table produced by ARRAY JOIN is already sorted appropriately for this. But is there perhaps a better solution?

Please provide the data sample and desired result.
I don't think you need limit by / ARRAY JOIN . From your description you need just count / group by

Details (I tried to simplify it and remove all unrelated stuff):

CREATE TABLE tbl (id UInt32, data UInt32, related Array(UInt32)) ENGINE MergeTree ORDER BY id;
INSERT INTO tbl VALUES (1, 1, []), (2, 2, [1]), (3, 3, [1]), (4, 4, [1, 2])
SELECT * FROM tbl
β”Œβ”€id─┬─data─┬─related─┐
β”‚  1 β”‚    1 β”‚ []      β”‚
β”‚  2 β”‚    2 β”‚ [1]     β”‚
β”‚  3 β”‚    3 β”‚ [1]     β”‚
β”‚  4 β”‚    4 β”‚ [1,2]   β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Here, data field stores event data (in reality, I have many columns), and related field is a list of ids of other rows in this same table related to this row. The relation is directional. What I'm trying to accomplish with my query is to:

  1. find "source" events matching a specific condition (data < 3 in the example below).
  2. find "destination" events related to these source events
  3. count number of matching destination events that additionally satisfy their own conditions (data > 2 and data > 3 in the example below – in reality, I have 10-30 sets of conditions).

Here's my query:

SELECT countIf(data>2) AS c1, countIf(data>3) AS c2
FROM tbl ARRAY JOIN related
WHERE related IN (SELECT id FROM tbl WHERE data<3)

The output is:

β”Œβ”€c1─┬─c2─┐
β”‚  3 β”‚  2 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

What I'd like to get is this:

β”Œβ”€c1─┬─c2─┐
β”‚  2 β”‚  1 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

The problem is that the event with id=4 is counted twice because it's row is duplicated by the ARRAY JOIN operation.

What I tried

The following query produces the expected outcome, but is about 4x slower on my data that the query above:

SELECT countIf(data>2) AS c1, countIf(data>3) AS c2
FROM (
  SELECT * FROM tbl ARRAY JOIN related WHERE related IN (SELECT id FROM tbl WHERE data<3) LIMIT 1 BY id
)

I also tried to avoid using ARRAY JOIN altogether and replace it with hasAny:

SELECT countIf(data>2) AS c1, countIf(data>3) AS c2
FROM tbl WHERE hasAny(related, (SELECT groupArray(id) FROM tbl WHERE data<3))

This works, but is VERY slow and actually runs our of memory on my dataset.

Probably there is no way to avoid arrayJoin

you can use uniqExactIf(id,data>2) (or uniqExactIf(data,data>2))

SELECT
    uniqExactIf(id, data > 2) AS c1,
    uniqExactIf(id, data > 3) AS c2
FROM tbl
WHERE arrayJoin(related) IN
(
    SELECT id
    FROM tbl
    WHERE data < 3
)
β”Œβ”€c1─┬─c2─┐
β”‚  2 β”‚  1 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

this probably will be slower

SELECT
    countIf(data > 2) AS c1,
    countIf(data > 3) AS c2
FROM tbl
WHERE id IN
(
    SELECT id
    FROM tbl
    WHERE arrayJoin(related) IN
    (
        SELECT id
        FROM tbl
        WHERE data < 3
    )
)

β”Œβ”€c1─┬─c2─┐
β”‚  2 β”‚  1 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”˜

Awesome, uniqExactIf is exactly what I needed! It made me appreciate the power of combinations, thanks.

In the meantime, I found another solution that almost works and is about 50% faster:

SELECT countIf(data>2) AS c1, countIf(data>3) AS c2
FROM tbl ARRAY JOIN related
WHERE related IN (SELECT id FROM tbl WHERE data<3)
  AND runningDifferenceStartingWithFirstValue(id) != 0

On this dataset, it still produces a wrong result, and on my large dataset if produces the result somewhere in-between the correct one and the one with duplicates.

What's weird is this inconsistency. When I add runningDifferenceStartingWithFirstValue to select, it works exactly as I'd expect – now if I'd filter it by !=0, I'd get my result without the need to use uniqExact:

SELECT 
    id,
    runningDifferenceStartingWithFirstValue(id) AS d
FROM tbl
ARRAY JOIN related
WHERE related IN 
(
    SELECT id
    FROM tbl AS src
    WHERE data < 3
)

β”Œβ”€id─┬─d─┐
β”‚  2 β”‚ 2 β”‚
β”‚  3 β”‚ 1 β”‚
β”‚  4 β”‚ 1 β”‚
β”‚  4 β”‚ 0 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”˜

However, the moment I add this column to WHERE (keeping the rest of the query the same), the value of the output changes!

SELECT 
    id,
    runningDifferenceStartingWithFirstValue(id) AS d
FROM tbl
ARRAY JOIN related
WHERE (related IN 
(
    SELECT id
    FROM tbl AS src
    WHERE data < 3
)) AND (d != 0)

β”Œβ”€id─┬─d─┐
β”‚  2 β”‚ 1 β”‚
β”‚  3 β”‚ 1 β”‚
β”‚  4 β”‚ 1 β”‚
β”‚  4 β”‚ 1 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”˜

How could this be possible?

Actually even more generic solution could be neighbor(id, -1) != id, but it has the same problem: it works in SELECT but not in WHERE.

hmm,

SELECT id
FROM
(
    SELECT
        id,
        (runningDifferenceStartingWithFirstValue(id)) AS d
    FROM tbl
    ARRAY JOIN related
    WHERE related IN
    (
        SELECT id
        FROM tbl AS src
        WHERE data < 3
    )
)
WHERE d != 0

β”Œβ”€id─┐
β”‚  2 β”‚
β”‚  3 β”‚
β”‚  4 β”‚
β””β”€β”€β”€β”€β”˜

That indeed works! In fact, even this works:

SELECT *
FROM 
(
    SELECT *
    FROM tbl AS dst
    ARRAY JOIN related
    WHERE related IN 
    (
        SELECT id
        FROM tbl AS src
        WHERE data < 3
    )
)
WHERE runningDifferenceStartingWithFirstValue(id) != 0

β”Œβ”€id─┬─data─┬─related─┐
β”‚  2 β”‚    2 β”‚       1 β”‚
β”‚  3 β”‚    3 β”‚       1 β”‚
β”‚  4 β”‚    4 β”‚       1 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I think I understand why: I guess clickhouse executes all branches of WHERE in parallel, so when runningDifferenceStartingWithFirstValue is executed in the same query as ARRAY JOIN, it ends up being executed on a table state before array join. A subquery forces the order of execution and solves this problem.

Subquery does add overhead, it's slower than my original (and incorrect) one by ~20%. The version with uniqExactIf is faster when the number of times I invoke it is low, but with a list of 10-20 invocations it gets slower. I guess I'll go with a subquery and runningDifferenceStartingWithFirstValue for now. Perhaps I'll look at the source code later and see if there are any obvious ways to "undo" the arrayJoin there.

Thanks a lot for your help!

Was this page helpful?
0 / 5 - 0 ratings