Ksql: Cannot create aggregate without a GROUP BY

Created on 1 Nov 2017 · 15Comments · Source: confluentinc/ksql

KSQL insists on having a GROUP BY column, even if logically it makes sense without.

For example: how many tweets were there in each hourly period?

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>

enhancement fix-it-week

Source

rmoff

👍17

Most helpful comment

Workaround:

Step 1: Create a dummy column:

CREATE STREAM TWITTER2 AS SELECT 1 AS FOO,* FROM TWITTER;

Step 2: Group by dummy column:

SELECT FOO,COUNT(*) FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;

Optionally, instantiate as a table and view window / aggregate:

CREATE TABLE TWEETS_PER_HOUR AS SELECT FOO,COUNT(*) AS TWEET_COUNT FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;
SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS') AS WINDOW_START , TWEET_COUNT FROM TWEETS_PER_HOUR;
2017-11-01 15:00:00.000 | 165
2017-11-01 16:00:00.000 | 72
...

rmoff on 1 Nov 2017

👍7

All 15 comments

Workaround:

Step 1: Create a dummy column:

CREATE STREAM TWITTER2 AS SELECT 1 AS FOO,* FROM TWITTER;

Step 2: Group by dummy column:

SELECT FOO,COUNT(*) FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;

Optionally, instantiate as a table and view window / aggregate:

CREATE TABLE TWEETS_PER_HOUR AS SELECT FOO,COUNT(*) AS TWEET_COUNT FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;
SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS') AS WINDOW_START , TWEET_COUNT FROM TWEETS_PER_HOUR;
2017-11-01 15:00:00.000 | 165
2017-11-01 16:00:00.000 | 72
...

rmoff on 1 Nov 2017

👍7

My suggestion:

Ideally, can we find a way to support the original (non-grouped) query out of the box in some form?
Alternatively, can we improve the error message and e.g. add information on the workaround that @rmoff mentioned (or whatever else we deem appropriate)? I'd doubt that the majority of users would easily and quickly be able to come up with that workaround themselves.

miguno on 5 Feb 2018

I'd strongly +1 adding workaround info to the error, if this can't be implemented quickly. Seasoned SQL hackers will be used to 'tricks' like this, but most others won't (and will totally expect KSQL to support this, and be disappointed if it doesn't).

rmoff on 5 Feb 2018

One of the issue is that we are dealing with partitioned data, so in order to make this work we'd need to take all the data from the partitions and pipe it into a single topic partition to perform the count etc operations. For some operations, i.e, sum, count it could be done in two phases, i.e., on the partitions and then again as a final aggregate on the single topic partition, however, this won't work for all operations, i.e, if you are calculating an average.
That said, you could do this on hive. It would just generate the MapReduce job with a single aggregator at the end (which is essentially the same thing we'd need to do to support it).

dguy on 7 Feb 2018

After talking to @dguy offline, here's my understanding:

There is a workaround as @rmoff showed, but users may not know about it.
For the general case the desired functionality requires, essentially, the input data to be written to a single partition. This works, of course, but is not super efficient. (Good news is that some functions such as SUM and COUNT can be optimized so they are more efficient.)
Both the workaround and a possible OOTB functionality suffer from the same inefficiency described in point 2. The difference between the workaround and OOTB would be that, well, OOTB has a much better UX. But there's no technical hurdle to provide this OOTB (just to mention this clearly).

Given that we are busy with other work right now, I'd suggest:

Short-term: We provide a more descriptive and actionable error message, e.g. possibly telling users about the workaround.
We'll work on supporting this OOTB (no workaround needed any longer) in the future, i.e. once we can move this ticket to the top of our backlog.

miguno on 7 Feb 2018

👍2

WDYT @rmoff ?

miguno on 7 Feb 2018

+1 Sounds like a good approach to me (better error message immediately with workaround, longer term provide the functionality OOTB).

rmoff on 8 Feb 2018

@miguno @hjafarpour If a verbose error message with the workaround is not an option, then I would advocate just giving a URL to this issue, so that people can find the workaround (and track progress on implementing the feature) for themselves.

So instead of:

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>

ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause. See https://github.com/confluentinc/ksql/issues/430.
ksql>

rmoff on 22 Feb 2018

👍2

@dguy what's the issue with calculating average? Can not each partition track a count and a sum, which can then be passed to a final aggregate stage?

I'm assuming this would require enhancements to KStreams, but I can't see any reason why it wouldn't work. (except issues with numerical overflows of course).

big-andy-coates on 1 Mar 2018

It doesn't need anything on the streams side it is just another object you write to a topic that gets consumed and can be used to aggregate. It just isn't currently supported in KSQL

dguy on 1 Mar 2018

Was wondering if there's any update on this issue after half a year?

Jianyi-Ren on 1 Aug 2018

Will there be any update on this issue? I think it will be really useful to have this soon in KSQL.

janicesn0w on 5 Apr 2019

👍5

Hey @rmoff , just checking in on this. Is the work around still the solution to the problem? In that case, I can just send in a dummy constant value in my original topic and use that for group by in aggregates.