KSQL insists on having a GROUP BY column, even if logically it makes sense without.
For example: how many tweets were there in each hourly period?
ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>
Workaround:
Step 1: Create a dummy column:
CREATE STREAM TWITTER2 AS SELECT 1 AS FOO,* FROM TWITTER;
Step 2: Group by dummy column:
SELECT FOO,COUNT(*) FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;
Optionally, instantiate as a table and view window / aggregate:
CREATE TABLE TWEETS_PER_HOUR AS SELECT FOO,COUNT(*) AS TWEET_COUNT FROM TWITTER2 WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY FOO;
SELECT TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss.SSS') AS WINDOW_START , TWEET_COUNT FROM TWEETS_PER_HOUR;
2017-11-01 15:00:00.000 | 165
2017-11-01 16:00:00.000 | 72
...
My suggestion:
I'd strongly +1 adding workaround info to the error, if this can't be implemented quickly. Seasoned SQL hackers will be used to 'tricks' like this, but most others won't (and will totally expect KSQL to support this, and be disappointed if it doesn't).
One of the issue is that we are dealing with partitioned data, so in order to make this work we'd need to take all the data from the partitions and pipe it into a single topic partition to perform the count etc operations. For some operations, i.e, sum, count it could be done in two phases, i.e., on the partitions and then again as a final aggregate on the single topic partition, however, this won't work for all operations, i.e, if you are calculating an average.
That said, you could do this on hive. It would just generate the MapReduce job with a single aggregator at the end (which is essentially the same thing we'd need to do to support it).
After talking to @dguy offline, here's my understanding:
SUM and COUNT can be optimized so they are more efficient.)Given that we are busy with other work right now, I'd suggest:
WDYT @rmoff ?
+1 Sounds like a good approach to me (better error message immediately with workaround, longer term provide the functionality OOTB).
@miguno @hjafarpour If a verbose error message with the workaround is not an option, then I would advocate just giving a URL to this issue, so that people can find the workaround (and track progress on implementing the feature) for themselves.
So instead of:
ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause.
ksql>
ksql> select count(*) from twitter window tumbling (size 1 hour);
Aggregate query needs GROUP BY clause. See https://github.com/confluentinc/ksql/issues/430.
ksql>
@dguy what's the issue with calculating average? Can not each partition track a count and a sum, which can then be passed to a final aggregate stage?
I'm assuming this would require enhancements to KStreams, but I can't see any reason why it wouldn't work. (except issues with numerical overflows of course).
It doesn't need anything on the streams side it is just another object you write to a topic that gets consumed and can be used to aggregate. It just isn't currently supported in KSQL
Was wondering if there's any update on this issue after half a year?
Will there be any update on this issue? I think it will be really useful to have this soon in KSQL.
Hey @rmoff , just checking in on this. Is the work around still the solution to the problem? In that case, I can just send in a dummy constant value in my original topic and use that for group by in aggregates.
/cc @MichaelDrogalis @derekjn 鈽濓笍
We don't have this one queued up yet, but hope to soon. Patches always welcome in the meantime. 馃檹
Most helpful comment
Workaround:
Step 1: Create a dummy column:
Step 2: Group by dummy column:
Optionally, instantiate as a table and view window / aggregate: