ClickHouse: group by every(N, Unit): aggregator for time windows?

Created on 22 Jan 2017 · 11 Comments · Source: ClickHouse/ClickHouse

Hello,

What drove me to read the ClickHouse code is analysis of big time-series data. It is very common to want to generate reports based on different time periods, say "every 5 hours", "every week", "every 15 days". I have downloaded the ontime demo database and played around with it for some days. I notice that the ontime demo database includes precomputed fields, such as Year, Quarter, DayOfMonth, and DayOfWeek, to solve a similar problem. I am a newbie to ClickHouse, so my questions are:

  1. Is it possible to develop an aggregate function, say EVERY(N, Unit), to solve the case I mentioned?
  2. If yes, how would such an aggregate function perform (compared to a static field in the table schema)?
  3. If yes, how can I start developing such an aggregate function? I have read the docs under doc/developers and learned some basic concepts, but I have no idea how to assemble the code or how to test it (neither functional tests nor performance tests).

Also, in the tests document I found statements that may be related to functional tests, but I could not find a script named clickhouse-test or the 00395_nullable test case.

Issue #163 led me to the right place for the test script and test cases.

Thank you!

All 11 comments

If I understand this correctly, you're looking for a way of grouping by time intervals rather than for an aggregate function itself.

The way KDB+ solves this is xbar: group by intervals, basically.

In Q (the KDB+ query language), select last price by 5 xbar time.hour from daily would select the final price in every 5-hour interval.

I'm not yet sure how this would be implemented in ClickHouse, but it's an approach to consider.
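For comparison, the xbar-style 5-hour bucketing above could be sketched in ClickHouse SQL; the table and column names (`daily`, `time`, `price`) are hypothetical:

```sql
-- Hypothetical analogue of the Q query: last price per 5-hour bucket.
-- Assumes a table `daily` with `time` (DateTime) and `price` (Float64).
SELECT
    intDiv(toRelativeHourNum(time), 5) AS bucket,
    argMax(price, time) AS last_price  -- price at the latest `time` in the bucket
FROM daily
GROUP BY bucket
ORDER BY bucket
```

argMax(price, time) is used instead of anyLast(price) to make "last" deterministic regardless of read order.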

Hi @alpha, you are right, I am looking for some grouping aggregates :) And thanks for the link; it looks very interesting.

These days I have been reading the InfluxDB reference, and I found that InfluxQL has implemented similar aggregates.
The basic sample is

SELECT <function>(<field_key>) FROM_clause WHERE <time_range> GROUP BY time(<time_interval>),[tag_key] [fill(<fill_option>)]

For more examples, see https://docs.influxdata.com/influxdb/v1.1/query_language/data_exploration/#basic-group-by-time-syntax

ClickHouse has functions for rounding Date and DateTime values, such as toStartOfWeek, toStartOfHour, toStartOfFiveMinute... These functions are well optimized, and their performance is comparable to just using a pre-computed value.
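As a minimal sketch of grouping with one of these rounding functions (the table and column names `events`, `event_time` are hypothetical):

```sql
-- Row count per hour, using the rounding function directly as the grouping key.
SELECT
    toStartOfHour(event_time) AS hour,
    count() AS hits
FROM events
GROUP BY hour
ORDER BY hour
```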

The ontime example schema has precomputed columns only because that is the original schema of the table.

Usually it is reasonable to have a separate Date column with the DateTime rounded to a Date, because it compresses much better and is much faster to read. By the way, MergeTree tables require you to have a Date column.

We don't have a "fill" option for GROUP BY, so you need to fill missing points on the client. Currently, we are considering implementing something like GROUP BY ... WITH FILL.

Thanks @alexey-milovidov
It's of great help.

I'm a bit confused: having toStartOf*, how can I group results, for example, every 3 months based on a DateTime column?

You can group by intDiv(toRelativeMonthNum(date), 3).
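Applied to a hypothetical table `metrics(date Date, value Float64)`, a 3-month grouping could look like:

```sql
-- Average `value` per 3-month bucket; the key is the month number
-- (months since a fixed epoch) integer-divided by 3.
SELECT
    intDiv(toRelativeMonthNum(date), 3) AS bucket,
    min(date) AS bucket_start,  -- earliest date seen in the bucket
    avg(value) AS avg_value
FROM metrics
GROUP BY bucket
ORDER BY bucket
```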

@ztlpn thanks! Now it's clear. You can probably close the issue as well.

I think this issue could be reopened, as IMHO it would be handy to have such a grouping function in CH.

Has anyone found a workaround for this, other than computing it on the client side (which is not a very good solution)?

@simPod what exactly do you think is missing from the available grouping by subdivisions?

Consider this example table:
CREATE TABLE default.analog_archives ( d Date DEFAULT CAST(0 AS Date), ts UInt32, ids UInt32, val Float32)..

You can specify any reasonable time division (down to intervals rounded to 5 minutes, unless you are fine with unrounded timestamps retrieved from the table for the subdivision range) and apply aggregation functions to it:

SELECT
    min(ids),
    toStartOfHour(toDateTime(min(ts))) AS date,
    min(val) AS min,
    avg(val) AS avg,
    max(val),
    bar(avg, -10, 100, 40) AS bar
FROM analog_archives
WHERE ids = 118
  AND toDateTime(ts) > '2015-10-28 12:00:00'
  AND toDateTime(ts) < '2015-10-29 10:00:00'
GROUP BY ids, intDiv(toRelativeHourNum(toDateTime(ts)), 4)
ORDER BY ids, min(ts);

+----------+---------------------+-----------+-----------+-----------+--------------------+
| min(ids) | date                | min       | avg       | max(val)  | bar                |
+----------+---------------------+-----------+-----------+-----------+--------------------+
|      118 | 2015-10-28 12:00:00 | -0.141026 | 35.579608 | 54.967899 | ████████████████▎  |
|      118 | 2015-10-28 16:00:00 | -0.823718 | -0.512820 | -0.150641 | ███▋               |
|      118 | 2015-10-28 20:00:00 | -0.955128 |  2.712709 |  8.560900 | ████▎              |
|      118 | 2015-10-29 00:00:00 | -1.000000 |  0.017213 |  8.413460 | ███▋               |
|      118 | 2015-10-29 04:00:00 | -1.032050 | 28.648836 | 43.054501 | █████████████▋     |
|      118 | 2015-10-29 08:00:00 | 30.490400 | 32.948331 | 37.240398 | ███████████████▎   |
+----------+---------------------+-----------+-----------+-----------+--------------------+
6 rows in set (1.11 sec)

By changing intDiv(toRelativeHourNum(toDateTime(ts)), 4) to intDiv(toRelativeHourNum(toDateTime(ts)), 8), or anything else, you get a different time window to which the aggregation functions are applied.

@girgitt doing it this way didn't come to my mind. Thanks.

Currently, for some cases, I'm using the time_bucket function: https://docs.timescale.com/v0.10/api#time_bucket. It would be nice if ClickHouse had something similar to improve query readability.
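A time_bucket-style grouping of arbitrary width can also be expressed with plain integer arithmetic over a raw UInt32 timestamp; this sketch reuses the analog_archives table from the earlier comment, but the chosen bucket width is illustrative:

```sql
-- 5-minute (300-second) buckets over the UInt32 unix timestamp `ts`,
-- roughly equivalent to TimescaleDB's time_bucket('5 minutes', ts).
SELECT
    toDateTime(intDiv(ts, 300) * 300) AS bucket,
    avg(val) AS avg_val
FROM analog_archives
GROUP BY bucket
ORDER BY bucket
```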

> We don't have a "fill" option for GROUP BY, so you need to fill missing points on the client. Currently, we are considering implementing something like GROUP BY ... WITH FILL.

@alexey-milovidov We are trying to create a time-series chart using Highcharts, and for this we are grouping the data every five minutes. As you said, we are currently planning to add the missing entries on the client side. However, is there a timeline or a feature plan for ClickHouse to add the WITH FILL option to GROUP BY? Or is there any other way to accomplish this?

This would be a very helpful feature for time-series charts.
