We have been very encouraged by Clickhouse. However, as we are trying to port all of our existing scripts to Clickhouse, we are running into few roadblocks. For example: CUMULATIVE SUM or RUNNING TOTAL. We are trying to find an equivalent of Window Functions e.g. SUM(SALES) OVER (PARTITION BY PRODUCT ORDER BY SALES)
Is there a way to get Cumulative Sum or Running Total. Any inputs or guidance is much appreciated. Thanks!
Window functions are not supported currently, but there are chances that they will appear in the future, in some cases you can find workarounds. Possible directions: parametric aggregate functions, higher-order functions, subselects, and LIMIT BY statements.
For running total, there is a related function
https://clickhouse.yandex/docs/en/functions/other_functions.html#runningdifference-x
Latest release (v1.1.54310-stable) add support for a function runningIncome (not documented yet)
which should work just as runningDifference , but should calculate sum instead of difference. Check that commit: https://github.com/yandex/ClickHouse/pull/1427
With the following versions:
ClickHouse client version 1.1.54310
Connected to ClickHouse server version 1.1.54310.
It seems like both runningIncome and runningDifference are implementing difference operation.
The implementation implicates the same.
:) select number, runningDifference(number), runningIncome(number) from system.numbers limit 10;
SELECT
number,
runningDifference(number),
runningIncome(number)
FROM system.numbers
LIMIT 10
ββnumberββ¬βrunningDifference(number)ββ¬βrunningIncome(number)ββ
β 0 β 0 β 0 β
β 1 β 1 β 1 β
β 2 β 1 β 1 β
β 3 β 1 β 1 β
β 4 β 1 β 1 β
β 5 β 1 β 1 β
β 6 β 1 β 1 β
β 7 β 1 β 1 β
β 8 β 1 β 1 β
β 9 β 1 β 1 β
ββββββββββ΄ββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββ
10 rows in set. Elapsed: 0.009 sec.
cc @filimonov
@kszucs You are right. runningIncome() is a pretty misleading name for this function, so in the next release it will be renamed to runningDifferenceStartingWithFirstValue(). For the first row it outputs the input value instead of 0:
:) select number + 123 as x, runningDifference(x), runningDifferenceStartingWithFirstValue(x) from system.numbers limit 5;
SELECT
number + 123 AS x,
runningDifference(x),
runningDifferenceStartingWithFirstValue(x)
FROM system.numbers
LIMIT 5
ββββxββ¬βrunningDifference(plus(number, 123))ββ¬βrunningDifferenceStartingWithFirstValue(plus(number, 123))ββ
β 123 β 0 β 123 β
β 124 β 1 β 1 β
β 125 β 1 β 1 β
β 126 β 1 β 1 β
β 127 β 1 β 1 β
βββββββ΄βββββββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
5 rows in set. Elapsed: 0.001 sec.
You can use the runningAccumulate() function that cumulatively merges aggregate function states to calculate running totals. Here is an example (calculating cumulative count of rows by date in a table):
:) SELECT date FROM t
SELECT date
FROM t
ββββββββdateββ
β 2018-01-08 β
β 2018-01-08 β
β 2018-01-09 β
β 2018-01-09 β
β 2018-01-09 β
β 2018-01-10 β
ββββββββββββββ
6 rows in set. Elapsed: 0.001 sec.
:) SELECT date, finalizeAggregation(state), runningAccumulate(state) FROM (SELECT date, countState() AS state FROM t GROUP BY date ORDER BY date)
SELECT
date,
finalizeAggregation(state),
runningAccumulate(state)
FROM
(
SELECT
date,
countState() AS state
FROM t
GROUP BY date
ORDER BY date ASC
)
ββββββββdateββ¬βfinalizeAggregation(state)ββ¬βrunningAccumulate(state)ββ
β 2018-01-08 β 2 β 2 β
β 2018-01-09 β 3 β 5 β
β 2018-01-10 β 1 β 6 β
ββββββββββββββ΄βββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββ
3 rows in set. Elapsed: 0.002 sec.
Hey @ztlpn! Thanks for the clarification!
Currently I'm implementing ClickHouse backend for https://github.com/ibis-project/ibis which is a match made in heaven except for the missing window functionality in CH which I'd really like to use.
Any chance for a preliminary implementation for that in the near future?
Meanwhile could You suggest an alternative way (without a join) to compute e.g. zscore?
With window support It would like something like this:
select column - avg(column) over (partition by key) from t
AFAIK currently this requires a join:
select column - avg_column
from t left join (
select key, avg(column) as avg_column
from t
group by key
) _
using (key)
Cool! Of course window functions support is a must-have feature but it is a pretty big task and currently there is no definite timeline. You can get pretty far with arrays and ARRAY JOINs though. Your query becomes
select arrayJoin(values) - avg_value from (
select avg(value) as avg_value, groupArray(value) as values
from t group by key)
@ztlpn is it somehow possible to use runningDifference in combination with a group by? For example to calculate some lag-value for each unit.
I am also looking for a solution to this very problem: I would like to get running accumulates grouped by a certain key.
Any ideas?
My experience is xxxState (sumState, countState) will only work across 1-dimension, if you group using more than 1 columns, then the runningAccumulate will get messed up (not in sequence).
Just as a followup: I have seen in recent changelogs, that there is the arrayCumSum method, which can be used for getting at least the accumulate values. (Edit: You of course may use any other array operation, the following describes sort of a grouping workaround)
I am not sure whether this performs well enough on large scale, but as for testing, the following query seems to work:
user_key,
theday,
theevents,
thecumsum
from (
select
user_key,
days,
events,
arrayCumSum(events) as cumsum -- and calculate the cumulative values
from (
select -- put this into arrays
user_key,
groupArray(day) as days,
groupArray(events) as events
from (
select -- selecting counts on a daily basis for each user
user_key,
toDate(date_key) as day,
count(*) as events
from
fact_interaction
group by
user_key,
day
order by
day
)
group by
user_key
)
)
array Join -- this is the 'tricky' part transposing the arrays back into rows.
cumsum AS thecumsum,
events AS theevents,
days as theday
Wow! thanks for the heads-up! I will try this out.
My experience is xxxState (sumState, countState) will only work across 1-dimension, if you group using more than 1 columns, then the runningAccumulate will get messed up (not in sequence).
I have noticed this too.
Is there any way to make runningAccumulate work with more than 1 group by column ?
Is there a plan on when the window function will be part of a stable release tentatively?
Another way to make cumulative sum:
number_1,
arraySum(arraySlice(numbers, 1, i)) AS cum_sum
FROM
(
SELECT groupArray(number) AS numbers
FROM
(
SELECT number
FROM system.numbers
LIMIT 10
)
)
ARRAY JOIN
numbers AS number_1,
arrayEnumerate(numbers) AS i
ββnumber_1ββ¬βcum_sumββ
β 0 β 0 β
β 1 β 1 β
β 2 β 3 β
β 3 β 6 β
β 4 β 10 β
β 5 β 15 β
β 6 β 21 β
β 7 β 28 β
β 8 β 36 β
β 9 β 45 β
ββββββββββββ΄ββββββββββ
This is a pretty long issue. Let me post my findings and how we made this work.
For us the problem turned out to be the PARTITION BY toYYYYMM(time) which we of course found out only after being in production for 1 month :) We'd get breaks in runningDifference on the month boundary.
After many experiments it turns out that sorting by non-partition key first pushes CH into providing us with a single set of data. Providing a rather complex example with a group by on a key:
select bucket, sum(aggValue) as aggValue from (
select
key,
toDateTime(
toStartOfInterval(
time,
INTERVAL 9999999999 SECOND
)
) as bucket,
sumIf(runningDifference(value) as diff, diff > 0) as aggValue
from (select key, time, value from table
where key in (....)
order by key, time) t
group by key, bucket
)
group by bucket
order by bucket
If someone can spot any limitation here, we'd be glad to hear your feedback. And I hope this saves someone else a day or two.
those functions are hard to understand, i prefer the standard window functions (https://github.com/ClickHouse/ClickHouse/issues/5132).
as i know, both actian vector and sqlite realized them.
when the window function will be supported in Clickhouse?
I hope to see the function early!
when the window function will be supported in Clickhouse?
I hope to see the function early!
Most helpful comment
Just as a followup: I have seen in recent changelogs, that there is the
arrayCumSummethod, which can be used for getting at least the accumulate values. (Edit: You of course may use any other array operation, the following describes sort of a grouping workaround)I am not sure whether this performs well enough on large scale, but as for testing, the following query seems to work: