This is how ClickHouse (19.16.3, revision 54427) works:
select avg(val), sum(val), count(val)
from (
    select null as val
    union all
    select nan
    union all
    select 1
    union all
    select 2
    union all
    select null
    union all
    select nan
    union all
    select 3
    union all
    select 4
)
returns
avg(val)|sum(val)|count(val)|
--------|--------|----------|
NaN| NaN| 6|
I know that ClickHouse is not Python, but Python's pandas library, which is heavily used in machine learning, works like the following:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'val': [None, np.nan, 1, 2, None, np.nan, 3, 4]})
>>> df
val
0 NaN
1 NaN
2 1.0
3 2.0
4 NaN
5 NaN
6 3.0
7 4.0
>>> df.val.agg(['mean', 'sum', 'count'])
mean 2.5
sum 10.0
count 4.0
Name: val, dtype: float64
>>>
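For reference, pandas' default skipna=True aggregation amounts to dropping both None and NaN before computing. A minimal pure-Python sketch of the same logic (variable names are my own, not pandas internals):

```python
import math

vals = [None, math.nan, 1, 2, None, math.nan, 3, 4]

# pandas treats None and NaN alike and skips both by default (skipna=True).
finite = [v for v in vals if v is not None and not math.isnan(v)]

mean = sum(finite) / len(finite)
total = sum(finite)
count = len(finite)

print(mean, total, count)  # 2.5 10 4
```

This reproduces the mean/sum/count triple shown by `df.val.agg(['mean', 'sum', 'count'])` above.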
In ClickHouse, if you have even a single NaN value, you cannot calculate avg, sum, etc. anymore.
I believe that ClickHouse shouldn't work like that, or at least that there should be an option for whether aggregate functions treat NaN values as NULLs.
For now, if I am, for example, computing sum(x/y) and a single y is zero, the whole sum is NaN.
So, if I want ClickHouse to behave the same as Python, I have to fight with NaNs.
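This poisoning is standard IEEE 754 arithmetic rather than anything ClickHouse-specific; plain Python floats behave the same way once a NaN enters the mix:

```python
import math

vals = [1.0, 2.0, math.nan, 3.0, 4.0]

# Any arithmetic involving NaN yields NaN, so one bad term poisons
# both the sum and the mean of the whole list.
assert math.isnan(sum(vals))
assert math.isnan(sum(vals) / len(vals))
```

The difference is only that pandas filters NaNs out before aggregating, while ClickHouse aggregates over the raw values.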
Currently, I am doing something like this:
if(isNaN(expression), null, expression)
which makes everything hard to read.
Is the current ClickHouse behavior expected?
What is your suggestion for how to overcome this limitation?
What do you think about Python's way of handling NULLs/NaNs?
Will the NaN behavior of aggregate functions change in the future?
As far as I can see, this is not the only NaN issue (https://github.com/ClickHouse/ClickHouse/issues/6468#issue-480217238).
We cannot just skip NaNs for all aggregate functions. NaNs make sense for any, argMax, groupArray, uniq.
Skipping NaNs for only some aggregate functions looks like unnecessary complexity.
Note that the SQL standard doesn't have NaNs (or any other denormals) at all.
MySQL returns NULL for 0 / 0 and PostgreSQL throws an exception.
I have extended the select:
select avg(val), sum(val), count(val), any(val), groupArray(val), uniq(val), uniqExact(val), max(val), min(val)
This is what I get:
┌─avg(val)─┬─sum(val)─┬─count(val)─┬─any(val)─┬─groupArray(val)───┬─uniq(val)─┬─uniqExact(val)─┬─max(val)─┬─min(val)─┐
│      nan │      nan │          6 │      nan │ [nan,nan,4,1,3,2] │         5 │              6 │      nan │      nan │
└──────────┴──────────┴────────────┴──────────┴───────────────────┴───────────┴────────────────┴──────────┴──────────┘
I don't want to argue here, but from my point of view as a database user, because of the way NaNs are treated, I cannot use aggregate functions anymore.
In order to use them, I have to test for NaN and, if I find one, change it to NULL.
In this case, a new function that converts NaN to NULL could help.
Anyway, thanks for the clarification, Alexey.
Best regards.
PS: The NaN behavior should be thoroughly documented.
Just to add how it looks in practice.
I have to replace this
sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h))
with
if(isNaN(sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h))), null, sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h)))
in order to get NaN functionality as it is in python.
@ghuname You can also use -If aggregate function combinator (https://clickhouse.yandex/docs/en/query_language/agg_functions/combinators/) to ignore NaN values.
Here is a simple example:
WITH arrayJoin([1, 2, 3, nan, 4, 5]) AS data
SELECT
avg(data),
avgIf(data, isFinite(data))
┌─avg(data)─┬─avgIf(data, isFinite(data))─┐
│       nan │                           3 │
└───────────┴─────────────────────────────┘
1 rows in set. Elapsed: 0.003 sec.
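For comparison, the -If combinator behaves like filtering the rows before aggregating. A rough pure-Python analogue of the query above (my own sketch, not ClickHouse internals):

```python
import math

data = [1, 2, 3, math.nan, 4, 5]

# avg(data): the NaN propagates, as in the first column of the result.
assert math.isnan(sum(data) / len(data))

# avgIf(data, isFinite(data)): aggregate only the rows passing the condition.
finite = [x for x in data if math.isfinite(x)]
assert sum(finite) / len(finite) == 3.0
```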
In my case, all the aggregates are over finite data, but if (sum(sum_weight) * sum(di * workingSec24h)) = 0, the result becomes NaN, so I have to check whether the whole expression is NaN, or check whether (sum(sum_weight) * sum(di * workingSec24h)) is zero.
Anyway, thanks for the avgIf suggestion.
@ghuname in any case you can use WITH clause (https://clickhouse.yandex/docs/en/query_language/select/#with-clause) to reduce code duplication. It will be something like this:
WITH
sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h)) AS calculation
SELECT
sumIf(calculation, isFinite(calculation)),
avgIf(calculation, isFinite(calculation)),
countIf(calculation, isFinite(calculation))
FROM ...
Yes, this is much better, but I have found another option:
select nullIf(sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h)), NaN)
I just hope that the ClickHouse developers will not change nullIf(NaN, NaN) = NULL in the future.
...but I have just found a new "feature". Please look at this:
nullIf(nan, nan),
0 / 0,
nullIf(0 / 0, nan),
nullIf(nan, 0 / 0)
┌─nullIf(nan, nan)─┬─divide(0, 0)─┬─nullIf(divide(0, 0), nan)─┬─nullIf(nan, divide(0, 0))─┐
│             ᴺᵁᴸᴸ │          nan │                       nan │                       nan │
└──────────────────┴──────────────┴───────────────────────────┴───────────────────────────┘
How come not all of the nullIf expressions are NULL?
Is this a bug, or is there an explanation?
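For what it's worth, IEEE 754 specifies that NaN compares unequal to everything, including itself, so an equality-based check (which is presumably what nullIf performs) cannot match a NaN argument computed at runtime; if anything, the surprise is that the literal nullIf(nan, nan) returns NULL at all. Plain Python demonstrates the comparison rule:

```python
import math

nan = float("nan")

# IEEE 754: NaN is unequal to everything, including itself,
# so an equality test between two NaNs is always False.
assert nan != nan
assert not (nan == nan)
assert math.isnan(nan)
```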
@vpanfilov
After clarification in https://github.com/ClickHouse/ClickHouse/issues/8437, the best solution is as you have suggested in https://github.com/ClickHouse/ClickHouse/issues/8369#issuecomment-569211904.
Thanks and best regards.