Clickhouse: Treatment of NaNs in aggregate functions

Created on 24 Dec 2019 · 9 Comments · Source: ClickHouse/ClickHouse

This is how ClickHouse (19.16.3, revision 54427) works:

select avg(val), sum(val), count(val)
from (
    select null val
    union all
    select NaN
    union all
    select 1
    union all
    select 2
    union all
    select null val
    union all
    select NaN
    union all
    select 3
    union all
    select 4
)

returns

avg(val)|sum(val)|count(val)|
--------|--------|----------|
     NaN|     NaN|         6|

I know that ClickHouse is not Python, but the pandas library, heavily used in machine learning, works like the following:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'val': [None, np.nan, 1, 2, None, np.nan, 3, 4]})
>>> df
   val
0  NaN
1  NaN
2  1.0
3  2.0
4  NaN
5  NaN
6  3.0
7  4.0
>>> df.val.agg(['mean', 'sum', 'count'])
mean      2.5
sum      10.0
count     4.0
Name: val, dtype: float64
>>>

In ClickHouse, if you have just a single NaN value, you cannot calculate avg, sum, etc. anymore.
I believe ClickHouse shouldn't work like that, or at least there should be an option for whether to treat NaN values as NULLs in aggregate functions.

For now, if I am, for example, computing sum(x/y) and a single y is zero, the sum is NaN.
So, if I want ClickHouse to behave the same way as Python, I have to fight with NaNs.
Currently, I am doing something like this:

if(isNaN(expression), null, expression)

which makes everything hard to read.
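For illustration, here is roughly what the opening query looks like with this workaround applied (just a sketch; it relies on aggregate functions skipping NULLs). It should then give the pandas-like result of avg = 2.5, sum = 10 and count = 4:

select avg(v), sum(v), count(v)
from (
    select if(isNaN(val), null, val) as v
    from (
        select null val
        union all select NaN
        union all select 1
        union all select 2
        union all select null
        union all select NaN
        union all select 3
        union all select 4
    )
)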

Is the current ClickHouse behavior expected?
What is your suggestion for how to overcome this limitation?
What do you think about Python's way of handling NULLs/NaNs?
Will the NaN behavior of aggregate functions change in the future?

As far as I can see, this is not the only NaN issue (https://github.com/ClickHouse/ClickHouse/issues/6468#issue-480217238).

st-discussion


All 9 comments

We cannot just skip NaNs for all aggregate functions. NaNs make sense for any, argMax, groupArray, uniq.

Skipping NaNs just for some aggregate functions looks like unnecessary complexity.
Note that the SQL standard doesn't have NaNs (or any other denormals) at all.
MySQL returns NULL for 0 / 0 and PostgreSQL throws an exception.
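For comparison, this is roughly what ClickHouse itself does with division by zero: floating-point division never throws, it produces nan or inf, while integer division does throw (a quick sketch; the exact error text may vary by version):

SELECT 0 / 0 AS zero_by_zero, 1 / 0 AS one_by_zero
-- returns nan and inf

SELECT intDiv(1, 0)
-- throws an exception: Division by zero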

I have extended the select:
select avg(val), sum(val), count(val), any(val), groupArray(val), uniq(val), uniqExact(val), max(val), min(val)

This is what I get:

β”Œβ”€avg(val)─┬─sum(val)─┬─count(val)─┬─any(val)─┬─groupArray(val)───┬─uniq(val)─┬─uniqExact(val)─┬─max(val)─┬─min(val)─┐
β”‚      nan β”‚      nan β”‚          6 β”‚      nan β”‚ [nan,nan,4,1,3,2] β”‚         5 β”‚              6 β”‚      nan β”‚      nan β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I don't want to argue here, but from my point of view as a database user, because of the way NaNs are treated, I cannot use aggregate functions anymore.

In order to use them, I have to test for NaN, and if I find one, change it to NULL.
In this case, a new function that converts NaN to NULL could help.
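Something along these lines would cover it; the name nanToNull below is purely hypothetical, and CREATE FUNCTION (SQL user-defined functions) only exists in ClickHouse versions much newer than the one discussed here:

-- hypothetical helper; requires SQL UDF support from much newer ClickHouse releases
CREATE FUNCTION nanToNull AS (x) -> if(isNaN(x), NULL, x);

SELECT avg(nanToNull(val)), sum(nanToNull(val)), count(nanToNull(val))
FROM ...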

Anyway, thanks for the clarification, Alexey.

Best regards.

P.S. The NaN behavior should be thoroughly documented.

Just to add how it looks in practice.
I have to replace this

sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h))

with

if(isNaN(sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h))), null, sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h)))

in order to get the NaN behavior as it is in Python.

@ghuname You can also use -If aggregate function combinator (https://clickhouse.yandex/docs/en/query_language/agg_functions/combinators/) to ignore NaN values.

Here is a simple example:

WITH arrayJoin([1, 2, 3, nan, 4, 5]) AS data
SELECT
    avg(data),
    avgIf(data, isFinite(data))

β”Œβ”€avg(data)─┬─avgIf(data, isFinite(data))─┐
β”‚       nan β”‚                           3 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1 rows in set. Elapsed: 0.003 sec.

In my case, all the aggregated values can be finite, but if (sum(sum_weight) * sum(di * workingSec24h)) = 0, the result becomes NaN, so I have to check either whether the whole expression is NaN or whether (sum(sum_weight) * sum(di * workingSec24h)) is zero.

Anyway, thanks for the avgIf suggestion.
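For this particular case, another possible sketch is to turn a zero denominator into NULL with nullIf, so the division itself yields NULL instead of NaN when the denominator is zero (assuming the usual NULL propagation through arithmetic):

SELECT
    sum(weight * di * workingSec24h * errorsUpXday)
        / nullIf(sum(sum_weight) * sum(di * workingSec24h), 0) AS ratio
FROM ...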

@ghuname In any case, you can use the WITH clause (https://clickhouse.yandex/docs/en/query_language/select/#with-clause) to reduce code duplication. It would be something like this:

WITH
  sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h)) AS calculation
SELECT
  sumIf(calculation, isFinite(calculation)),
  avgIf(calculation, isFinite(calculation)),
  countIf(calculation, isFinite(calculation))
FROM ...

Yes, this is much better, but I have found another option:

select nullIf(sum(weight * di * workingSec24h * errorsUpXday) / (sum(sum_weight) * sum(di * workingSec24h)), NaN)

I just hope that the ClickHouse developers will not change nullIf(NaN, NaN) = NULL in the future.

...but I have just found a new "feature". Please look at this:

SELECT
    nullIf(nan, nan),
    0 / 0,
    nullIf(0 / 0, nan),
    nullIf(nan, 0 / 0)

β”Œβ”€nullIf(nan, nan)─┬─divide(0, 0)─┬─nullIf(divide(0, 0), nan)─┬─nullIf(nan, divide(0, 0))─┐
β”‚             ᴺᡁᴸᴸ β”‚          nan β”‚                       nan β”‚                       nan β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why aren't all of the nullIf expressions NULL?
Is this a bug, or is there an explanation?

@vpanfilov
After clarification in https://github.com/ClickHouse/ClickHouse/issues/8437, the best solution is as you have suggested in https://github.com/ClickHouse/ClickHouse/issues/8369#issuecomment-569211904.
Thanks and best regards.
