In [62]: import pandas as pd
In [63]: import numpy as np
In [64]: df = pd.DataFrame(data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]], index=[1, 2, 3], columns=['a', 'b', 'c'])
In [68]: df
Out[68]:
a b c
1 data1 2 NaN
2 data2 3 4.0
3 data3 4 4.0
In [65]: df.groupby(by=['a','b']).sum(skipna=False)
Out[65]:
c
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
The NaN value is being treated as 0.0. Is there an option to treat NaN as NaN, so that sum() returns NaN? Expected output:
c
a b
data1 2 NaN
data2 3 4.0
data3 4 4.0
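For what it's worth, `Series.sum` itself does honour `skipna`, so routing each group through `apply` (a sketch of the desired behavior, not the built-in fast path) produces the NaN-propagating result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]],
    index=[1, 2, 3], columns=['a', 'b', 'c'],
)

# Series.sum respects skipna=False, so a group containing a NaN sums to NaN.
out = df.groupby(['a', 'b'])['c'].apply(lambda s: s.sum(skipna=False))
print(out)
```

This is much slower than the Cython groupby path, but it gives the semantics asked for above.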
pd.show_versions()
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
I think you want `min_count`:
In [20]: df.groupby(['a', 'b']).c.sum()
Out[20]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
In [21]: df.groupby(['a', 'b']).c.sum(min_count=1)
Out[21]:
a b
data1 2 NaN
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
This is a bit surprising, though:
In [23]: df.groupby(['a', 'b']).c.sum(min_count=1, skipna=False)
Out[23]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
Something strange is going on with the `skipna` keyword there.
Thanks! I did not think of removing `skipna=False`. The `skipna` behavior should be consistent, though.
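To clarify what `min_count` does (a minimal illustration on a plain Series, independent of groupby): the sum of zero valid values is defined as 0.0 unless at least `min_count` non-NaN values are present:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan])
print(s.sum())             # 0.0: the empty sum, after the NaN is dropped
print(s.sum(min_count=1))  # nan: fewer than 1 non-NaN value present
```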
I think there are two intertwined issues with `skipna`:
In [27]: df.groupby(['a', 'b']).c.sum(min_count=1, foo=1)
Out[27]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
So passing `skipna` (or any unknown keyword, like `foo` above) forces the fallback, which apparently ignores the kwargs.
See https://github.com/pandas-dev/pandas/issues/15675 for the `skipna` part.
String values trigger the fallback too:
In [16]: pd.DataFrame([[1, np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[16]:
0 1
0 1.0 NaN
In [17]: pd.DataFrame([['a', np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[17]:
0 1
0 a 0.0
I think the fallback is happening here (https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py):

```
try:
    return self._cython_agg_general(
        alias, alt=npfunc, **kwargs)
except AssertionError as e:
    raise SpecificationError(str(e))
except Exception:
    result = self.aggregate(
        lambda x: npfunc(x, axis=self.axis))
```
That `except Exception` is dangerous.
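A toy sketch (not the pandas code itself, just the hazard pattern) of why a blanket `except Exception` is risky: any error in the fast path, including a `TypeError` from an unexpected keyword, is swallowed, and the fallback silently ignores the kwargs the caller passed:

```python
def fast_sum(values, **kwargs):
    # Stand-in for the Cython fast path, which rejects unknown kwargs.
    if kwargs:
        raise TypeError(f"unexpected kwargs: {kwargs}")
    return sum(values)

def agg(values, **kwargs):
    try:
        return fast_sum(values, **kwargs)
    except Exception:
        # Fallback drops the kwargs on the floor: the caller never
        # learns that skipna=False (or even foo=1) was ignored.
        return sum(values)

print(agg([1, 2], skipna=False))  # 3 -- the kwarg is silently discarded
```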
Does this issue still need to be resolved. If so I'd like to look into this.
Yes please.
As was mentioned, the fallback was occurring when df.groupby().sum() was called with the skipna flag. This happened because the _cython_agg_general function did not accept the argument, which has now been fixed by PR #26179. The fallback still occurs with strings in the DataFrame, however; that seems to be a deeper issue stemming from the _aggregate() call in groupby/ops.py (line 572), which is what converts the NaN to a zero.
Sharing some of my debugging to help anyone who wants to take this on: I was able to trace the issue to the _aggregate() function.
I'm using the latest v1.0.1 but still see this issue. Also, the `min_count=1` argument seems not to work (for timedeltas at least). Any suggestions on how to keep the NaN in a `groupby().sum()`?
```
import pandas as pd
from datetime import date, timedelta

data = [
    [date(2020, 2, 1), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(2020, 2, 2), None, timedelta(hours=2, minutes=10)],
    [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
]
df = pd.DataFrame(data, columns=['date', 'duration', 'total'])
df.set_index(pd.DatetimeIndex(df['date']), inplace=True)
res = df.groupby(level=0).sum(min_count=1)
display(res)
```
Expected:

date | duration | total
-- | -- | --
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | NaN | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

But getting:

date | duration | total
-- | -- | --
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | 00:00:00 | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00
Found a workaround, namely to use `res = df.groupby(level=0).apply(lambda x: x.sum(min_count=1))` instead of `res = df.groupby(level=0).sum(min_count=1)`.
Still an issue in v1.0.3
There is inconsistent behavior between pandas and numpy: NaN should not be treated as zero. This is also a problem for mean and std, as well as sum. For example:
```
>>> df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'), 'col2': (np.NaN, 2, np.NaN, 3)})
>>> df_1
  col1  col2
0    a   NaN
1    a   2.0
2    b   NaN
3    c   3.0
>>> df_2 = df_1.groupby('col1').agg(sum_col2=('col2', 'sum'), mean_col2=('col2', 'mean'))
>>> df_2
      sum_col2  mean_col2
col1
a          2.0        2.0
b          0.0        NaN
c          3.0        3.0
>>> np.mean([np.NaN])
nan
>>> np.sum([np.NaN])
nan
>>> np.mean([np.NaN, 2])
nan
>>> np.sum([np.NaN, 2])
nan
```
Therefore, I would expect df_2 to be:

```
      sum_col2  mean_col2
col1
a          NaN        NaN
b          NaN        NaN
c          3.0        3.0
```
The same unexpected result occurs with `df_3 = df_1.groupby('col1').agg(sum_col2=('col2', np.sum), mean_col2=('col2', np.mean))`.
Also, the `min_count=1` suggestion does not solve the problem. For example:

```
>>> df_4 = pd.DataFrame({
...     'col1': ('a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'e', 'e'),
...     'col2': (np.NaN, 2, np.NaN, 3, 4, 5, np.NaN, 6, np.NaN, np.NaN)
... })
>>> df_5 = df_4.groupby('col1').sum(min_count=1)
>>> df_5
      col2
col1
a      2.0
b      NaN
c      3.0
d      9.0
e      6.0
```
whereas I would expect df_5 to be:

```
      col2
col1
a      NaN
b      NaN
c      3.0
d      NaN
e      NaN
```
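That result is actually consistent with how `min_count` is defined: it requires at least `min_count` *valid* (non-NaN) values per group, so a group like `(NaN, 2)` still has one valid value and sums to 2.0. A minimal illustration on a plain Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0])   # analogous to group 'a' above
print(s.sum(min_count=1))  # 2.0: one valid value satisfies min_count=1
print(s.sum(min_count=2))  # nan: only one valid value is present
```

So `min_count` only NaNs out groups that are *entirely* missing; it does not propagate NaN through partially-missing groups the way `skipna=False` would.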
There are also problems with std, but those seem more confusing.
Should this be split out into separate issues, or is it the same underlying problem and can be kept as one?
```
>>> pd.__version__
'1.0.3'
>>> np.__version__
'1.18.4'
```
I might try a pull request to solve this.
For now, I was able to bypass the NaN-as-0 problem: I assign numpy.inf to the NaN values in my columns, execute whatever function (prod, mean, sum) with groupby, and then assign numpy.nan to everything that resulted in numpy.inf.
There's an example I posted in a Stack Overflow discussion.
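A sketch of that workaround (assuming plain float columns; inf arithmetic keeps the marker alive through sum and prod, though it can misbehave for std):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b', 'b'],
                   'data': [1.0, np.nan, 1.0, 1.0]})

# Mark NaNs with inf so they survive the groupby reduction...
marked = df.fillna(np.inf)
res = marked.groupby('name').sum()
# ...then turn any inf result back into NaN.
res = res.replace([np.inf, -np.inf], np.nan)
print(res)
```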
If anyone else comes across this issue, FWIW I employ the following solution pending a pandas bug fix:

```
def np_sum(g):
    return np.sum(g.values)

df = pd.DataFrame(data={'name': ['a', 'a', 'b', 'b', 'c'], 'data': [1, np.nan, 1, 1, np.nan]})
g = df.groupby(['name'])
g.agg(col1=('data', 'sum'), col2=('data', np_sum), col3=('data', np.sum), col4=('data', np.nansum))
```

name | col1 | col2 | col3 | col4
-- | -- | -- | -- | --
a | 1 | NaN | 1 | 1
b | 2 | 2 | 2 | 2
c | 0 | NaN | 0 | 0
The actual improvement of adding `skipna` support to groupby reductions is covered in https://github.com/pandas-dev/pandas/issues/15675, so I think this can be closed or repurposed as a documentation issue, to ensure we have clear documentation about this common tricky case.