Pandas: DataFrame.groupby().sum() treating NaN as 0.0

Created on 25 Apr 2018 · 15 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

In [62]: import pandas as pd

In [63]: import numpy as np

In [64]: df = pd.DataFrame(data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]], index=[1, 2, 3], columns=['a', 'b', 'c'])

In [68]: df
Out[68]:
       a  b    c
1  data1  2  NaN
2  data2  3  4.0
3  data3  4  4.0

In [65]: df.groupby(by=['a','b']).sum(skipna=False)
Out[65]:
           c
a     b
data1 2  0.0
data2 3  4.0
data3 4  4.0


Problem description

The NaN value is being treated as 0.0. Is there an option to treat NaN as NaN so that sum() returns NaN?

Expected Output

           c
a     b
data1 2  NaN
data2 3  4.0
data3 4  4.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Groupby Numeric

Most helpful comment

I'm using the latest v1.0.1 but still see this issue. Also the min_count=1 argument seems not to work (for timedeltas at least). Any suggestions on how to keep the NaN in a groupby().sum()?

```
import pandas as pd
from datetime import date, timedelta

data = [
    [date(2020, 2, 1), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(2020, 2, 2), None,                           timedelta(hours=2, minutes=10)],
    [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
]

df = pd.DataFrame(data, columns=['date', 'duration', 'total'])
df.set_index(pd.DatetimeIndex(df['date']), inplace=True)

res = df.groupby(level=0).sum(min_count=1)
display(res)
```

Expected:

date       | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | nan      | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

But getting:

date       | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | 00:00:00 | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

Found a workaround, namely to use

res = df.groupby(level=0).apply(lambda x: x.sum(min_count=1))

instead of

res = df.groupby(level=0).sum(min_count=1)

All 15 comments

I think you want min_count:

In [20]: df.groupby(['a', 'b']).c.sum()
Out[20]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

In [21]: df.groupby(['a', 'b']).c.sum(min_count=1)
Out[21]:
a      b
data1  2    NaN
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

This is a bit surprising:

In [23]: df.groupby(['a', 'b']).c.sum(min_count=1, skipna=False)
Out[23]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

Something strange w/ the skipna keyword there.

Thanks! Did not think of removing skipna=False. skipna behavior should be consistent.
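For comparison, plain Series.sum does honor skipna, which is what makes the groupby behavior surprising. A minimal sketch of the non-groupby behavior:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

# Series.sum honors skipna; the groupby path at the time silently ignored it.
print(s.sum())              # 1.0 (NaN skipped by default)
print(s.sum(skipna=False))  # nan (NaN propagates)
```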

I think there are two intertwined issues:

  1. DataFrameGroupby.sum doesn't accept skipna
  2. DataFrameGroupby.sum doesn't validate its kwargs, and falls back to a secondary method

In [27]: df.groupby(['a', 'b']).c.sum(min_count=1, foo=1)
Out[27]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

so passing skipna forces the fallback, which apparently ignores the kwargs.

String values trigger the fallback too:

In [16]: pd.DataFrame([[1, np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[16]:
     0   1
0  1.0 NaN

In [17]: pd.DataFrame([['a', np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[17]:
   0    1
0  a  0.0

I think the fallback is happening here (https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py):

try:
    return self._cython_agg_general(
        alias, alt=npfunc, **kwargs)
except AssertionError as e:
    raise SpecificationError(str(e))
except Exception:
    result = self.aggregate(
        lambda x: npfunc(x, axis=self.axis))

That except Exception is dangerous.
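A minimal sketch (not the pandas source; `fast_sum`, `fallback_sum`, and `agg` are hypothetical names) of how a broad `except Exception` makes an unexpected keyword silently disable its own effect:

```python
# The fast path validates its kwargs by raising TypeError on unknown ones;
# the broad except then swallows that TypeError and drops the kwargs entirely.
def fast_sum(values, min_count=0):
    vals = [v for v in values if v == v]          # drop NaN (NaN != NaN)
    return float("nan") if len(vals) < min_count else sum(vals)

def fallback_sum(values):
    # Fallback aggregation that knows nothing about the keyword arguments.
    return sum(v for v in values if v == v)

def agg(values, **kwargs):
    try:
        return fast_sum(values, **kwargs)  # TypeError on an unknown kwarg...
    except Exception:                      # ...which is swallowed here,
        return fallback_sum(values)        # so every kwarg is silently dropped

print(agg([float("nan")], min_count=1))                # nan: fast path honors it
print(agg([float("nan")], min_count=1, skipna=False))  # 0: fallback ignored both
```

This mirrors the In [21] vs In [23] results above: adding skipna=False forces the fallback, which then ignores min_count as well.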

Does this issue still need to be resolved? If so, I'd like to look into it.

Yes please.


As was mentioned, the fallback was occurring when df.groupby().sum() was called with the skipna flag, because the _cython_agg_general function did not accept the argument; this has now been fixed by PR #26179. The fallback still occurs with strings in the DataFrame, however, which seems to be a deeper issue stemming from the _aggregate() call in groupby/ops.py (line 572), which is what converts the NaN to a zero.

[Screenshot of a debugging session, 21 Apr 2019, omitted.]

Showing some of my debugging to help anyone who might be able to take this on: I was able to trace the issue to the _aggregate() function.

Still an issue in v1.0.3
There is inconsistent behavior between pandas and numpy. NaN should not be treated as zero.
Also a problem for mean and std as well as sum. For example

df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'), 'col2': (np.NaN, 2, np.NaN, 3)})
df_1
  col1  col2
0    a   NaN
1    a   2.0
2    b   NaN
3    c   3.0
df_2 = df_1.groupby('col1').agg(sum_col2=('col2', 'sum'), mean_col2=('col2', 'mean'))
df_2
      sum_col2  mean_col2
col1                     
a          2.0        2.0
b          0.0        NaN
c          3.0        3.0
np.mean([np.NaN])
nan
np.sum([np.NaN])
nan
np.mean([np.NaN, 2])
nan
np.sum([np.NaN, 2])
nan

Therefore, I would expect df_2 to be

      sum_col2  mean_col2
col1                     
a          NaN        NaN
b          NaN        NaN
c          3.0        3.0

Same unexpected result with

df_3 = df_1.groupby('col1').agg(sum_col2=('col2', np.sum), mean_col2=('col2', np.mean))

Also the min_count=1 suggestion does not solve the problem, for example

df_4 = pd.DataFrame({
    'col1': ('a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'e', 'e'), 
     'col2': (np.NaN, 2, np.NaN, 3, 4, 5, np.NaN, 6, np.NaN, np.NaN)
})
df_5 = df_4.groupby('col1').sum(min_count=1)
df_5
      col2
col1      
a      2.0
b      NaN
c      3.0
d      9.0
e      6.0

whereas I would expect df_5 to be

      col2
col1      
a      NaN
b      NaN
c      3.0
d      NaN
e      NaN
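Note that min_count is the minimum number of valid (non-NA) values a group needs for its result to be non-NA; it does not make NaN propagate, which is why group d (two valid values) still sums to 9.0. A NaN-propagating sum can be sketched with a custom aggregation over the raw values (df_6 is my own illustrative name):

```python
import numpy as np
import pandas as pd

df_4 = pd.DataFrame({
    'col1': ('a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'e', 'e'),
    'col2': (np.nan, 2, np.nan, 3, 4, 5, np.nan, 6, np.nan, np.nan),
})

# np.sum over the underlying ndarray propagates NaN instead of skipping it,
# so any group containing a NaN comes back as NaN.
df_6 = df_4.groupby('col1')['col2'].agg(lambda x: np.sum(x.values))
print(df_6)
```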

Also problems with std, but that seems more confusing.
Should this be split out into separate issues or it is the same underlying problem and can be kept as one?

pd.__version__
'1.0.3'
np.__version__
'1.18.4'

I might try a Pull Request to solve this.
But for now, I was able to bypass the NaN-as-0 problem.

I assign numpy.inf to NaN values in my columns and then execute whatever function (prod,mean,sum) with groupby.

Then, I assign numpy.nan to everything that resulted in numpy.inf.

There's an example I posted in this stackoverflow discussion:

https://stackoverflow.com/questions/62069979/pandas-merge-with-conditionnal-aggregation/62071652#62071652
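A minimal sketch of the inf substitution described above (my own illustration, not the code from the linked answer), using the df_1 data from the earlier comment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
                   'col2': [np.nan, 2, np.nan, 3]})

# Step 1: replace NaN with +inf, which the aggregation cannot skip.
res = df.fillna(np.inf).groupby('col1').sum()

# Step 2: any group whose sum absorbed an inf becomes NaN again.
res = res.replace(np.inf, np.nan)
print(res)
```

This assumes the data contains no legitimate inf values, since those would be turned into NaN as well.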

If anyone else comes across this issue, FWIW I employ the following solution pending a Pandas bug fix:

def np_sum(g):
    return np.sum(g.values)

df = pd.DataFrame(data={'name': ['a', 'a', 'b', 'b', 'c'], 'data': [1, np.nan, 1, 1, np.nan]})
g = df.groupby(['name'])
g.agg(col1=('data', 'sum'), col2=('data', np_sum), col3=('data', np.sum), col4=('data', np.nansum))

name | col1 | col2 | col3 | col4
-- | -- | -- | -- | --
a | 1.0 | NaN | 1.0 | 1.0
b | 2.0 | 2.0 | 2.0 | 2.0
c | 0.0 | NaN | 0.0 | 0.0

The actual improvement to add skipna support to groupby reductions is covered in https://github.com/pandas-dev/pandas/issues/15675, so I think this can be closed or repurposed to a documentation issue, to ensure we have clear documentation about this common tricky case.

