In [62]: import pandas as pd
In [63]: import numpy as np
In [64]: df = pd.DataFrame(data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]], index=[1, 2, 3], columns=['a', 'b', 'c'])
In [68]: df
Out[68]:
a b c
1 data1 2 NaN
2 data2 3 4.0
3 data3 4 4.0
In [65]: df.groupby(by=['a','b']).sum(skipna=False)
Out[65]:
c
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
The NaN value is being treated as 0.0. Is there an option to treat NaN as NaN, so that sum() returns NaN? Expected output:
c
a b
data1 2 NaN
data2 3 4.0
data3 4 4.0
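For what it's worth, `Series.sum` itself does honour `skipna`, so routing each group through `apply` (a sketch of the desired behavior, not the built-in fast path) produces the NaN-propagating result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]],
    index=[1, 2, 3], columns=['a', 'b', 'c'],
)

# Series.sum respects skipna=False, so a group containing a NaN sums to NaN.
out = df.groupby(['a', 'b'])['c'].apply(lambda s: s.sum(skipna=False))
print(out)
```

This is much slower than the Cython groupby path, but it gives the semantics asked for above.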
pd.show_versions()
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
I think you want `min_count`:
In [20]: df.groupby(['a', 'b']).c.sum()
Out[20]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
In [21]: df.groupby(['a', 'b']).c.sum(min_count=1)
Out[21]:
a b
data1 2 NaN
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
This is a bit surprising, though:
In [23]: df.groupby(['a', 'b']).c.sum(min_count=1, skipna=False)
Out[23]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
Something strange is going on with the `skipna` keyword there.
Thanks! I did not think of removing `skipna=False`. The `skipna` behavior should be consistent, though.
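To clarify what `min_count` does (a minimal illustration on a plain Series, independent of groupby): the sum of zero valid values is defined as 0.0 unless at least `min_count` non-NaN values are present:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan])
print(s.sum())             # 0.0: the empty sum, after the NaN is dropped
print(s.sum(min_count=1))  # nan: fewer than 1 non-NaN value present
```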
I think there are two intertwined issues with `skipna`:
In [27]: df.groupby(['a', 'b']).c.sum(min_count=1, foo=1)
Out[27]:
a b
data1 2 0.0
data2 3 4.0
data3 4 4.0
Name: c, dtype: float64
So passing `skipna` (or any unknown keyword, like `foo` above) forces the fallback, which apparently ignores the kwargs.
See https://github.com/pandas-dev/pandas/issues/15675 for the `skipna` part.
String values trigger the fallback too:
In [16]: pd.DataFrame([[1, np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[16]:
0 1
0 1.0 NaN
In [17]: pd.DataFrame([['a', np.nan]]).groupby(lambda x: x, axis='columns').sum(min_count=1)
Out[17]:
0 1
0 a 0.0
I think the fallback is happening here (https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py):

```
try:
    return self._cython_agg_general(
        alias, alt=npfunc, **kwargs)
except AssertionError as e:
    raise SpecificationError(str(e))
except Exception:
    result = self.aggregate(
        lambda x: npfunc(x, axis=self.axis))
```
That `except Exception` is dangerous.
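A toy sketch (not the pandas code itself, just the hazard pattern) of why a blanket `except Exception` is risky: any error in the fast path, including a `TypeError` from an unexpected keyword, is swallowed, and the fallback silently ignores the kwargs the caller passed:

```python
def fast_sum(values, **kwargs):
    # Stand-in for the Cython fast path, which rejects unknown kwargs.
    if kwargs:
        raise TypeError(f"unexpected kwargs: {kwargs}")
    return sum(values)

def agg(values, **kwargs):
    try:
        return fast_sum(values, **kwargs)
    except Exception:
        # Fallback drops the kwargs on the floor: the caller never
        # learns that skipna=False (or even foo=1) was ignored.
        return sum(values)

print(agg([1, 2], skipna=False))  # 3 -- the kwarg is silently discarded
```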
Does this issue still need to be resolved. If so I'd like to look into this.
Yes please.
As was mentioned, the fallback was occurring when df.groupby().sum() was called with the skipna flag. This happened because the _cython_agg_general function did not accept the argument, which has now been fixed by PR #26179. The fallback still occurs with strings in the DataFrame, however; that seems to be a deeper issue stemming from the _aggregate() call in groupby/ops.py (line 572), which is what converts the NaN to a zero.
Sharing some of my debugging to help anyone who wants to take this on: I was able to trace the issue to the _aggregate() function.
I'm using the latest v1.0.1 but still see this issue. Also, the `min_count=1` argument seems not to work (for timedeltas at least). Any suggestions on how to keep the NaN in a `groupby().sum()`?
```
import pandas as pd
from datetime import date, timedelta

data = [
    [date(2020, 2, 1), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(2020, 2, 2), None, timedelta(hours=2, minutes=10)],
    [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(2020, 2, 3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
]
df = pd.DataFrame(data, columns=['date', 'duration', 'total'])
df.set_index(pd.DatetimeIndex(df['date']), inplace=True)
res = df.groupby(level=0).sum(min_count=1)
display(res)
```
Expected:

date | duration | total
-- | -- | --
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | NaN | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

But getting:

date | duration | total
-- | -- | --
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | 00:00:00 | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00
Found a workaround, namely to use `res = df.groupby(level=0).apply(lambda x: x.sum(min_count=1))` instead of `res = df.groupby(level=0).sum(min_count=1)`.
Still an issue in v1.0.3
There is inconsistent behavior between pandas and numpy: NaN should not be treated as zero. This is also a problem for mean and std, as well as sum. For example:
```
>>> df_1 = pd.DataFrame({'col1': ('a', 'a', 'b', 'c'), 'col2': (np.NaN, 2, np.NaN, 3)})
>>> df_1
  col1  col2
0    a   NaN
1    a   2.0
2    b   NaN
3    c   3.0
>>> df_2 = df_1.groupby('col1').agg(sum_col2=('col2', 'sum'), mean_col2=('col2', 'mean'))
>>> df_2
      sum_col2  mean_col2
col1
a          2.0        2.0
b          0.0        NaN
c          3.0        3.0
>>> np.mean([np.NaN])
nan
>>> np.sum([np.NaN])
nan
>>> np.mean([np.NaN, 2])
nan
>>> np.sum([np.NaN, 2])
nan
```
Therefore, I would expect df_2 to be:

```
      sum_col2  mean_col2
col1
a          NaN        NaN
b          NaN        NaN
c          3.0        3.0
```
The same unexpected result occurs with `df_3 = df_1.groupby('col1').agg(sum_col2=('col2', np.sum), mean_col2=('col2', np.mean))`.
Also, the `min_count=1` suggestion does not solve the problem. For example:

```
>>> df_4 = pd.DataFrame({
...     'col1': ('a', 'a', 'b', 'c', 'd', 'd', 'd', 'e', 'e', 'e'),
...     'col2': (np.NaN, 2, np.NaN, 3, 4, 5, np.NaN, 6, np.NaN, np.NaN)
... })
>>> df_5 = df_4.groupby('col1').sum(min_count=1)
>>> df_5
      col2
col1
a      2.0
b      NaN
c      3.0
d      9.0
e      6.0
```
whereas I would expect df_5 to be:

```
      col2
col1
a      NaN
b      NaN
c      3.0
d      NaN
e      NaN
```
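That result is actually consistent with how `min_count` is defined: it requires at least `min_count` *valid* (non-NaN) values per group, so a group like `(NaN, 2)` still has one valid value and sums to 2.0. A minimal illustration on a plain Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0])   # analogous to group 'a' above
print(s.sum(min_count=1))  # 2.0: one valid value satisfies min_count=1
print(s.sum(min_count=2))  # nan: only one valid value is present
```

So `min_count` only NaNs out groups that are *entirely* missing; it does not propagate NaN through partially-missing groups the way `skipna=False` would.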
There are also problems with std, but those seem more confusing.
Should this be split out into separate issues, or is it the same underlying problem and can be kept as one?
```
>>> pd.__version__
'1.0.3'
>>> np.__version__
'1.18.4'
```
I might try a pull request to solve this.
For now, I was able to bypass the NaN-as-0 problem: I assign numpy.inf to the NaN values in my columns, execute whatever function (prod, mean, sum) with groupby, and then assign numpy.nan to everything that resulted in numpy.inf.
There's an example I posted in a Stack Overflow discussion.
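A sketch of that workaround (assuming plain float columns; inf arithmetic keeps the marker alive through sum and prod, though it can misbehave for std):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b', 'b'],
                   'data': [1.0, np.nan, 1.0, 1.0]})

# Mark NaNs with inf so they survive the groupby reduction...
marked = df.fillna(np.inf)
res = marked.groupby('name').sum()
# ...then turn any inf result back into NaN.
res = res.replace([np.inf, -np.inf], np.nan)
print(res)
```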
If anyone else comes across this issue, FWIW I employ the following solution pending a pandas bug fix:

```
def np_sum(g):
    return np.sum(g.values)

df = pd.DataFrame(data={'name': ['a', 'a', 'b', 'b', 'c'], 'data': [1, np.nan, 1, 1, np.nan]})
g = df.groupby(['name'])
g.agg(col1=('data', 'sum'), col2=('data', np_sum), col3=('data', np.sum), col4=('data', np.nansum))
```

name | col1 | col2 | col3 | col4
-- | -- | -- | -- | --
a | 1 | NaN | 1 | 1
b | 2 | 2 | 2 | 2
c | 0 | NaN | 0 | 0
The actual improvement of adding `skipna` support to groupby reductions is covered in https://github.com/pandas-dev/pandas/issues/15675, so I think this can be closed or repurposed as a documentation issue, to ensure we have clear documentation about this common tricky case.