Pandas: Sum of pd.DataFrame.groupby.sum containing NaN should return NaN?

Created on 13 Mar 2017 · 7 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
d = {'l':  ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

Problem description

When a grouped dataframe contains an np.nan value, the output is not aligned with that of numpy.sum or pandas.Series.sum.

NaN is returned when the skipna=False flag is given to pd.Series.sum, and likewise to pd.DataFrame.sum:

In [235]: df.v.sum(skipna=False)
Out[235]: nan

However, this behavior is not reflected in the pandas.DataFrame.groupby object

In [237]: df.groupby('l')['v'].sum()['right']
Out[237]: 2.0

and cannot be forced by applying the np.sum function directly

In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: 2.0

see this StackOverflow post for a workaround
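One workaround along those lines (a sketch, not necessarily the linked answer) compares each group's non-NA count against its total size and masks any group that contains a missing value:

```python
import pandas as pd
import numpy as np

d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

# Mask out any group whose non-NA count differs from its row count,
# i.e. any group containing at least one missing value.
grouped = df.groupby('l')['v']
result = grouped.sum().where(grouped.count() == grouped.size())
print(result)
# left    -3.0
# right    NaN
```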

Expected Output

In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: nan

and

In [237]: df.groupby('l')['v'].sum(skipna=False)['right']
Out[237]: nan

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.4
lxml: 3.7.0
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

Labels: Groupby, Missing-data, Usage Question

All 7 comments

np.sum is translated directly to the pandas sum, so this is as expected.

If you really, really want the behavior of np.sum, it can be forced as below. This is not useful in any way IMHO.

In [15]: df.groupby('l')['v'].apply(lambda x: np.sum(np.array(x)))['right']
Out[15]: nan

or

In [18]: df.groupby('l')['v'].apply(lambda x: x.sum(skipna=False))
Out[18]: 
l
left    -3.0
right    NaN

The pass-through skipna parameter is not implemented on groupby at the moment, so I will make an issue for that (I thought we had one).

This is not useful in any way IMHO.

consider the following

  • merge two dataframes together from an SQL database
  • carry out data manipulation to get an aggregated figure per some index values that is dependent on all values being present

Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN, since missing data is an extremely common occurrence in industry.

Either way, I agree with your assessment that it is inconsistent with the current implementation, as skipna should be a kwarg. Many thanks for creating the new issue.

@flipdazed pandas propagates NaN values on purpose. Generally on aggregations you want to skip them. If you don't, there are many options (e.g. look at .filter on groupby, or simply .fillna), but that is far less common than simply aggregating.
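A quick sketch of those two options, re-creating the frame from the code sample above:

```python
import pandas as pd
import numpy as np

d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)

# .filter drops whole groups that fail a predicate, here any group
# containing a missing value; only the 'left' group survives.
complete = df.groupby('l').filter(lambda g: g['v'].notna().all())
sums = complete.groupby('l')['v'].sum()
print(sums)
# left   -3.0

# .fillna makes the treatment of missing values explicit up front.
filled = df.fillna({'v': 0}).groupby('l')['v'].sum()
print(filled)
# left    -3.0
# right    2.0
```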

Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN, since missing data is an extremely common occurrence in industry.

you really think so? sounds like you don't handle missing data at all.

you really think so?

Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.

Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.

Sure, but in the vast, vast majority of cases you want to skipna. It's very uncommon in fact to assume ALL data is valid; that is my point.

It would be nice to have a keyword and get those NAs back in this case:
groupby(['x']).resample('D').sum()

When I resample after a groupby I need an aggregating function. It would be nice not to get zeros when I increase the resolution, so that I can use .fillna(method='ffill').
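A sketch of that idea using the min_count keyword, which was added to the groupby/resample sum methods in pandas 0.22 (after the version in this report). The example drops the groupby level for brevity, but the same keyword applies after .groupby(...).resample(...):

```python
import pandas as pd
import numpy as np

# Two observations a day apart leave an empty bin when upsampled to
# daily frequency; by default that empty bin sums to 0.
s = pd.Series([1.0, 2.0],
              index=pd.to_datetime(['2017-03-01', '2017-03-03']))

# min_count=1 turns bins with fewer than one valid value into NaN,
# which a forward-fill can then carry over.
result = s.resample('D').sum(min_count=1).ffill()
print(result)
# 2017-03-01    1.0
# 2017-03-02    1.0
# 2017-03-03    2.0
```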
