import pandas as pd
import numpy as np
d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)
When a grouped DataFrame contains an np.nan value, the output is not aligned with numpy.sum or pandas.Series.sum: the result should be NaN, as given by the skipna=False flag on pd.Series.sum and pd.DataFrame.sum.
In [235]: df.v.sum(skipna=False)
Out[235]: nan
However, this behavior is not reflected in the pandas.DataFrame.groupby object
In [237]: df.groupby('l')['v'].sum()['right']
Out[237]: 2.0
and cannot be forced by applying the np.sum method directly
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: 2.0
see this StackOverflow post for a workaround.
The expected output would be
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: nan
and
In [237]: df.groupby('l')['v'].sum(skipna=False)['right']
Out[237]: nan
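The report above can be condensed into one runnable sketch (the variable names here are mine, not from the issue): the plain Series sum honours skipna=False and returns NaN, while the grouped sum silently skips the NaN and returns 2.0.

```python
import numpy as np
import pandas as pd

# Same data as in the report: the 'right' group holds [1, 1, nan].
df = pd.DataFrame({"l": ["left", "right"] * 3,
                   "v": [-1, 1, -1, 1, -1, np.nan]})

# Series.sum respects skipna=False -> nan
series_sum = df["v"].sum(skipna=False)

# groupby(...).sum() skips the NaN by default -> 2.0 for 'right'
grouped_sum = df.groupby("l")["v"].sum()["right"]
```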
Output of pd.show_versions():
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.4
lxml: 3.7.0
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None
np.sum is translated directly to the pandas sum, so this is as expected.
If you really, really want this behavior of np.sum (though this is not useful in any way IMHO):
In [15]: df.groupby('l')['v'].apply(lambda x: np.sum(np.array(x)))['right']
Out[15]: nan
or
In [18]: df.groupby('l')['v'].apply(lambda x: x.sum(skipna=False))
Out[18]:
l
left -3.0
right NaN
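The two workarounds shown above can be put in one self-contained sketch (the variable names are mine): either drop to a raw ndarray so np.sum is applied without pandas' NaN skipping, or call Series.sum with skipna=False inside apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"l": ["left", "right"] * 3,
                   "v": [-1, 1, -1, 1, -1, np.nan]})

# 1) Convert to an ndarray, so np.sum is not translated to pandas' sum.
via_numpy = df.groupby("l")["v"].apply(lambda x: np.sum(np.array(x)))

# 2) Pass skipna=False to Series.sum inside apply.
via_skipna = df.groupby("l")["v"].apply(lambda x: x.sum(skipna=False))

# Both give left == -3.0 and right == NaN.
```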
The pass-through skipna parameter is not implemented on groupby at the moment, so I will make an issue for that (I thought we had one).
This is not useful in any way IMHO.
consider the following
Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN, since missing data is an extremely common occurrence in industry
Either way, I agree with your assessment that it is inconsistent with the current implementation, as skipna should be a kwarg - many thanks for creating the new issue
@flipdazed pandas propagates NaN values on purpose. Generally on aggregations you want to skip them. If you don't, there are many options (e.g. look at .filter on groupby, or simply .fillna), but that is far less common than simply aggregating.
Should a database entry be missing for some aggregated index value, the final figure should be returned as NaN, since missing data is an extremely common occurrence in industry
you really think so? sounds like you don't handle missing data at all.
you really think so?
Jeff, there are certainly cases imaginable where you don't want to ignore missing values. And therefore we have that as a keyword.
sure, but in the vast, vast majority of cases you want to skipna. It's very uncommon, in fact, to assume ALL data is valid; that is my point.
It would be nice to have a keyword and get those NAs back in this case:
groupby(['x']).resample('D').sum()
When I resample after a groupby, I need an aggregating function. It would be nice not to get zeros when I increase the resolution, so that I can use .fillna(method='ffill').
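A sketch of that groupby-then-resample case, under the assumption of a later pandas than the one in this thread (min_count was added to the resample/groupby sums in pandas 0.22; the data and names here are invented for illustration): with min_count=1, empty daily bins come back as NaN instead of 0, so they can then be forward-filled.

```python
import numpy as np
import pandas as pd

# Toy data: two groups, with gaps in the daily dates.
idx = pd.to_datetime(["2017-01-01", "2017-01-03", "2017-01-01", "2017-01-04"])
df = pd.DataFrame({"x": ["a", "a", "b", "b"],
                   "v": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# min_count=1 makes empty bins sum to NaN rather than 0,
# so the gaps survive the aggregation.
out = df.groupby("x")["v"].resample("D").sum(min_count=1)

# Forward-fill the gaps (equivalent to .fillna(method='ffill')).
# Note: this fills across the whole series; for strictly per-group
# filling, use out.groupby(level="x").ffill() instead.
filled = out.ffill()
```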