```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2'] = d['col2'].astype(str)
d['col1'] = np.nan
d = d.groupby(lambda x: x, axis=1).sum(min_count=1)
```
My hope is that, by using `min_count=1`, pandas will return NaN when all of the columns being summed are NaN. However, it returns 0 instead of NaN. Any idea why?
#### Output
```
  col1  col2
0  0.0   3.0
1  0.0   4.0
```
#### Expected Output
```
  col1  col2
0  NaN   3.0
1  NaN   4.0
```
#### Output of `pd.show_versions()`
```
commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```
Thanks for the report. Investigation and PRs are always welcome.
Hey, could you explain the problem in more detail, and describe the cases where having NaNs in the output would be helpful? We could just add an extra condition to make it work, but I can't think of a good application where this would be used.
I think returning NaN is how the `min_count` parameter is supposed to behave according to the documentation, right?
```
min_count : int, default 0
    The required number of valid values to perform the operation. If fewer
    than min_count non-NA values are present the result will be NA.

    New in version 0.22.0: Added with the default being 0. This means the sum
    of an all-NA or empty Series is 0, and the product of an all-NA or empty
    Series is 1.
```
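For reference, here is a quick sketch (with made-up data) of how `min_count` behaves on a plain `Series.sum`, matching the documented behavior quoted above:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

print(s.sum())             # 0.0 -> default min_count=0: the sum of an all-NA Series is 0
print(s.sum(min_count=1))  # nan -> fewer than 1 non-NA value, so the result is NA
print(s.prod())            # 1.0 -> likewise, the product of an all-NA Series defaults to 1
```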
FYI, I have the same issue. I use groupby to sum data and I want to retain the NaNs if there is no data in a group but have a sum if the group does contain data, even if there are some NaNs. I use this workaround now:
```python
def sumgroup(s):
    s = s.sum(min_count=1)
    return s

dftest['new'] = dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup)  # pd.__version__ == 0.24.2
```
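For anyone who wants to try it, here is a self-contained sketch of that workaround; the frame `dftest` and its index levels `one` and `two` are hypothetical, made up to match the snippet above:

```python
import numpy as np
import pandas as pd

# Hypothetical two-level index; group ('a', 2) contains only NaN.
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 1), ('a', 2), ('b', 1)], names=['one', 'two'])
dftest = pd.DataFrame({'data': [1.0, np.nan, np.nan, 3.0]}, index=idx)

def sumgroup(s):
    return s.sum(min_count=1)

print(dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup))
# one  two
# a    1      1.0  <- NaNs are skipped when the group has real data
#      2      NaN  <- an all-NaN group stays NaN
# b    1      3.0
```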
I find the current behavior of returning 0 when all values are NaN really strange. Why would the default be to return 0 if missing all data? Seems sort of arbitrary (you could pick any number, not just 0 as the default) and counterintuitive.
Suppose I have data containing the income of individuals over time. The individuals are grouped into families and I want to get the total income of each family. The current behavior would assign a value of 0 to families where all individuals have missing income information. Because it's very possible some families don't have any income (all individuals with an income of 0), seeing a bunch of 0s in the resulting family income series wouldn't be too strange. You'd then interpret all the zeros as reflecting real 0s and would proceed with the analysis without knowing your data is corrupted.
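To make that concrete, a small sketch with made-up data (on versions where groupby sum honors `min_count`, such as the 0.25.1 behavior shown later in this thread):

```python
import numpy as np
import pandas as pd

# Family "B" has no reported income at all; family "A" has one real value.
df = pd.DataFrame({
    'family': ['A', 'A', 'B', 'B'],
    'income': [100.0, np.nan, np.nan, np.nan],
})

print(df.groupby('family')['income'].sum())
# A    100.0
# B      0.0  <- looks like a genuine zero income, but the data is simply missing

print(df.groupby('family')['income'].sum(min_count=1))
# A    100.0
# B      NaN  <- the missingness is preserved
```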
Also, to echo @ginward, what, then, is the purpose of the min_count argument?
Also, the documentation for `pandas.DataFrame.sum` says that the default for an all-NaN series is to give 0 now, but this does not happen when you don't use a groupby:
```python
>>> import pandas as pd
>>> pd.__version__
'0.25.1'
>>> df = pd.DataFrame(columns=['a'])
>>> df
Empty DataFrame
Columns: [a]
Index: []
>>> df.sum()
a    0.0
dtype: float64
>>> df.sum(min_count=1)
a   NaN
dtype: float64
```
UPDATE: It seems like my original critique no longer applies at all, even for groupby. I guess the documentation should be updated?
```python
>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'0.25.1'
>>> df = pd.DataFrame([[1, np.nan], [1, np.nan], [2, 4]], columns=['a', 'b'])
>>> df.groupby(by='a')['b'].sum()
a
1    0.0
2    4.0
Name: b, dtype: float64
>>> df.groupby(by='a')['b'].sum(min_count=1)
a
1    NaN
2    4.0
Name: b, dtype: float64
```
Documentation should omit this part:
"""
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
"""
> Also, the documentation for `pandas.DataFrame.sum` says that the default for an all-NaN series is to give 0 now, but this does not happen when you don't use a groupby:
How does your example show that? The output of `Series([]).sum()` and `Series([np.nan]).sum()` is 0 in both cases, as documented.
I misinterpreted the documentation. Because the previous bug (which at least applies to version '0.24.2') was to return 0 regardless of `min_count` (for groupby sum), I interpreted the default to mean it would return 0 if all values were NaN, which I found really strange.
So I think all is well in version '0.25.1'. Should this issue be closed then?
> FYI, I have the same issue. I use groupby to sum data and I want to retain the NaNs if there is no data in a group but have a sum if the group does contain data, even if there are some NaNs. I use this workaround now:
>
> ```python
> def sumgroup(s):
>     s = s.sum(min_count=1)
>     return s
>
> dftest['new'] = dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup)  # pd.__version__ == 0.24.2
> ```
For anyone else who ends up here because of problems using groupby sum and `min_count` with decimal values: @wbijster's workaround also works if you are summing over a column containing decimals 🎉
For my purposes I actually wanted the groupby sum over decimal values to be NaN if any NaNs were present in the series, so I used:
```python
def sumgroup(s):
    s = s.sum(skipna=False)
    return s
```
(Using pandas 0.25.3 btw)
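A small sketch (with made-up data) of how the two variants differ: `skipna=False` makes any group containing a NaN sum to NaN, while `min_count=1` only yields NaN for groups with no valid values at all:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'grp': ['x', 'x', 'y'],
    'val': [1.0, np.nan, 2.0],
})

# skipna=False: any NaN in the group makes the whole group sum NaN.
print(df.groupby('grp')['val'].apply(lambda s: s.sum(skipna=False)))
# x    NaN
# y    2.0

# min_count=1: NaN only when the group contains no valid values at all.
print(df.groupby('grp')['val'].apply(lambda s: s.sum(min_count=1)))
# x    1.0
# y    2.0
```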
The workaround described by @AngelaO does work, but it's much slower than `.groupby().sum()`. I would really appreciate it if `sum()` on GroupBys honored the `min_count` parameter.
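If the apply-based version is too slow, one vectorised alternative (just a sketch, not an official pandas API) is to run the fast cythonized sum and then blank out the groups that had no valid values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'grp': ['x', 'x', 'y'],
    'val': [np.nan, np.nan, 2.0],
})

grouped = df.groupby('grp')['val']
result = grouped.sum()                 # x -> 0.0, y -> 2.0 (fast cython path)
result[grouped.count() == 0] = np.nan  # x -> NaN: no valid values in that group
print(result)
```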