pandas groupby sum min_count misbehaves

Created on 25 Nov 2018 · 12 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2'] = d['col2'].astype(str)
d['col1'] = np.nan
d = d.groupby(lambda x: x, axis=1).sum(min_count=1)

Problem description

My hope is that, by using min_count=1, pandas will return NaN when all of the columns being summed are NaN. However, it is returning 0 instead of NaN. Any idea why?

#### Actual Output

```
   col1  col2
0   0.0   3.0
1   0.0   4.0
```

#### Expected Output

```
   col1  col2
0   NaN   3.0
1   NaN   4.0
```

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Bug, Groupby, Missing-data, Numeric

Most helpful comment

FYI, I have the same issue. I use groupby to sum data and I want to retain the NaNs if there is no data in a group but have a sum if the group does contain data, even if there are some NaNs. I use this workaround now:

def sumgroup(s):
    s = s.sum(min_count=1)
    return s

dftest['new'] = dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup)  # pd.__version__ == 0.24.2

All 12 comments

Thanks for the report. Investigation and PRs always welcome

Hey, could you explain the problem in more detail, and in what cases having NaN as the output would be helpful? We could just add an extra condition to make it work; I just can't think of a good application where this would be used.

I think returning NaN is how the min_count parameter is supposed to behave according to the documentation, right?

min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
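
For reference, the quoted behaviour is easy to check on a plain Series.sum, independent of groupby (a minimal sketch on made-up data):

```
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 -- default min_count=0: the all-NA sum is 0
print(s.sum(min_count=1))  # nan -- fewer than 1 non-NA value is present
```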

FYI, I have the same issue. I use groupby to sum data and I want to retain the NaNs if there is no data in a group but have a sum if the group does contain data, even if there are some NaNs. I use this workaround now:

def sumgroup(s):
    s = s.sum(min_count=1)
    return s

dftest['new'] = dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup)  # pd.__version__ == 0.24.2
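
A self-contained illustration of this workaround (a sketch on hypothetical data; the index level names 'one'/'two' just mirror the snippet above):

```
import numpy as np
import pandas as pd

# Hypothetical data: the ('a', 1) group is all-NaN, the ('b', 2) group has one value
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 1), ('b', 2), ('b', 2)], names=['one', 'two'])
dftest = pd.DataFrame({'data': [np.nan, np.nan, 1.0, np.nan]}, index=idx)

def sumgroup(s):
    # Per-group Series.sum honours min_count, so an all-NA group stays NaN
    return s.sum(min_count=1)

dftest['new'] = dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup)
print(dftest)
# The ('a', 1) rows get NaN in 'new'; the ('b', 2) rows get 1.0
```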

I find the current behavior of returning 0 when all values are NaN really strange. Why would the default be to return 0 when all the data is missing? It seems arbitrary (you could pick any number, not just 0, as the default) and counterintuitive.

Suppose I have data containing the income of individuals over time. The individuals are grouped into families and I want to get the total income of each family. The current behavior would assign a value of 0 to families where all individuals have missing income information. Because it's very possible some families genuinely have no income (all individuals with an income of 0), seeing a bunch of 0s in the resulting family income series wouldn't look strange. You'd then interpret all the zeros as real 0s and proceed with the analysis without knowing your data is corrupted.
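
To make that concern concrete, here is a small hypothetical sketch (the column and family names are made up) showing how an all-NaN group silently becomes 0 under the default min_count=0:

```
import numpy as np
import pandas as pd

income = pd.DataFrame({
    'family': ['jones', 'jones', 'smith', 'smith'],
    'income': [np.nan, np.nan, 50000.0, np.nan],  # the jones family has no data at all
})

print(income.groupby('family')['income'].sum())
# jones        0.0   <- all-missing family reported as a real 0
# smith    50000.0
print(income.groupby('family')['income'].sum(min_count=1))
# jones        NaN   <- the expected result (works on 0.25.1+ per the later comments;
# smith    50000.0      on 0.23.x/0.24.x this is the bug reported here)
```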

Also, to echo @ginward, what, then, is the purpose of the min_count argument?

Also, the documentation for pandas.DataFrame.sum says that the default for an all-NaN series is now to give 0, but this does not happen when you don't use a groupby:

>>> import pandas as pd
>>> pd.__version__
'0.25.1'
>>> df = pd.DataFrame(columns=['a'])
>>> df
Empty DataFrame
Columns: [a]
Index: []
>>> df.sum()
a    0.0
dtype: float64
>>> df.sum(min_count=1)
a   NaN
dtype: float64

UPDATE: It seems like my original critique no longer applies at all, even for groupby. I guess the documentation should be updated?

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'0.25.1'
>>> df = pd.DataFrame([[1, np.nan],[1,np.nan], [2,4]], columns=['a', 'b'])
>>> df.groupby(by='a')['b'].sum()
a
1    0.0
2    4.0
Name: b, dtype: float64
>>> df.groupby(by='a')['b'].sum(min_count=1)
a
1    NaN
2    4.0
Name: b, dtype: float64

Documentation should omit this part:
"""
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
"""

Also, the documentation for pandas.DataFrame.sum says that the default for an all-NaN series is now to give 0, but this does not happen when you don't use a groupby:

How does your example show that? The output of both Series([]).sum() and Series([np.nan]).sum() is 0, as documented.

I misinterpreted the documentation. Because the previous bug (which still applies at least to version '0.24.2') was that groupby sum returned 0 regardless of min_count, I interpreted the default to mean it would return 0 if all values were NaN, which I found really strange.

So I think all is well then in version '0.25.1'. Should this issue be closed then?

FYI, I have the same issue. I use groupby to sum data and I want to retain the NaNs if there is no data in a group but have a sum if the group does contain data, even if there are some NaNs. I use this workaround now:

def sumgroup(s):
    s = s.sum(min_count=1)
    return s

dftest['new'] = dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup)  # pd.__version__ == 0.24.2

For anyone else who ends up here because of problems using groupby sum and min_count with decimal values: @wbijster's workaround also works if you are summing over a column containing decimals 🎉

For my purposes I actually wanted the groupby sum over decimal values to be NaN if any NaNs were present in the series, so I used:

def sumgroup(s):
    s = s.sum(skipna=False)
    return s

(Using pandas 0.25.3 btw)
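
For comparison, here is a small sketch on made-up data of how the two apply-based variants differ: min_count=1 only yields NaN for an all-NA group, while skipna=False yields NaN as soon as the group contains any NaN:

```
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'val': [1.0, np.nan, np.nan, np.nan],
})

print(df.groupby('key')['val'].apply(lambda s: s.sum(min_count=1)))
# a    1.0   <- partial data still sums
# b    NaN   <- all-NA group becomes NaN
print(df.groupby('key')['val'].apply(lambda s: s.sum(skipna=False)))
# a    NaN   <- any NaN in the group propagates
# b    NaN
```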

The workaround described by @AngelaO does work, but it's much slower than .groupby().sum(). I would really appreciate it if sum() on GroupBy objects honored the min_count parameter.
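
As shown earlier in the thread, on pandas 0.25.1 and later the native call appears to honour min_count directly, which avoids the slower apply-based workaround (a sketch; worth verifying on your own version):

```
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [np.nan, np.nan, 4.0]})

# Native groupby sum; on recent pandas the all-NA group 'a' comes back as NaN
print(df.groupby('key')['val'].sum(min_count=1))
```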
