Pandas: Sum of grouped bool column has inconsistent type

Created on 29 Apr 2014 · 4 comments · Source: pandas-dev/pandas

Summing a bool column after a groupby gives a bool result until there are two or more True values, at which point it becomes float64. It seems like the result should always be an (unsigned?) integer. A straight sum without a groupby always gives int64. This is with pandas 0.13.1.

import pandas as pd

pd.DataFrame([True]).groupby(lambda x: 0).sum()
      0
0  True

pd.DataFrame([True,True]).groupby(lambda x: 0).sum()
   0
0  2

pd.DataFrame([False]).groupby(lambda x: 0).sum()
       0
0  False

pd.DataFrame([False,False]).groupby(lambda x: 0).sum()
       0
0  False

pd.DataFrame([False,False,True]).groupby(lambda x: 0).sum()
      0
0  True

pd.DataFrame([False,False,True,True]).groupby(lambda x: 0).sum()
   0
0  2

pd.DataFrame([False,False]).sum()
0    0
dtype: int64

Bug Dtypes Groupby

jkleint

👍3
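
One way to sidestep the inconsistency described above is to cast the bool column to an integer dtype before grouping, so the result dtype no longer depends on how many True values each group holds. This is a minimal sketch, not a fix for the underlying behaviour; the helper name `grouped_true_count` is hypothetical and not part of pandas.

```python
import pandas as pd


def grouped_true_count(values):
    """Count True values per group with a stable integer dtype.

    Hypothetical helper: casts the bool column to int64 before the
    groupby, so the sum is computed on integers from the start.
    """
    df = pd.DataFrame(values)
    # Same single-group key as in the examples above: every row maps to group 0.
    return df.astype("int64").groupby(lambda x: 0).sum()


for values in ([True], [True, True], [False, False], [False, False, True, True]):
    result = grouped_true_count(values)
    print(values, "->", result.iloc[0, 0], result.dtypes.iloc[0])
```

The cast costs one extra copy of the column, which is usually negligible next to the groupby itself.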

All 4 comments

This is a dupe of #3752, but I like your examples better, so I'll keep this issue!

It's possible to fix, but it hasn't been high on the list of priorities.

jreback on 29 Apr 2014

As for getting float64 instead of int64 as result, a possible workaround is to use count_nonzero from numpy instead of sum to aggregate:

>>> pd.DataFrame([True,True]).groupby(lambda x: 0).agg(pd.np.count_nonzero)[0]
0    2
Name: 0, dtype: int64

xflr6 on 14 Feb 2016
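
The `pd.np` alias used in the workaround above was deprecated in pandas 1.0 and removed in 2.0, so on newer versions NumPy has to be imported directly. A sketch of the same idea under that assumption (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([True, True])

# Select column 0 first, then count the non-zero (i.e. True) entries per group.
# np.count_nonzero returns an integer, so the result is not upcast to float.
counts = df.groupby(lambda x: 0)[0].agg(np.count_nonzero)

print(counts)
```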

For some additional context: sometimes the user may not know they are dealing with a bool type. This may occur when performing a groupby on the result of pd.get_dummies, which may return columns of type uint8, but not always. If get_dummies returns uint16, the issue above is not triggered, and dummies_result.groupby(...).sum() returns int types. If any of the counts in dummies is small enough, the groupby result will be float.

ediphy-dwild on 30 Oct 2018
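
For the get_dummies case above, a defensive sketch is to cast the indicator columns to a fixed integer dtype before grouping, whatever dtype get_dummies happened to return. The frame and its column names (`group`, `color`) are made up for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b"],
        "color": ["red", "blue", "red"],
    }
)

# The dtype of the indicator columns depends on the pandas version,
# so it is not relied upon here.
dummies = pd.get_dummies(df["color"])

# Cast explicitly, then sum per group; the counts stay int64.
counts = dummies.astype("int64").groupby(df["group"]).sum()

print(counts)
print(counts.dtypes)
```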

This is really very confusing, as it means some code might work as expected on some data while running into an error on other data. I would much appreciate it if this could be fixed.

aflugge on 12 Nov 2019

👍4
