Summing a bool column after a groupby gives a bool result until there are two or more True values, at which point it becomes float64. It seems like it should always be an (unsigned?) integer. A straight sum without a groupby always gives int64. This is with pandas 0.13.1.
>>> pd.DataFrame([True]).groupby(lambda x: 0).sum()
      0
0  True
>>> pd.DataFrame([True,True]).groupby(lambda x: 0).sum()
   0
0  2
>>> pd.DataFrame([False]).groupby(lambda x: 0).sum()
       0
0  False
>>> pd.DataFrame([False,False]).groupby(lambda x: 0).sum()
       0
0  False
>>> pd.DataFrame([False,False,True]).groupby(lambda x: 0).sum()
      0
0  True
>>> pd.DataFrame([False,False,True,True]).groupby(lambda x: 0).sum()
   0
0  2
>>> pd.DataFrame([False,False]).sum()
0    0
dtype: int64
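For anyone who wants to check their own pandas version, here is a rough sketch that prints the dtype of each grouped sum from the cases above next to the plain sum. The helper name grouped_sum_dtype is made up for illustration, and the printed dtypes will depend on the installed version, so no expected output is shown.

```python
import pandas as pd

# The same bool inputs as in the report above.
cases = [
    [True],
    [True, True],
    [False],
    [False, False],
    [False, False, True],
    [False, False, True, True],
]


def grouped_sum_dtype(values):
    """Return the dtype of column 0 after a single-group groupby sum.

    Hypothetical helper, not part of pandas.
    """
    result = pd.DataFrame(values).groupby(lambda x: 0).sum()
    return result[0].dtype


# Map each input to the dtype name that the grouped sum produced.
report = {tuple(values): str(grouped_sum_dtype(values)) for values in cases}
for values, dtype in report.items():
    print(values, "->", dtype)

# For comparison: the plain (ungrouped) sum of a bool column.
plain_dtype = pd.DataFrame([False, False]).sum().dtype
print("plain sum ->", plain_dtype)
```

If every line reports the same integer dtype, the installed version does not show the inconsistency described here.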
This is a dupe of #3752, but I like your examples better, so I will keep this issue!
It's possible to fix, but it hasn't been high on the list of priorities.
As for getting float64 instead of int64 as the result, a possible workaround is to aggregate with numpy's count_nonzero instead of sum:
>>> pd.DataFrame([True,True]).groupby(lambda x: 0).agg(pd.np.count_nonzero)[0]
0 2
Name: 0, dtype: int64
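Another option, which is my own assumption rather than something proposed in this thread, is to cast the bool column to an integer dtype before grouping, so that the aggregation never sees bools at all. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([True, True])

# Cast bool -> int64 first, then group and sum as usual.
counts = df.astype("int64").groupby(lambda x: 0).sum()

print(counts[0].dtype)
print(counts)
```

This costs an extra copy of the data, but the result dtype should no longer depend on how many True values each group happens to contain.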
For some additional context: sometimes the user may not know they are dealing with a bool type. This can occur when performing a groupby on the result of pd.get_dummies, which may, but does not always, return columns of type uint8. If get_dummies returns uint16, the issue above is not triggered, and dummies_result.groupby(...).sum() returns int types. If any of the counts in dummies is small enough, the groupby result will be float.
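A defensive pattern for the get_dummies case, sketched under the assumption that an explicit cast is acceptable: inspect the dummy dtypes, then convert to int64 before the groupby. The frame and the column names "group" and "color" are invented for illustration.

```python
import pandas as pd

# Hypothetical data: two groups and one categorical column.
raw = pd.DataFrame(
    {
        "group": ["a", "a", "b"],
        "color": ["red", "blue", "red"],
    }
)

dummies = pd.get_dummies(raw["color"])

# The dtype of the dummy columns varies between pandas versions,
# so it is worth looking at before aggregating.
print(dummies.dtypes)

# Cast explicitly so the grouped sum is int64 regardless of the dummy dtype.
counts = dummies.astype("int64").groupby(raw["group"]).sum()
print(counts)
print(counts.dtypes)
```

With the cast in place, the result no longer depends on which dtype get_dummies picked or on how large the counts are.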
This is really confusing, as it means some code might work as expected on some data while running into an error on other data. I would much appreciate it if this could be fixed.