import pandas as pd

df = pd.DataFrame(0, index=[1, 2, 3, 4], columns=['a', 'b', 'c'])
# Starting from 0.25, groupby pad behaves like other groupby functions in removing the grouping column
df.groupby(df.a).mean()
   b  c
a
0  0  0
df.groupby(df.a).pad()
   b  c
1  0  0
2  0  0
3  0  0
4  0  0
# However, for other functions it is possible to keep the grouping column using as_index=False
df.groupby(df.a, as_index=False).mean()
   a  b  c
0  0  0  0
# For ffill/bfill/pad, however, the as_index keyword does not help, since the column is not being used as the index anyway
df.groupby(df.a, as_index=False).pad()
   b  c
1  0  0
2  0  0
3  0  0
4  0  0
# There is no way of keeping the column, other than creating a redundant copy of the column Series
df.groupby(df.a.copy()).pad()
   a  b  c
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0
Starting from 0.25, ffill behaves like other groupby functions in removing the grouping column. As far as I understand, there is no non-hacky way of getting the old behaviour back. This is not consistent with what happens with other functions, where the keyword argument as_index can be used to keep the grouping column.
The only way of getting the old behaviour is to group by a copy of the column. However, this view-dependent behaviour is not only very confusing (at least to me), it is also inefficient, as it requires a redundant copy, and it is not a backward-compatible solution.
I would suggest either adding a keep_group_columns keyword that works consistently across all groupby functions, or adding a special keyword only for the groupby functions that keep the original index.
Another option would be to change the view-dependent behaviour of groupby, making it consistent with e.g. set_index:
# If the column's name is passed (and the default drop=True is used), the column is dropped
df.set_index('a')
   b  c
a
0  0  0
0  0  0
0  0  0
0  0  0
# If a Series is given as input (no matter whether it is a view or not), the column is kept
df.set_index(df.a)
   a  b  c
a
0  0  0  0
0  0  0  0
0  0  0  0
0  0  0  0
This behaviour would fix this issue, as it would be sufficient to write:
df.groupby(df.a).pad()
to get the old behaviour and:
df.groupby('a').pad()
to get the new behaviour.
I would be glad to help write a patch making groupby's behaviour consistent with set_index if there is agreement on this.
@goriccardo you are confusing an aggregation with a transformation here.
transforms by definition do not include the groupby's columns
In [3]: df.groupby('a').pad()
Out[3]:
   b  c
1  0  0
2  0  0
3  0  0
4  0  0

In [4]: df.groupby('a').transform('mean')
Out[4]:
   b  c
1  0  0
2  0  0
3  0  0
4  0  0
see the docs at the end of this section: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#transformation
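For illustration (a toy frame with non-zero values, not the all-zero frame above), the shape difference between the two is easier to see:
import pandas as pd
toy = pd.DataFrame({'a': [1, 1, 2], 'b': [10, 20, 30]})
# Aggregation: one row per group, indexed by the grouping column 'a'
toy.groupby('a').mean()
      b
a
1  15.0
2  30.0
# Transformation: one row per original row, the grouping column is dropped
toy.groupby('a').transform('mean')
      b
0  15.0
1  15.0
2  30.0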
Hi @jreback, thank you for the clarification. In the case of a transformation, what would be the right way to keep the column?
In [18]: df.groupby('a').pad().assign(a=df.a)
Out[18]:
   b  c  a
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0
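One note (my addition, not from the thread): assign aligns on the index, which works here because pad() is a transformation and keeps the original row index. If the original column order matters, the columns can be reordered afterwards:
df.groupby('a').pad().assign(a=df.a)[df.columns]
   a  b  c
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0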
Okay, this construction was correct in 0.24.2:
df.groupby(key)\
  .bfill()\
  .groupby(key)\
  .apply(...)
Now it is not, because bfill loses the grouping key. And okay, it's understandable that you'd like to roll back to the previous behaviour, but having some option to not lose possibly useful information would be quite nice to have. @jreback, any thoughts about it?
The current workaround for this operation seems to be:
df.groupby(key)\
  .apply(lambda x: x.bfill())
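For concreteness, on the frame from the top of the thread this keeps the grouping column (output as I would expect under the 0.25 behaviour discussed here; included for illustration):
df.groupby('a').apply(lambda x: x.bfill())
   a  b  c
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0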
Thank you @jreback. While I still believe it would be nice to have an explicit way of keeping the column, your solution works fine for me.
Cheers
Going back to the OP, however, I wonder if we shouldn't repurpose this issue to raise when a user provides a non-default value for as_index with a transformation. The documentation states it only works for aggregations, but it would still have been nice to signal up front in the OP's case that something is awry.
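Concretely, the idea would be something like the following from the user's side (purely hypothetical, not current pandas behaviour; the exact message is made up):
df.groupby('a', as_index=False).pad()
# proposed: instead of silently ignoring as_index, raise something like
# ValueError: as_index=False has no effect on a transformation such as pad/ffill/bfill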
@goriccardo is this something you'd have interest in pursuing?
@goriccardo Here's another maybe slightly more elegant way
In [19]: df.set_index('a').groupby(level='a').pad().reset_index()
Out[19]:
   a  b  c
0  0  0  0
1  0  0  0
2  0  0  0
3  0  0  0
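One difference from the assign approach above (my observation, not from the thread): the round trip through reset_index replaces the original row index (1 to 4 here) with a fresh 0-based one. If the original index matters, a variant I would consider, sketched under the same assumptions:
df.set_index('a', append=True).groupby(level='a').pad().reset_index('a')
   a  b  c
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0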