import pandas as pd

df = pd.DataFrame(0, index=[1, 2, 3, 4], columns=['a', 'b', 'c'])
# Starting from 0.25, groupby pad behaves like other groupby functions in removing the grouping column
df.groupby(df.a).mean()
   b  c
a
0  0  0
df.groupby(df.a).pad()
   b  c
1  0  0
2  0  0
3  0  0
4  0  0
# However, for other functions it is possible to keep the grouping column using as_index=False
df.groupby(df.a, as_index=False).mean()
   a  b  c
0  0  0  0
# For ffill/bfill/pad, however, the as_index keyword does not help, since the column is not being used as the index anyway
df.groupby(df.a, as_index=False).pad()
   b  c
1  0  0
2  0  0
3  0  0
4  0  0
# There is no way of keeping the column, other than creating a redundant copy of the column Series
df.groupby(df.a.copy()).pad()
   a  b  c
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0
Starting from 0.25, ffill behaves like other groupby functions in removing the grouping column. As far as I understand, there is no non-hacky way of getting the old behaviour back. This is not consistent with what happens with other functions, where the keyword argument as_index can be used to keep the grouping column.
The only way of getting the old behaviour is to group by a copy of the column. However, this view-dependent behaviour is not only very confusing (at least to me), it is also inefficient, as it requires a redundant copy, and it is not a backward-compatible solution.
I would suggest either adding a keep_group_columns keyword that works consistently across all groupby functions, or adding a special keyword only for the groupby functions that keep the original index.
Another option would be to change the view-dependent behaviour of groupby, making it consistent with e.g. set_index:
# If the column's name is passed (and the default drop=True is used), the column is dropped
df.set_index('a')
   b  c
a
0  0  0
0  0  0
0  0  0
0  0  0
# If a Series is given as input (no matter whether it is a view or not), the column is kept
df.set_index(df.a)
   a  b  c
a
0  0  0  0
0  0  0  0
0  0  0  0
0  0  0  0
This behaviour would fix this issue, as it would be sufficient to write:
df.groupby(df.a).pad()
to get the old behaviour and:
df.groupby('a').pad()
to get the new behaviour.
I would be glad to help write a patch making groupby's behaviour consistent with set_index if there is agreement on this.
@goriccardo you are confusing an aggregation with a transformation here.
transforms by definition do not include the groupby's columns
In [3]: df.groupby('a').pad()
Out[3]:
   b  c
1  0  0
2  0  0
3  0  0
4  0  0

In [4]: df.groupby('a').transform('mean')
Out[4]:
   b  c
1  0  0
2  0  0
3  0  0
4  0  0
see the docs at the end of this section: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#transformation
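For illustration (a toy frame with non-zero values, not the all-zero frame above), the shape difference between the two is easier to see:
import pandas as pd
toy = pd.DataFrame({'a': [1, 1, 2], 'b': [10, 20, 30]})
# Aggregation: one row per group, indexed by the grouping column 'a'
toy.groupby('a').mean()
      b
a
1  15.0
2  30.0
# Transformation: one row per original row, the grouping column is dropped
toy.groupby('a').transform('mean')
      b
0  15.0
1  15.0
2  30.0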
Hi @jreback, thank you for the clarification. In the case of a transformation, what would be the right way to keep the column?
In [18]: df.groupby('a').pad().assign(a=df.a)
Out[18]:
   b  c  a
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0
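One note (my addition, not from the thread): assign aligns on the index, which works here because pad() is a transformation and keeps the original row index. If the original column order matters, the columns can be reordered afterwards:
df.groupby('a').pad().assign(a=df.a)[df.columns]
   a  b  c
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0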
Okay, this construction was correct in 0.24.2:
df.groupby(key)\
  .bfill()\
  .groupby(key)\
  .apply(...)
Now it is not, because bfill loses the grouping key. And okay, it's understandable that you'd like to roll back to the previous behaviour, but having some option to not lose possibly useful information would be quite nice to have. @jreback, any thoughts about it?
The current workaround for this operation seems to be:
df.groupby(key)\
  .apply(lambda x: x.bfill())
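For concreteness, on the frame from the top of the thread this keeps the grouping column (output as I would expect under the 0.25 behaviour discussed here; included for illustration):
df.groupby('a').apply(lambda x: x.bfill())
   a  b  c
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0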
Thank you @jreback. While I still believe it would be nice to have an explicit way of keeping the column, your solution works fine for me.
Cheers
Going back to the OP, however, I wonder if we shouldn't repurpose this issue to raise when a user provides a non-default value for as_index with a transformation. The documentation states it only works for aggregations, but it would still have been nice to signal up front in the OP's case that something is awry.
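Concretely, the idea would be something like the following from the user's side (purely hypothetical, not current pandas behaviour; the exact message is made up):
df.groupby('a', as_index=False).pad()
# proposed: instead of silently ignoring as_index, raise something like
# ValueError: as_index=False has no effect on a transformation such as pad/ffill/bfill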
@goriccardo is this something you'd have interest in pursuing?
@goriccardo Here's another maybe slightly more elegant way
In [19]: df.set_index('a').groupby(level='a').pad().reset_index()
Out[19]:
   a  b  c
0  0  0  0
1  0  0  0
2  0  0  0
3  0  0  0
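One difference from the assign approach above (my observation, not from the thread): the round trip through reset_index replaces the original row index (1 to 4 here) with a fresh 0-based one. If the original index matters, a variant I would consider, sketched under the same assumptions:
df.set_index('a', append=True).groupby(level='a').pad().reset_index('a')
   a  b  c
1  0  0  0
2  0  0  0
3  0  0  0
4  0  0  0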