Pandas: Rolling groupby should not maintain the by column in the resulting DataFrame

Created on 16 Aug 2016 · 10 Comments · Source: pandas-dev/pandas

I found another oddity while digging through #13966.

Begin with the initial DataFrame in that issue:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

Save the grouping:

In [215]: g = df.groupby('A')

Compute the rolling sum:

In [216]: r = g.rolling(4)

In [217]: r.sum()
Out[217]:
         A      B
A
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
  5    4.0   14.0
  6    4.0   18.0
  7    4.0   22.0
  8    4.0   26.0
  9    4.0   30.0
...    ...    ...
2 30   8.0  114.0
  31   8.0  118.0
3 32   NaN    NaN
  33   NaN    NaN
  34   NaN    NaN
  35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

It maintains the by column (A)! That column should not be in the resulting DataFrame.

It gets weirder if I compute the sum over the entire grouping and then redo the rolling calculation. Now the by column is gone, as expected:

In [218]: g.sum()
Out[218]:
     B
A
1  190
2  306
3  284

In [219]: r.sum()
Out[219]:
          B
A
1 0     NaN
  1     NaN
  2     NaN
  3     6.0
  4    10.0
  5    14.0
  6    18.0
  7    22.0
  8    26.0
  9    30.0
...     ...
2 30  114.0
  31  118.0
3 32    NaN
  33    NaN
  34    NaN
  35  134.0
  36  138.0
  37  142.0
  38  146.0
  39  150.0

[40 rows x 1 columns]

So the grouping summation has some sort of side effect.
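Until this is settled, one possible workaround is to select the value columns explicitly before calling .rolling(), so the result does not depend on whether the by column is kept. This is only a sketch and not something proposed in the report itself; it assumes B is the only column of interest:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8,
                   "B": np.arange(40)})

# Selecting [["B"]] before rolling restricts the computation to B,
# so A can only show up as the outer index level, never as a value column.
rolled = df.groupby("A")[["B"]].rolling(4).sum()

print(list(rolled.columns))
print(rolled.loc[(1, 3), "B"])
```

The double brackets keep the result a DataFrame; single brackets would give a Series named B instead.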

Labels: API Design, Bug, Groupby, Reshaping

chrisaycock

Most helpful comment

The problem still exists in v1.0.1

KraxelHuber on 20 Feb 2020

👍5

All 10 comments

A little note while digging through more code: _convert_grouper in groupby.py has:

    if isinstance(grouper, dict):
        ...
    elif isinstance(grouper, Series):
        ...
    elif isinstance(grouper, (list, Series, Index, np.ndarray)):
        ...
    else:
        ...

The grouper is checked against Series twice, so the Series entry in the third branch's tuple can never match. I will fix this when I clean everything up.
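To make the redundancy concrete, here is a small stand-alone sketch. The function which_branch is hypothetical and only mirrors the branch order quoted above; it is not the actual pandas code:

```python
import numpy as np
import pandas as pd


def which_branch(grouper):
    """Hypothetical stand-in that reports which isinstance branch matches."""
    if isinstance(grouper, dict):
        return "dict"
    elif isinstance(grouper, pd.Series):
        return "Series"
    elif isinstance(grouper, (list, pd.Series, pd.Index, np.ndarray)):
        # A Series never reaches this branch: the previous one already caught it.
        return "list-like"
    else:
        return "other"


print(which_branch(pd.Series([1, 2])))
print(which_branch([1, 2]))
```

Dropping Series from the tuple in the third branch would not change which branch any input takes.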

chrisaycock on 16 Aug 2016

I can fix the issue if I set the group selection:

g._set_group_selection()

I think we need to call this function at the start of .rolling().

Seems similar to #12839

chrisaycock on 17 Aug 2016

This is defined behavior, in that it is identical to .apply on the groupby.

In [10]: df.groupby('A').rolling(4).sum()
Out[10]: 
         A      B
A                
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

In [11]: df.groupby('A').rolling(4).apply(lambda x: x.sum())
Out[11]: 
         A      B
A                
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

You can look back at the issues; IIRC, @jorisvandenbossche and I had a long conversation about this.
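The equivalence between the built-in rolling sum and .apply can also be checked programmatically. This is a rough sketch; it restricts the computation to B (an assumption made here so the check does not depend on whether A is kept) and compares values with NaNs treated as equal:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8,
                   "B": np.arange(40)})

# Restrict to B so the comparison is independent of how A is handled.
r = df.groupby("A")[["B"]].rolling(4)

via_sum = r.sum()
via_apply = r.apply(lambda x: x.sum())

# NaNs at the start of each group's window are treated as equal.
same = np.allclose(via_sum["B"], via_apply["B"], equal_nan=True)
print(same)
```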

jreback on 17 Aug 2016

Hmm:

In [617]: df.groupby('A').sum()
Out[617]:
     B
A
1  190
2  306
3  284

In [618]: df.groupby('A').apply(lambda x: x.sum())
Out[618]:
    A    B
A
1  20  190
2  24  306
3  24  284

In addition to .rolling() and .apply(), .ohlc() and .expanding() keep the by column following a .groupby().

chrisaycock on 17 Aug 2016

On rereading, this should be consistent, so I'm marking it as a bug.
It probably should not include the grouping column/level, even though apply does.

jreback on 1 Sep 2016

👍1

A similar thing happens when grouping by an index level: the group key is duplicated as an extra level in the result's index.

import numpy as np
from pandas import DataFrame, Timestamp

# Two payload columns keyed by (name, timestamp) tuples, which become a two-level index
c = DataFrame({
    u'ul_payload': {('a', Timestamp('2016-11-01 06:15:00')): 5,
                    ('a', Timestamp('2016-11-01 07:45:00')): 8,
                    ('a', Timestamp('2016-11-01 09:00:00')): 9,
                    ('a', Timestamp('2016-11-01 07:15:00')): 6,
                    ('a', Timestamp('2016-11-01 07:30:00')): 7,
                    ('a', Timestamp('2016-11-01 06:00:00')): 4},
    u'dl_payload': {('a', Timestamp('2016-11-01 06:15:00')): 15,
                    ('a', Timestamp('2016-11-01 07:45:00')): 18,
                    ('a', Timestamp('2016-11-01 09:00:00')): 19,
                    ('a', Timestamp('2016-11-01 07:15:00')): 16,
                    ('a', Timestamp('2016-11-01 07:30:00')): 17,
                    ('a', Timestamp('2016-11-01 06:00:00')): 14}})

In [27]: c
Out[27]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00          14           4
  2016-11-01 06:15:00          15           5
  2016-11-01 07:15:00          16           6
  2016-11-01 07:30:00          17           7
  2016-11-01 07:45:00          18           8
  2016-11-01 09:00:00          19           9

In [29]: c.groupby(level=0).rolling(window=3).agg(np.sum)
Out[29]:
                         dl_payload  ul_payload
a a 2016-11-01 06:00:00         NaN         NaN
    2016-11-01 06:15:00         NaN         NaN
    2016-11-01 07:15:00        45.0        15.0
    2016-11-01 07:30:00        48.0        18.0
    2016-11-01 07:45:00        51.0        21.0
    2016-11-01 09:00:00        54.0        24.0

But not with group_keys=False:

In [48]: c.groupby(level=0, group_keys=False).rolling(window=3).agg(np.sum)
Out[48]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00         NaN         NaN
  2016-11-01 06:15:00         NaN         NaN
  2016-11-01 07:15:00        45.0        15.0
  2016-11-01 07:30:00        48.0        18.0
  2016-11-01 07:45:00        51.0        21.0
  2016-11-01 09:00:00        54.0        24.0
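If the duplicated level is unwanted and group_keys=False is not an option, another way is to drop the extra level after the fact. The following is a minimal sketch, assuming the same data as above rebuilt in sorted order; the name frame and the conditional check are illustrative choices, and whether the group key gets prepended may vary between pandas versions:

```python
import pandas as pd

times = pd.to_datetime([
    "2016-11-01 06:00", "2016-11-01 06:15", "2016-11-01 07:15",
    "2016-11-01 07:30", "2016-11-01 07:45", "2016-11-01 09:00",
])
idx = pd.MultiIndex.from_arrays([["a"] * 6, times])
frame = pd.DataFrame({"dl_payload": list(range(14, 20)),
                      "ul_payload": list(range(4, 10))}, index=idx)

rolled = frame.groupby(level=0).rolling(window=3).sum()

# If the group key was prepended as an additional index level, remove it
# so the result has the same index layout as the input.
if rolled.index.nlevels > frame.index.nlevels:
    rolled = rolled.droplevel(0)

print(rolled)
```

The nlevels check keeps the snippet harmless on versions where no extra level is added.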