I found another oddity while digging through #13966.
Begin with the initial DataFrame in that issue:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})
Save the grouping:
In [215]: g = df.groupby('A')
Compute the rolling sum:
In [216]: r = g.rolling(4)
In [217]: r.sum()
Out[217]:
         A      B
A
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
  5    4.0   14.0
  6    4.0   18.0
  7    4.0   22.0
  8    4.0   26.0
  9    4.0   30.0
...    ...    ...
2 30   8.0  114.0
  31   8.0  118.0
3 32   NaN    NaN
  33   NaN    NaN
  34   NaN    NaN
  35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0
[40 rows x 2 columns]
It maintains the by column (A)! That column should not be in the resulting DataFrame.
It gets weirder if I compute the sum over the entire grouping and then redo the rolling calculation. Now the by column is gone, as expected:
In [218]: g.sum()
Out[218]:
     B
A
1  190
2  306
3  284
In [219]: r.sum()
Out[219]:
          B
A
1 0     NaN
  1     NaN
  2     NaN
  3     6.0
  4    10.0
  5    14.0
  6    18.0
  7    22.0
  8    26.0
  9    30.0
...     ...
2 30  114.0
  31  118.0
3 32    NaN
  33    NaN
  34    NaN
  35  134.0
  36  138.0
  37  142.0
  38  146.0
  39  150.0
[40 rows x 1 columns]
So the grouping summation has some sort of side effect.
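To make the side effect easy to check on any installed version, here is a minimal sketch that evaluates the same rolling object before and after calling g.sum() and prints the columns of each result. The names before and after are only illustrative, and the printed columns depend on the pandas version, so no expected output is given.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

g = df.groupby('A')
r = g.rolling(4)

before = r.sum()  # rolling sum before any aggregation on g
g.sum()           # aggregate on the shared groupby object; result discarded
after = r.sum()   # same rolling object, evaluated a second time

# On affected versions the two column lists differ ('A' is present only
# in the first one); on other versions they should match.
print(pd.__version__)
print("columns before g.sum():", list(before.columns))
print("columns after g.sum(): ", list(after.columns))
```

Whatever the columns turn out to be, the B values should be the same in both results; only the presence of the by column is in question.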
A little note while digging through more code: _convert_grouper in groupby.py has:
if isinstance(grouper, dict):
    ...
elif isinstance(grouper, Series):
    ...
elif isinstance(grouper, (list, Series, Index, np.ndarray)):
    ...
else:
    ...
The grouper is compared to Series twice, so the Series entry in the third branch can never match. I will fix this when I clean up everything.
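A small standalone sketch of why the second Series check is dead code. The Series and Index classes below are hypothetical stand-ins, not the pandas types, and which_branch only mimics the shape of the if/elif chain quoted above; the real branch bodies are elided in the thread and are not reproduced here.

```python
class Series:  # hypothetical stand-in, not pandas.Series
    pass


class Index:   # hypothetical stand-in, not pandas.Index
    pass


def which_branch(grouper):
    """Report which branch of the quoted if/elif chain is taken."""
    if isinstance(grouper, dict):
        return "dict"
    elif isinstance(grouper, Series):
        return "series"
    elif isinstance(grouper, (list, Series, Index)):
        # A Series can never reach this point: it was caught above.
        return "list-like"
    else:
        return "other"


def which_branch_simplified(grouper):
    """Same chain with the redundant Series entry removed."""
    if isinstance(grouper, dict):
        return "dict"
    elif isinstance(grouper, Series):
        return "series"
    elif isinstance(grouper, (list, Index)):
        return "list-like"
    else:
        return "other"


samples = [{}, Series(), [], Index(), 42]
print([which_branch(s) for s in samples])
print([which_branch_simplified(s) for s in samples])
```

Both functions pick the same branch for every sample, which is why dropping Series from the tuple should not change behavior.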
I can fix the issue if I set the group selection:
g._set_group_selection()
I think we need this function at the start of .rolling().
Seems similar to #12839
This is defined behavior, in that it is identical to .apply on the groupby.
In [10]: df.groupby('A').rolling(4).sum()
Out[10]:
         A      B
A
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0
[40 rows x 2 columns]
In [11]: df.groupby('A').rolling(4).apply(lambda x: x.sum())
Out[11]:
         A      B
A
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0
[40 rows x 2 columns]
You can look back at the issues; IIRC @jorisvandenbossche and I had a long conversation about this.
Hmm:
In [617]: df.groupby('A').sum()
Out[617]:
     B
A
1  190
2  306
3  284
In [618]: df.groupby('A').apply(lambda x: x.sum())
Out[618]:
    A    B
A
1  20  190
2  24  306
3  24  284
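The same comparison as a script, for checking how a given pandas version treats the by column in .sum() versus .apply(). This is only a sketch: whether .apply() keeps A differs between versions (some also emit a warning about grouping columns), so the code prints the columns rather than assuming them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

summed = df.groupby('A').sum()
applied = df.groupby('A').apply(lambda x: x.sum())

# .sum() drops the by column; .apply() may or may not, depending on version.
print(pd.__version__)
print("sum columns:  ", list(summed.columns))
print("apply columns:", list(applied.columns))
```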
In addition to .rolling() and .apply(), .ohlc() and .expanding() keep the by column following a .groupby().
On reread, this should be consistent, so marking as a bug.
It probably should not include the grouping column/level, even though apply does.
A similar thing happens with index columns.
import numpy as np
import pandas
from pandas import Timestamp
c = pandas.DataFrame({u'ul_payload': {('a', Timestamp('2016-11-01 06:15:00')): 5, ('a', Timestamp('2016-11-01 07:45:00')): 8, ('a', Timestamp('2016-11-01 09:00:00')): 9, ('a', Timestamp('2016-11-01 07:15:00')): 6, ('a', Timestamp('2016-11-01 07:30:00')): 7, ('a', Timestamp('2016-11-01 06:00:00')): 4}, u'dl_payload': {('a', Timestamp('2016-11-01 06:15:00')): 15, ('a', Timestamp('2016-11-01 07:45:00')): 18, ('a', Timestamp('2016-11-01 09:00:00')): 19, ('a', Timestamp('2016-11-01 07:15:00')): 16, ('a', Timestamp('2016-11-01 07:30:00')): 17, ('a', Timestamp('2016-11-01 06:00:00')): 14}})
In [27]: c
Out[27]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00          14           4
  2016-11-01 06:15:00          15           5
  2016-11-01 07:15:00          16           6
  2016-11-01 07:30:00          17           7
  2016-11-01 07:45:00          18           8
  2016-11-01 09:00:00          19           9
In [29]: c.groupby(level=0).rolling(window=3).agg(np.sum)
Out[29]:
                         dl_payload  ul_payload
a a 2016-11-01 06:00:00         NaN         NaN
    2016-11-01 06:15:00         NaN         NaN
    2016-11-01 07:15:00        45.0        15.0
    2016-11-01 07:30:00        48.0        18.0
    2016-11-01 07:45:00        51.0        21.0
    2016-11-01 09:00:00        54.0        24.0
But not with group_keys=False:
In [48]: c.groupby(level=0, group_keys=False).rolling(window=3).agg(np.sum)
Out[48]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00         NaN         NaN
  2016-11-01 06:15:00         NaN         NaN
  2016-11-01 07:15:00        45.0        15.0
  2016-11-01 07:30:00        48.0        18.0
  2016-11-01 07:45:00        51.0        21.0
  2016-11-01 09:00:00        54.0        24.0
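If group_keys=False is not an option, one possible alternative is to drop the duplicated level after the fact. The sketch below uses a cut-down, three-row version of the frame above (an assumption made for brevity) and only drops the leading level when the result actually has more index levels than the input, so it should be harmless on versions that do not duplicate the level.

```python
import pandas as pd

# Cut-down version of the frame in this comment: one group, three timestamps.
idx = pd.MultiIndex.from_product(
    [['a'], pd.to_datetime(['2016-11-01 06:00:00',
                            '2016-11-01 06:15:00',
                            '2016-11-01 07:15:00'])])
c = pd.DataFrame({'dl_payload': [14, 15, 16],
                  'ul_payload': [4, 5, 6]}, index=idx)

rolled = c.groupby(level=0).rolling(window=3).sum()

# Remove the prepended group-key level only if one was actually added.
if rolled.index.nlevels > c.index.nlevels:
    rolled = rolled.droplevel(0)

print(rolled)
```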
Why is the issue closed? The problem is still there (pandas 0.24.2).
This is closed in 0.25, which is coming soon.
Still the same problem in 0.25.
Workaround:
df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
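A slightly more defensive variant of the workaround, as a sketch: it drops the by column only if the installed version left it in the result, then removes the group-key index level as in the one-liner above. The variable names rolled and flat are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

rolled = df.groupby('A').rolling(4).sum()

# Drop the by column only when this pandas version kept it in the result.
if 'A' in rolled.columns:
    rolled = rolled.drop(columns='A')

# Same step as the one-liner: discard the group-key index level.
flat = rolled.reset_index(level=0, drop=True)

print(flat.head(6))
```

The result is indexed like the original df, so it can be assigned back as a new column if needed.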
The problem still exists in v1.0.1