Pandas: Time-based .rolling() fails with .groupby()

Created on 11 Aug 2016  ·  16 Comments  ·  Source: pandas-dev/pandas

Starting with this example:

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

I can easily compute the rolling mean by identifier:

In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
   7      5.5
   8      6.5
   9      7.5
         ...
2  30    28.5
   31    29.5
3  32     NaN
   33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64

Now I want to add a timestamp column:

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

The timestamps are ordered within each identifier, but pandas complains:

In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic

Re-sorting leads to a different error:

In [26]: df.sort_values('B', inplace=True)

In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None

But we know that these column names are valid:

In [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0      0.000000
20    10.000000
1      7.000000
21    10.500000
2      8.800000
22    11.000000
3      9.857143
23    11.500000
4     10.857143
24    12.500000
        ...
35    24.714286
15    23.500000
36    25.714286
16    24.500000
37    26.714286
17    25.500000
38    27.714286
18    26.500000
19    25.857143
39    27.500000
Name: C, dtype: float64

It seems like a bug that time-based .rolling() does not work with .groupby().

Labels: Bug, Groupby, Reshaping

All 16 comments

Agree that this should work!

Is there a good workaround?

Is there any way to do time-aware rolling with a groupby for now, before the new pandas release?

@ShashankBharadwaj this is not fixed, so new release or not is not going to matter.

I stumbled on this yesterday as I was trying to solve the same problem.

I found a workaround, it's definitely not efficient, but it works.

import pandas as pd

import numpy as np

%load_ext watermark

%watermark -v -m -p pandas,numpy
CPython 3.5.1
IPython 4.2.0

pandas 0.19.2
numpy 1.11.0

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit

# load up the example dataframe
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# sort on the datetime column
df.sort_values('B', inplace=True)

# group by identifier
grouped = df.groupby('A')

# temporary list to hold the per-group frames
frames = []

for group in grouped.groups:
    frame = grouped.get_group(group).copy()  # copy to avoid SettingWithCopyWarning
    frame['avg'] = frame.rolling('4s', on='B').C.mean()
    frames.append(frame)

pd.concat(frames)
Out[18]: 
    A                   B   C   avg
0   1 2016-01-01 09:30:00   0   0.0
1   1 2016-01-01 09:30:01   1   0.5
2   1 2016-01-01 09:30:02   2   1.0
3   1 2016-01-01 09:30:03   3   1.5
4   1 2016-01-01 09:30:04   4   2.5
5   1 2016-01-01 09:30:05   5   3.5
6   1 2016-01-01 09:30:06   6   4.5
7   1 2016-01-01 09:30:07   7   5.5
8   1 2016-01-01 09:30:08   8   6.5
9   1 2016-01-01 09:30:09   9   7.5
10  1 2016-01-01 09:30:10  10   8.5
11  1 2016-01-01 09:30:11  11   9.5
12  1 2016-01-01 09:30:12  12  10.5
13  1 2016-01-01 09:30:13  13  11.5
14  1 2016-01-01 09:30:14  14  12.5
15  1 2016-01-01 09:30:15  15  13.5
16  1 2016-01-01 09:30:16  16  14.5
17  1 2016-01-01 09:30:17  17  15.5
18  1 2016-01-01 09:30:18  18  16.5
19  1 2016-01-01 09:30:19  19  17.5
20  2 2016-01-01 09:30:00  20  20.0
21  2 2016-01-01 09:30:01  21  20.5
22  2 2016-01-01 09:30:02  22  21.0
23  2 2016-01-01 09:30:03  23  21.5
24  2 2016-01-01 09:30:04  24  22.5
25  2 2016-01-01 09:30:05  25  23.5
26  2 2016-01-01 09:30:06  26  24.5
27  2 2016-01-01 09:30:07  27  25.5
28  2 2016-01-01 09:30:08  28  26.5
29  2 2016-01-01 09:30:09  29  27.5
30  2 2016-01-01 09:30:10  30  28.5
31  2 2016-01-01 09:30:11  31  29.5
32  3 2016-01-01 09:30:12  32  32.0
33  3 2016-01-01 09:30:13  33  32.5
34  3 2016-01-01 09:30:14  34  33.0
35  3 2016-01-01 09:30:15  35  33.5
36  3 2016-01-01 09:30:16  36  34.5
37  3 2016-01-01 09:30:17  37  35.5
38  3 2016-01-01 09:30:18  38  36.5
39  3 2016-01-01 09:30:19  39  37.5

Hope this helps anyone in the meantime before a bug fix is provided. I haven't contributed to pandas yet, but having used it so much, maybe it's about time :)
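A more compact variant of the same workaround (a sketch, not from the thread) pushes the per-group rolling through `groupby().apply()` instead of an explicit loop; since each sub-frame is already monotonic in 'B', the monotonicity check passes per group, and the preserved index lets the result align back onto the original frame:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# Roll within each sub-frame; group_keys=False keeps the original index
# so the result aligns when assigned back as a column.
df['avg'] = (df.groupby('A', group_keys=False)
               .apply(lambda g: g.rolling('4s', on='B').C.mean()))
```

This produces the same 'avg' column as the loop-and-concat version above.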

@zscholl Thanks for your workaround. A bug fix would be great if you're offering.

I'll take a look at it in the coming weeks, @chrisaycock and see what I can do!

so this actually works on master now. fixed by #15694 (this will make sure that when sorting a multi-index it actually IS sorting it, previously it would not guarantee monotonic levels, only lexsortedness).

so this just needs tests

In [7]: pd.options.display.max_rows=12

In [8]: pd.__version__
Out[8]: '0.19.0+829.gb17e286'

In [9]: dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
   ...: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
   ...:                    'B': np.concatenate((dates, dates)),
   ...:                    'C': np.arange(40)})
   ...: 

In [10]: df.groupby('A').rolling('4s', on='B').C.mean()
Out[10]: 
A  B                  
1  2016-01-01 09:30:00     0.0
   2016-01-01 09:30:01     0.5
   2016-01-01 09:30:02     1.0
   2016-01-01 09:30:03     1.5
   2016-01-01 09:30:04     2.5
   2016-01-01 09:30:05     3.5
                          ... 
3  2016-01-01 09:30:14    33.0
   2016-01-01 09:30:15    33.5
   2016-01-01 09:30:16    34.5
   2016-01-01 09:30:17    35.5
   2016-01-01 09:30:18    36.5
   2016-01-01 09:30:19    37.5
Name: C, Length: 40, dtype: float64
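On a version where this is fixed, the grouped result can be cross-checked against rolling each group independently — a quick sanity check (my sketch, not from the thread) that would serve as the basis for the tests mentioned above:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

grouped = df.groupby('A').rolling('4s', on='B').C.mean()

# Rolling each sub-frame on its own must give the same values in the same order.
manual = pd.concat(g.rolling('4s', on='B').C.mean() for _, g in df.groupby('A'))

assert np.allclose(grouped.to_numpy(), manual.to_numpy())
```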

pandas 0.21.0
This works ok as above:
df.groupby('A').rolling('4s', on='B').C.mean()
But this doesn't:
df.groupby('A').rolling('4s', on='B',closed='left').C.mean()
Gives error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 176, in __getattr__
    return self[attr]
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 255, in __getitem__
    return self._gotitem(key, ndim=2)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1322, in _gotitem
    return super(RollingGroupby, self)._gotitem(key, ndim, subset=subset)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 678, in _gotitem
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 697, in __init__
    super(GroupByMixin, self).__init__(obj, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 86, in __init__
    self.validate()
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1124, in validate
    raise ValueError("closed only implemented for datetimelike "
ValueError: closed only implemented for datetimelike and offset based windows
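Where the grouped wrapper rejects closed='left', the same per-group trick as earlier in the thread sidesteps the validation, since plain DataFrame.rolling accepts closed for offset windows — a sketch, not from the thread:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# Apply the closed='left' rolling on each sub-frame directly, then
# stitch the per-group results back together.
left_means = pd.concat(
    g.rolling('4s', on='B', closed='left').C.mean()
    for _, g in df.groupby('A')
)
```

With closed='left' the window is [t-4s, t), so each group's first observation has an empty window and yields NaN.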

I have the same problem as @sronen71. Has it been fixed?

Additionally, when using Dask;

ddf['K'] = ddf[['A', 'B', 'C']].groupby(by='A').apply(lambda x: x.rolling('90d', on='B')['C'].sum(), meta=('K', 'float64'))

This raises ValueError: B must be monotonic. The groupby is done in Dask, but the rolling happens in pandas land.

how about the progress?

You probably want to open a new issue with a minimal example if there are
problems.


this issue is fixed

What I am looking for is monthly rolling, which pandas still does not support:
df.groupby('stock_code')['ret_daily'].rolling(window='3M').std()

I can't use window='90D' instead, because that can introduce a look-ahead bias in strategy backtesting.
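Offset-based rolling only accepts fixed frequencies, and 'M' spans a variable number of days, so '3M' is rejected. One possible workaround (a sketch of mine, not an established pandas API for this) is to build the variable window per timestamp with pd.DateOffset(months=3); the column names 'stock_code', 'date', and 'ret_daily' follow the comment above, and the data below is synthetic:

```python
import pandas as pd
import numpy as np

# Synthetic daily returns for two hypothetical stocks over different periods.
df = pd.DataFrame({
    'stock_code': ['X'] * 100 + ['Y'] * 80,
    'date': list(pd.date_range('2020-01-01', periods=100, freq='D'))
            + list(pd.date_range('2021-01-01', periods=80, freq='D')),
    'ret_daily': np.arange(180, dtype='float64'),
})

def trailing_3m_std(g):
    s = g.set_index('date')['ret_daily'].sort_index()
    # For each timestamp, look back exactly three calendar months (inclusive),
    # which label-based slicing on the sorted DatetimeIndex handles directly.
    return pd.Series(
        [s.loc[ts - pd.DateOffset(months=3):ts].std() for ts in s.index],
        index=s.index)

result = df.groupby('stock_code', group_keys=True).apply(trailing_3m_std)
```

This is O(n) window lookups per group rather than a single vectorized pass, so it is a stopgap, not a replacement for native '3M' support.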
