Pandas: Time-based .rolling() fails with .groupby()

Created on 11 Aug 2016  ·  16 Comments  ·  Source: pandas-dev/pandas

Starting with this example:

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

I can easily compute the rolling mean by identifier:

In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
   7      5.5
   8      6.5
   9      7.5
         ...
2  30    28.5
   31    29.5
3  32     NaN
   33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64

Now I want to add a timestamp column:

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

The timestamps are ordered within each identifier, but pandas complains:

In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic

Re-sorting leads to a different error:

In [26]: df.sort_values('B', inplace=True)

In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None

But we know that these column names are valid:

In [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0      0.000000
20    10.000000
1      7.000000
21    10.500000
2      8.800000
22    11.000000
3      9.857143
23    11.500000
4     10.857143
24    12.500000
        ...
35    24.714286
15    23.500000
36    25.714286
16    24.500000
37    26.714286
17    25.500000
38    27.714286
18    26.500000
19    25.857143
39    27.500000
Name: C, dtype: float64

It seems like a bug that time-based .rolling() does not work with .groupby().

Labels: Bug, Groupby, Reshaping

All 16 comments

Agree that this should work!

Is there a good workaround?

Is there any way to do time-aware rolling with a groupby for now, before the new pandas release?

@ShashankBharadwaj this is not fixed, so new release or not is not going to matter.

I stumbled on this yesterday as I was trying to solve the same problem.

I found a workaround, it's definitely not efficient, but it works.

import pandas as pd

import numpy as np

%load_ext watermark

%watermark -v -m -p pandas,numpy
CPython 3.5.1
IPython 4.2.0

pandas 0.19.2
numpy 1.11.0

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit

# load up the example dataframe
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# sort on the datetime column
df.sort_values('B', inplace=True)

# group by identifier
grouped = df.groupby('A')

# temporary list to hold the per-group frames
frames = []

for group in grouped.groups:
    frame = grouped.get_group(group).copy()  # copy to avoid SettingWithCopyWarning
    frame['avg'] = frame.rolling('4s', on='B').C.mean()
    frames.append(frame)

pd.concat(frames)
Out[18]: 
    A                   B   C   avg
0   1 2016-01-01 09:30:00   0   0.0
1   1 2016-01-01 09:30:01   1   0.5
2   1 2016-01-01 09:30:02   2   1.0
3   1 2016-01-01 09:30:03   3   1.5
4   1 2016-01-01 09:30:04   4   2.5
5   1 2016-01-01 09:30:05   5   3.5
6   1 2016-01-01 09:30:06   6   4.5
7   1 2016-01-01 09:30:07   7   5.5
8   1 2016-01-01 09:30:08   8   6.5
9   1 2016-01-01 09:30:09   9   7.5
10  1 2016-01-01 09:30:10  10   8.5
11  1 2016-01-01 09:30:11  11   9.5
12  1 2016-01-01 09:30:12  12  10.5
13  1 2016-01-01 09:30:13  13  11.5
14  1 2016-01-01 09:30:14  14  12.5
15  1 2016-01-01 09:30:15  15  13.5
16  1 2016-01-01 09:30:16  16  14.5
17  1 2016-01-01 09:30:17  17  15.5
18  1 2016-01-01 09:30:18  18  16.5
19  1 2016-01-01 09:30:19  19  17.5
20  2 2016-01-01 09:30:00  20  20.0
21  2 2016-01-01 09:30:01  21  20.5
22  2 2016-01-01 09:30:02  22  21.0
23  2 2016-01-01 09:30:03  23  21.5
24  2 2016-01-01 09:30:04  24  22.5
25  2 2016-01-01 09:30:05  25  23.5
26  2 2016-01-01 09:30:06  26  24.5
27  2 2016-01-01 09:30:07  27  25.5
28  2 2016-01-01 09:30:08  28  26.5
29  2 2016-01-01 09:30:09  29  27.5
30  2 2016-01-01 09:30:10  30  28.5
31  2 2016-01-01 09:30:11  31  29.5
32  3 2016-01-01 09:30:12  32  32.0
33  3 2016-01-01 09:30:13  33  32.5
34  3 2016-01-01 09:30:14  34  33.0
35  3 2016-01-01 09:30:15  35  33.5
36  3 2016-01-01 09:30:16  36  34.5
37  3 2016-01-01 09:30:17  37  35.5
38  3 2016-01-01 09:30:18  38  36.5
39  3 2016-01-01 09:30:19  39  37.5

Hope this helps anyone in the meantime before a bug fix is provided. I haven't contributed to pandas yet, but having used it so much, maybe it's about time :)
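A more compact variant of the same workaround (a sketch, not from the thread) pushes the per-group rolling through `groupby().apply()` instead of an explicit loop; since each sub-frame is already monotonic in 'B', the monotonicity check passes per group, and the preserved index lets the result align back onto the original frame:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# Roll within each sub-frame; group_keys=False keeps the original index
# so the result aligns when assigned back as a column.
df['avg'] = (df.groupby('A', group_keys=False)
               .apply(lambda g: g.rolling('4s', on='B').C.mean()))
```

This produces the same 'avg' column as the loop-and-concat version above.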

@zscholl Thanks for your workaround. A bug fix would be great if you're offering.

I'll take a look at it in the coming weeks, @chrisaycock and see what I can do!

so this actually works on master now. fixed by #15694 (this will make sure that when sorting a multi-index it actually IS sorting it, previously it would not guarantee monotonic levels, only lexsortedness).

so this just needs tests

In [7]: pd.options.display.max_rows=12

In [8]: pd.__version__
Out[8]: '0.19.0+829.gb17e286'

In [9]: dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
   ...: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
   ...:                    'B': np.concatenate((dates, dates)),
   ...:                    'C': np.arange(40)})
   ...: 

In [10]: df.groupby('A').rolling('4s', on='B').C.mean()
Out[10]: 
A  B                  
1  2016-01-01 09:30:00     0.0
   2016-01-01 09:30:01     0.5
   2016-01-01 09:30:02     1.0
   2016-01-01 09:30:03     1.5
   2016-01-01 09:30:04     2.5
   2016-01-01 09:30:05     3.5
                          ... 
3  2016-01-01 09:30:14    33.0
   2016-01-01 09:30:15    33.5
   2016-01-01 09:30:16    34.5
   2016-01-01 09:30:17    35.5
   2016-01-01 09:30:18    36.5
   2016-01-01 09:30:19    37.5
Name: C, Length: 40, dtype: float64
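On a version where this is fixed, the grouped result can be cross-checked against rolling each group independently — a quick sanity check (my sketch, not from the thread) that would serve as the basis for the tests mentioned above:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

grouped = df.groupby('A').rolling('4s', on='B').C.mean()

# Rolling each sub-frame on its own must give the same values in the same order.
manual = pd.concat(g.rolling('4s', on='B').C.mean() for _, g in df.groupby('A'))

assert np.allclose(grouped.to_numpy(), manual.to_numpy())
```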

pandas 0.21.0
This works ok as above:
df.groupby('A').rolling('4s', on='B').C.mean()
But this doesn't:
df.groupby('A').rolling('4s', on='B',closed='left').C.mean()
Gives error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 176, in __getattr__
    return self[attr]
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 255, in __getitem__
    return self._gotitem(key, ndim=2)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1322, in _gotitem
    return super(RollingGroupby, self)._gotitem(key, ndim, subset=subset)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 678, in _gotitem
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 697, in __init__
    super(GroupByMixin, self).__init__(obj, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 86, in __init__
    self.validate()
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1124, in validate
    raise ValueError("closed only implemented for datetimelike "
ValueError: closed only implemented for datetimelike and offset based windows
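Where the grouped wrapper rejects closed='left', the same per-group trick as earlier in the thread sidesteps the validation, since plain DataFrame.rolling accepts closed for offset windows — a sketch, not from the thread:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# Apply the closed='left' rolling on each sub-frame directly, then
# stitch the per-group results back together.
left_means = pd.concat(
    g.rolling('4s', on='B', closed='left').C.mean()
    for _, g in df.groupby('A')
)
```

With closed='left' the window is [t-4s, t), so each group's first observation has an empty window and yields NaN.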

I have the same problem as @sronen71. Has it been fixed?

Additionally, when using Dask;

ddf['K'] = ddf[['A', 'B', 'C']].groupby(by='A').apply(lambda x: x.rolling('90d', on='B')['C'].sum(), meta=('K', 'float64'))

This raises ValueError: B must be monotonic. The groupby is done in Dask, but the rolling happens in pandas land.

how about the progress?

You probably want to open a new issue with a minimal example if there are
problems.


this issue is fixed

What I am looking for is monthly rolling, which pandas still does not support:
df.groupby('stock_code')['ret_daily'].rolling(window='3M').std()

I can't use window='90D' instead, because that can introduce a look-ahead bias in strategy backtesting.
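Offset-based rolling only accepts fixed frequencies, and 'M' spans a variable number of days, so '3M' is rejected. One possible workaround (a sketch of mine, not an established pandas API for this) is to build the variable window per timestamp with pd.DateOffset(months=3); the column names 'stock_code', 'date', and 'ret_daily' follow the comment above, and the data below is synthetic:

```python
import pandas as pd
import numpy as np

# Synthetic daily returns for two hypothetical stocks over different periods.
df = pd.DataFrame({
    'stock_code': ['X'] * 100 + ['Y'] * 80,
    'date': list(pd.date_range('2020-01-01', periods=100, freq='D'))
            + list(pd.date_range('2021-01-01', periods=80, freq='D')),
    'ret_daily': np.arange(180, dtype='float64'),
})

def trailing_3m_std(g):
    s = g.set_index('date')['ret_daily'].sort_index()
    # For each timestamp, look back exactly three calendar months (inclusive),
    # which label-based slicing on the sorted DatetimeIndex handles directly.
    return pd.Series(
        [s.loc[ts - pd.DateOffset(months=3):ts].std() for ts in s.index],
        index=s.index)

result = df.groupby('stock_code', group_keys=True).apply(trailing_3m_std)
```

This is O(n) window lookups per group rather than a single vectorized pass, so it is a stopgap, not a replacement for native '3M' support.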
