Starting with this example:
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
'B': np.arange(40)})
I can easily compute the rolling mean by identifier:
In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
...
2 30 28.5
31 29.5
3 32 NaN
33 NaN
34 NaN
35 33.5
36 34.5
37 35.5
38 36.5
39 37.5
Name: B, dtype: float64
Now I want to add a timestamp column:
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
'B': np.concatenate((dates, dates)),
'C': np.arange(40)})
The timestamps are ordered within each identifier, but pandas complains:
In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic
Re-sorting leads to a different error:
In [26]: df.sort_values('B', inplace=True)
In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None
But we know that these column names are valid:
n [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0 0.000000
20 10.000000
1 7.000000
21 10.500000
2 8.800000
22 11.000000
3 9.857143
23 11.500000
4 10.857143
24 12.500000
...
35 24.714286
15 23.500000
36 25.714286
16 24.500000
37 26.714286
17 25.500000
38 27.714286
18 26.500000
19 25.857143
39 27.500000
Name: C, dtype: float64
It seems like a bug that time-based .rolling() does not work with .groupby().
Agree that this should work!
Is there a good workaround?
Is the any way to do time aware rolling with group by for now before the new pandas release?
@ShashankBharadwaj this is not fixed, so new release or not is not going to matter.
I stumbled on this yesterday as I was trying to solve the same problem.
I found a workaround, it's definitely not efficient, but it works.
import pandas as pd
import numpy as np
%load_ext watermark
%watermark -v -m -p pandas,numpy
CPython 3.5.1
IPython 4.2.0
pandas 0.19.2
numpy 1.11.0
compiler : MSC v.1900 64 bit (AMD64)
system : Windows
release : 7
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores : 8
interpreter: 64bit
# load up the example dataframe
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
'B': np.concatenate((dates, dates)),
'C': np.arange(40)})
# sort on the datetime column
df.sort_values('B', inplace=True)
#group it
grouped = df.groupby('A')
# tmp array to hold frames
frames = []
for group in grouped.groups:
frame = grouped.get_group(group)
frame['avg'] = frame.rolling('4s', on='B').C.mean()
frames.append(frame)
pd.concat(frames)
Out[18]:
A B C avg
0 1 2016-01-01 09:30:00 0 0.0
1 1 2016-01-01 09:30:01 1 0.5
2 1 2016-01-01 09:30:02 2 1.0
3 1 2016-01-01 09:30:03 3 1.5
4 1 2016-01-01 09:30:04 4 2.5
5 1 2016-01-01 09:30:05 5 3.5
6 1 2016-01-01 09:30:06 6 4.5
7 1 2016-01-01 09:30:07 7 5.5
8 1 2016-01-01 09:30:08 8 6.5
9 1 2016-01-01 09:30:09 9 7.5
10 1 2016-01-01 09:30:10 10 8.5
11 1 2016-01-01 09:30:11 11 9.5
12 1 2016-01-01 09:30:12 12 10.5
13 1 2016-01-01 09:30:13 13 11.5
14 1 2016-01-01 09:30:14 14 12.5
15 1 2016-01-01 09:30:15 15 13.5
16 1 2016-01-01 09:30:16 16 14.5
17 1 2016-01-01 09:30:17 17 15.5
18 1 2016-01-01 09:30:18 18 16.5
19 1 2016-01-01 09:30:19 19 17.5
20 2 2016-01-01 09:30:00 20 20.0
21 2 2016-01-01 09:30:01 21 20.5
22 2 2016-01-01 09:30:02 22 21.0
23 2 2016-01-01 09:30:03 23 21.5
24 2 2016-01-01 09:30:04 24 22.5
25 2 2016-01-01 09:30:05 25 23.5
26 2 2016-01-01 09:30:06 26 24.5
27 2 2016-01-01 09:30:07 27 25.5
28 2 2016-01-01 09:30:08 28 26.5
29 2 2016-01-01 09:30:09 29 27.5
30 2 2016-01-01 09:30:10 30 28.5
31 2 2016-01-01 09:30:11 31 29.5
32 3 2016-01-01 09:30:12 32 32.0
33 3 2016-01-01 09:30:13 33 32.5
34 3 2016-01-01 09:30:14 34 33.0
35 3 2016-01-01 09:30:15 35 33.5
36 3 2016-01-01 09:30:16 36 34.5
37 3 2016-01-01 09:30:17 37 35.5
38 3 2016-01-01 09:30:18 38 36.5
39 3 2016-01-01 09:30:19 39 37.5
Hope this helps anyone in the meantime before a bug fix is provided. I haven't contributed to pandas yet, but having used it so much, maybe it's about time :)
@zscholl Thanks for your workaround. A bug fixes would be great if you're offering.
I'll take a look at it in the coming weeks, @chrisaycock and see what I can do!
so this actually works on master now. fixed by #15694 (this will make sure that when sorting a multi-index it actually IS sorting it, previously it would not guarantee monotonic levels, only lexsortedness).
so this just needs tests
In [7]: pd.options.display.max_rows=12
In [8]: pd.__version__
Out[8]: '0.19.0+829.gb17e286'
In [9]: dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
...: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
...: 'B': np.concatenate((dates, dates)),
...: 'C': np.arange(40)})
...:
In [10]: df.groupby('A').rolling('4s', on='B').C.mean()
Out[10]:
A B
1 2016-01-01 09:30:00 0.0
2016-01-01 09:30:01 0.5
2016-01-01 09:30:02 1.0
2016-01-01 09:30:03 1.5
2016-01-01 09:30:04 2.5
2016-01-01 09:30:05 3.5
...
3 2016-01-01 09:30:14 33.0
2016-01-01 09:30:15 33.5
2016-01-01 09:30:16 34.5
2016-01-01 09:30:17 35.5
2016-01-01 09:30:18 36.5
2016-01-01 09:30:19 37.5
Name: C, Length: 40, dtype: float64
pandas 0.21.0
This works ok as above:
df.groupby('A').rolling('4s', on='B').C.mean()
But this doesn't:
df.groupby('A').rolling('4s', on='B',closed='left').C.mean()
Gives error:
Traceback (most recent call last):
File "
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 176, in __getattr__
return self[attr]
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 255, in __getitem__
return self._gotitem(key, ndim=2)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1322, in _gotitem
return super(RollingGroupby, self)._gotitem(key, ndim, subset=subset)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 678, in _gotitem
*kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 697, in __init__
super(GroupByMixin, self).__init__(obj, *args, *kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 86, in __init__
self.validate()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1124, in validate
raise ValueError("closed only implemented for datetimelike "
ValueError: closed only implemented for datetimelike and offset based windows
The same problem with @sronen71. Has it fixed?
Additionally, when using Dask;
ddf['K'] = ddf[['A', 'B', 'C']].groupby(by='A').apply(lambda x: x.rolling('90d', on='B')['C'].sum(), meta=('K', 'float64'))
Returns the ValueError: B must be monotonic error. The groupby is done in Dask, but the rolling is in Pandas land.
how about the progress?
Probably want to hope a new issue with a minimal example if there are
problems.
On Thu, Mar 14, 2019 at 5:36 AM lycanthropes notifications@github.com
wrote:
how about the progress?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/13966#issuecomment-472794962,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABQHIvyTz3o5rB64DCox66u4MRJ9J5qcks5vWiYdgaJpZM4JiWlY
.
this issue is fixed
What I am looking for is to do monthly rolling. Now pandas still does not work with this issue. Such as:
df.groupby('stock_code')['ret_daily'].rolling(window='3M').std()
I can't use window='90D' instead, because sometimes it means a look-ahead bias in strategy backtesting.
Most helpful comment
pandas 0.21.0
This works ok as above:
df.groupby('A').rolling('4s', on='B').C.mean()
But this doesn't:
df.groupby('A').rolling('4s', on='B',closed='left').C.mean()
Gives error:
Traceback (most recent call last):", line 1, in
File "
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 176, in __getattr__
return self[attr]
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 255, in __getitem__
return self._gotitem(key, ndim=2)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1322, in _gotitem
return super(RollingGroupby, self)._gotitem(key, ndim, subset=subset)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 678, in _gotitem
*kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 697, in __init__
super(GroupByMixin, self).__init__(obj, *args, *kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 86, in __init__
self.validate()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1124, in validate
raise ValueError("closed only implemented for datetimelike "
ValueError: closed only implemented for datetimelike and offset based windows