In https://github.com/pandas-dev/pandas/pull/11603#issuecomment-162113949 (the main PR implementing the deferred API for rolling / expanding / ewm), we discussed how to specify table-wise applies. Groupby.apply(f) feeds the entire group (all columns) to f. For backwards-compatibility, .rolling(n).apply(f) needed to be column-wise.
https://github.com/pandas-dev/pandas/pull/11603#issuecomment-162116556 mentions a possible API like what I added for .style:
axis=0: apply to each column independently
axis=1: apply to each row independently
axis=None: apply the supplied function to the entire table
So it'd be df.rolling(n).apply(f, axis=None).
Do people like the axis=0 / 1 / None idiom? Is it obvious enough?
This is prompted by @josef-pkt's post on the mailing list about needing a rolling OLS.
An example:
In [2]: import numpy as np
...: import pandas as pd
...:
...: np.random.seed(0)
...: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=["A", "B"])
...: df
...:
Out[2]:
A B
0 5 0
1 3 3
2 7 9
3 3 5
4 2 4
5 7 6
6 8 8
7 1 6
8 7 7
9 8 1
For a concrete example, get the table-wise max (this is equivalent to df.rolling(4).max().max(1))
In [10]: df.rolling(4).apply(np.max, axis=None)
Out[10]:
0 NaN
1 NaN
2 NaN
3 9.0
4 9.0
5 9.0
6 8.0
7 8.0
8 8.0
9 8.0
dtype: float64
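Until something like axis=None exists, the table-wise result can be emulated by slicing each window out by hand. A minimal sketch, assuming a hypothetical helper rolling_table_apply (not part of pandas) and the frame shown above typed in literally rather than regenerated from the seed:

```python
import numpy as np
import pandas as pd

# Same values as the example frame above.
df = pd.DataFrame({
    "A": [5, 3, 7, 3, 2, 7, 8, 1, 7, 8],
    "B": [0, 3, 9, 5, 4, 6, 8, 6, 7, 1],
})

def rolling_table_apply(frame, window, func):
    """Hypothetical helper: call func on each full window (all columns)."""
    out = pd.Series(np.nan, index=frame.index, dtype="float64")
    for stop in range(window, len(frame) + 1):
        block = frame.iloc[stop - window:stop]
        # func sees a 2-D (window x n_columns) array and returns a scalar.
        out.iloc[stop - 1] = func(block.to_numpy())
    return out

table_max = rolling_table_apply(df, 4, np.max)
# The column-wise route mentioned above, for comparison.
columnwise = df.rolling(4).max().max(axis=1)
```

The Python loop is slow, but it makes the intended semantics explicit: one call per window, with every column visible to the function.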
A real example is something like a rolling OLS:
import statsmodels.api as sm
f = lambda x: sm.OLS.from_formula('A ~ B', data=x).fit() # wrong, but w/e
df.rolling(5).apply(f, axis=None)
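Rolling.apply expects a single value back from each call, so returning the fitted results object would not work as written. A rough sketch of one workaround is to loop over the windows manually and keep one number per fit. np.polyfit is swapped in for statsmodels here to keep the example dependency-free, and rolling_slope and demo are hypothetical names with made-up data:

```python
import numpy as np
import pandas as pd

def rolling_slope(frame, window, y="A", x="B"):
    """Hypothetical helper: slope of y ~ x fitted on each full window."""
    out = pd.Series(np.nan, index=frame.index, dtype="float64")
    for stop in range(window, len(frame) + 1):
        block = frame.iloc[stop - window:stop]
        # Degree-1 least-squares fit; polyfit returns (slope, intercept).
        slope, _intercept = np.polyfit(
            block[x].to_numpy(dtype=float),
            block[y].to_numpy(dtype=float),
            1,
        )
        out.iloc[stop - 1] = slope
    return out

# Made-up data with an exact linear relation A = 2 * B + 1.
b = np.array([0.0, 1.0, 3.0, 4.0, 7.0, 8.0, 10.0, 13.0])
demo = pd.DataFrame({"A": 2 * b + 1, "B": b})
slopes = rolling_slope(demo, window=5)
```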
Can you put up a simple example with the various options exercised (e.g. simulate the output)?
Updated with an example.
I also changed the suggested API: Before I had
df.rolling(n, axis=None).apply(f)
But really it should be
df.rolling(n).apply(f, axis=None).
The .rolling(axis=.) parameter controls the direction for rolling. The .rolling(...).apply(f, axis=.) parameter controls the axis for function application.
@TomAugspurger correct me if I am wrong, but what you really want is for .apply to be passed one of 2 cases.
?
The other functions are only univariate, so this doesn't matter.
But apply is pretty generic, so we don't know what the user wants (though the original implementation passed a single column).
You're correct.
This should make things clear
In [9]: def f(x):
   ...:     print(x)
   ...:     return 0
In [8]: df = pd.DataFrame(np.arange(9).reshape(3, 3))
In [14]: df
Out[14]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
Currently, and the default in the future, this prints out
In [10]: df.rolling(2).apply(f)
[ 0. 3.]
[ 3. 6.]
[ 1. 4.]
[ 4. 7.]
[ 2. 5.]
[ 5. 8.]
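The column-wise behaviour can also be checked without reading printed output, by recording what each call receives. A small sketch, assuming raw=True so that f gets plain ndarrays; seen is just an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
seen = []

def f(x):
    # Record a copy of the 1-D window passed in for a single column.
    seen.append(x.tolist())
    return 0.0

result = df.rolling(2).apply(f, raw=True)
```

Every entry in seen is a length-2 slice of one column; no call ever sees two columns at once.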
With the new implementation and axis=None, the printed output would be
In [10]: df.rolling(2).apply(f, axis=None)
[[0, 1, 2],   # first window; 2x3 array
 [3, 4, 5]]
[[3, 4, 5],   # second window; 2x3 array
 [6, 7, 8]]
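The 2x3 windows above can be produced today with NumPy alone. A sketch assuming NumPy 1.20 or later for sliding_window_view; blocks is an illustrative name, and this only shows what a table-wise f would receive, not how pandas would implement it:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame(np.arange(9).reshape(3, 3))

# sliding_window_view appends the window dimension last, so each item
# has shape (n_columns, window); transpose to get (window, n_columns).
windows = sliding_window_view(df.to_numpy(), window_shape=2, axis=0)
blocks = [w.T for w in windows]

for block in blocks:
    print(block)
```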
@TomAugspurger I know you used axis=None this way in .style, but I personally find this a bit confusing.
I think it's better to follow our current model, IOW
receive a DataFrame df.rolling(...).apply(...)
receive a Series df.rolling(...).column.apply(...)
is very natural. This would be an API change, though even now I think we pass an ndarray.
Another possibility is to have return_type = 'frame', 'series', 'ndarray' (with a default of None, so that we can make this change easier).
I ran into a similar issue with a rolling function that uses OLS internally and needs to return more than one column (eg. the confidence interval).
Would the test cases also cover df.groupby(level=...)['column'].rolling(...).apply(...), and is there a workaround for pre-0.20 versions that would avoid calculating the OLS twice, i.e. once for each returned column?
Regarding the API, I think it should ideally look like this:
import statsmodels.api as sm

def f(narray):
    res = sm.OLS(narray, ...).fit()
    # First row of the confidence-interval table: (lower, upper)
    m_min, m_max = res.conf_int(0.05)[0]
    return m_min, m_max

# Single column
df.groupby(level=...)['column'].rolling(...).apply(lambda x: f(x))

def g(exogen, endogen):
    # statsmodels expects OLS(endog, exog), in that order
    res = sm.OLS(endogen, exogen).fit()
    m_min, m_max = res.conf_int(0.05)[0]
    return m_min, m_max

# Multiple columns
df.groupby(level=...).rolling(...).apply(lambda x: g(x['exogen'], x['endogen']))
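One way to avoid fitting twice, sketched under the assumption that a plain Python loop is acceptable: run the function once per window and unpack its tuple into several output columns. rolling_multi and fit_line are hypothetical helpers, the data is made up, and np.polyfit stands in for the OLS fit and its confidence interval:

```python
import numpy as np
import pandas as pd

def rolling_multi(frame, window, func, names):
    """Hypothetical helper: one func call per window, several outputs."""
    values = np.full((len(frame), len(names)), np.nan)
    for stop in range(window, len(frame) + 1):
        # func gets the whole window as a DataFrame and returns a tuple.
        values[stop - 1] = func(frame.iloc[stop - window:stop])
    return pd.DataFrame(values, index=frame.index, columns=names)

def fit_line(block):
    # Stand-in for the OLS step: (slope, intercept) from a single fit.
    slope, intercept = np.polyfit(
        block["exogen"].to_numpy(), block["endogen"].to_numpy(), 1
    )
    return slope, intercept

x = np.array([0.0, 1.0, 2.0, 4.0, 5.0, 7.0])
frame = pd.DataFrame({"exogen": x, "endogen": 3 * x - 2})
fits = rolling_multi(frame, 3, fit_line, ["slope", "intercept"])
```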
@dbivolaru
> Would the test cases cover also df.groupby(level=...)['column'].rolling(...).apply(...) and is there a workaround for pre-0.20 versions that would prevent re-calculating the OLS twice ie. for each returned column?
This is just an idea. You are welcome to submit a patch for this.
> I think its better to follow our current model, IOW
> receive a DataFrame df.rolling(...).apply(...)
> receive a Series df.rolling(...).column.apply(...)
> is very natural. This would be an API change, though even now I think we pass a ndarray.
I definitely agree with this - it fits well with everything else.
So is the idea here that because apply() currently works column-wise and not dataframe-wise on dataframe.rolling.apply(), we're kinda locked in now and don't want to break backwards compat, and we need a new API? Or are we just waiting for a patch and an opportune moment to release?
> So is the idea here that because apply() currently works column-wise and not dataframe-wise on dataframe.rolling.apply(), we're kinda locked in now and don't want to break backwards compat, and we need a new API?
That's my opinion. We could maybe do this with a deprecation cycle with keywords.
2 thoughts here:
axis keyword; I think we should add a new parameter, as I can see this being a possibility (from Tom's example). Maybe a how=None|'table' argument, with None meaning 1D and 'table' meaning 2D.
# roll tablewise along rows
In [10]: df.rolling(2).apply(f, axis=0, how='table')
[[0, 1, 2],   # first window; 2x3 array
 [3, 4, 5]]
[[3, 4, 5],   # second window; 2x3 array
 [6, 7, 8]]
# roll tablewise along columns
In [10]: df.rolling(2).apply(f, axis=1, how='table')
[[0, 1],   # first window; 3x2 array
 [3, 4],
 [6, 7]]
[[1, 2],   # second window; 3x2 array
 [4, 5],
 [7, 8]]
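To make the two directions concrete, the windows a table-wise f would see can be listed with plain slicing. This is only a sketch of the proposed semantics: table_windows is a hypothetical helper, and how='table' is the suggestion above, not an existing pandas argument.

```python
import numpy as np
import pandas as pd

def table_windows(frame, window, axis=0):
    """Hypothetical helper: list the 2-D blocks seen when rolling along axis."""
    arr = frame.to_numpy()
    n = arr.shape[axis]
    if axis == 0:
        # Roll down the rows: each block is (window x n_columns).
        return [arr[i:i + window, :] for i in range(n - window + 1)]
    # Roll across the columns: each block is (n_rows x window).
    return [arr[:, j:j + window] for j in range(n - window + 1)]

df = pd.DataFrame(np.arange(9).reshape(3, 3))
row_blocks = table_windows(df, 2, axis=0)
col_blocks = table_windows(df, 2, axis=1)
```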
A proposal for the implementation would be:
method='table'|'column' in the rolling/ewm/expanding method, to specify whether we are rolling over a column or the entire object
engine='numba' keyword to be set in the aggregation function (otherwise, the existing Cython aggregation functions need an overhaul)
(apply) the output of table-wise rolling will need to be 1 x number of columns for axis=0 and number of rows x 1 for axis=1
e.g.
df.rolling(2, method='table').apply(f, axis=1, engine='numba')