Pandas: API: Table-wise rolling / expanding / EWM function application

Created on 10 Jan 2017  ·  11 Comments  ·  Source: pandas-dev/pandas

In https://github.com/pandas-dev/pandas/pull/11603#issuecomment-162113949 (the main PR implementing the deferred API for rolling / expanding / ewm), we discussed how to specify table-wise applies. Groupby.apply(f) feeds the entire group (all columns) to f. For backwards compatibility, .rolling(n).apply(f) needed to be column-wise.

https://github.com/pandas-dev/pandas/pull/11603#issuecomment-162116556 mentions a possible API like the one I added for .style:

  • axis=0: apply to each column independently
  • axis=1: apply to each row independently
  • axis=None: apply the supplied function to the entire table

So it'd be df.rolling(n).apply(f, axis=None).
Do people like the axis=0 / 1 / None idiom? Is it obvious enough?

This is prompted by @josef-pkt's post on the mailing list about needing a rolling OLS.

An example:

In [2]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: np.random.seed(0)
   ...: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=["A", "B"])
   ...: df
   ...:
Out[2]:
   A  B
0  5  0
1  3  3
2  7  9
3  3  5
4  2  4
5  7  6
6  8  8
7  1  6
8  7  7
9  8  1

For a concrete example, get the table-wise max (this is equivalent to df.rolling(4).max().max(1)):

In [10]: df.rolling(4).apply(np.max, axis=None)
Out[10]:
0    NaN
1    NaN
2    NaN
3    9.0
4    9.0
5    9.0
6    8.0
7    8.0
8    8.0
9    8.0
dtype: float64
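
The equivalence mentioned above can be checked with the existing column-wise API. This is a minimal sketch; the example frame is written out by hand instead of being seeded, so the values are explicit.

```python
import numpy as np
import pandas as pd

# Same values as the example frame above, written out explicitly.
df = pd.DataFrame({
    "A": [5, 3, 7, 3, 2, 7, 8, 1, 7, 8],
    "B": [0, 3, 9, 5, 4, 6, 8, 6, 7, 1],
})

# Column-wise rolling max, then the max across columns for each row.
table_max = df.rolling(4).max().max(axis=1)
print(table_max.tolist())
```

This shortcut only works because max can be computed per column and then combined. A regression over several columns cannot be split up this way, which is why a table-wise apply is needed.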

A real example is something like a rolling OLS:

import statsmodels.api as sm
f = lambda x: sm.OLS.from_formula('A ~ B', data=x).fit()  # wrong, but w/e

df.rolling(5).apply(f, axis=None)
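
Until something like axis=None exists, the same effect can be approximated by slicing the windows by hand. The sketch below is an assumption-laden illustration, not the proposed API: `rolling_slope` is a hypothetical helper, and `np.polyfit` stands in for statsmodels so the snippet has no extra dependency.

```python
import numpy as np
import pandas as pd

def rolling_slope(df, n, y="A", x="B"):
    """Hypothetical helper: slope of y ~ x over each full window of n rows."""
    out = pd.Series(np.nan, index=df.index, dtype="float64")
    for end in range(n, len(df) + 1):
        window = df.iloc[end - n:end]  # whole-table slice, all columns
        slope, _intercept = np.polyfit(
            window[x].to_numpy(), window[y].to_numpy(), deg=1
        )
        out.iloc[end - 1] = slope      # label the result by the window's last row
    return out

# Made-up data with an exact linear relation, so every slope should be 2.
demo = pd.DataFrame({"B": np.arange(8, dtype=float)})
demo["A"] = 2.0 * demo["B"] + 1.0
slopes = rolling_slope(demo, 5)
print(slopes)
```

The Python-level loop is slow for long frames, but it makes the intended semantics explicit: the function sees every column of the window at once.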

Labels: API Design, Window

All 11 comments

Can you put up a simple example with the various options exercised (e.g. simulate the output)?

Updated with an example.

I also changed the suggested API: Before I had

df.rolling(n, axis=None).apply(f)

But really it should be

df.rolling(n).apply(f, axis=None).

The .rolling(axis=.) parameter controls the direction for rolling. The .rolling(...).apply(f, axis=.) parameter controls the axis for function application.

@TomAugspurger correct me if I am wrong, but what you really want is for .apply to be passed one of two things:

  • a single column (now)
  • the entire table (option)

Is that right? The other functions are only univariate, so this doesn't matter for them.

But apply is pretty generic, so we don't know what the user wants (the original implementation passed a single column).

You're correct.

This should make things clear:

In [9]: def f(x):
   ...:     print(x)
   ...:     return 0

In [8]: df = pd.DataFrame(np.arange(9).reshape(3, 3))

In [14]: df
Out[14]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8


Currently (and by default in the future), this prints

In [10]: df.rolling(2).apply(f)
[ 0.  3.]
[ 3.  6.]
[ 1.  4.]
[ 4.  7.]
[ 2.  5.]
[ 5.  8.]

With the new implementation and axis=None, the printed output would be

In [10]: df.rolling(2).apply(f, axis=None)
[[0, 1, 2],  # first window; 2x3 array
 [3, 4, 5]]
[[3, 4, 5],  # second window; 2x3 array
 [6, 7, 8]]
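
The 2x3 windows described above can be reproduced today by slicing the underlying array. A minimal sketch follows; `table_windows` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def table_windows(df, n):
    """Hypothetical helper: yield each full n-row window as a 2-D ndarray."""
    values = df.to_numpy()
    for end in range(n, len(values) + 1):
        yield values[end - n:end]

df = pd.DataFrame(np.arange(9).reshape(3, 3))
windows = list(table_windows(df, 2))
for w in windows:
    print(w)
```

Each yielded array has shape (2, 3), which is what f would receive under the proposed axis=None.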

@TomAugspurger I know you used axis=None this way in .style, but I personally find this a bit confusing.

I think it's better to follow our current model, IOW

receive a DataFrame: df.rolling(...).apply(...)
receive a Series: df.rolling(...).column.apply(...)

is very natural. This would be an API change, though even now I think we pass an ndarray.

Another possibility is to have return_type = 'frame', 'series', 'ndarray' (with a default of None, so that we can make this change more easily).
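
The DataFrame-versus-Series distinction above can be seen by iterating over a rolling object. This sketch assumes a pandas version that supports iterating over rolling objects (1.1 or later, as far as I know); the column names are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])

# Iteration yields one slice per row, including the shorter leading ones,
# so keep only the full windows.
frame_windows = [w for w in df.rolling(2) if len(w) == 2]
series_windows = [w for w in df["a"].rolling(2) if len(w) == 2]

print(type(frame_windows[0]).__name__)   # type of a whole-table window
print(type(series_windows[0]).__name__)  # type of a single-column window
```

Rolling over the frame hands back whole-table slices, while rolling over a selected column hands back one-dimensional slices, which matches the model suggested above.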

I ran into a similar issue with a rolling function that uses OLS internally and needs to return more than one column (eg. the confidence interval).

Would the test cases also cover df.groupby(level=...)['column'].rolling(...).apply(...), and is there a workaround for pre-0.20 versions that avoids computing the OLS twice, i.e. once for each returned column?

Regarding the API, I think it should look like this:

def f(narray):
    res = sm.OLS(narray, ...).fit()
    m_min, m_max = res.conf_int(0.05)[0]
    return m_min, m_max

# Single column
df.groupby(level=...)['column'].rolling(...).apply(lambda x: f(x))

def g(exogen, endogen):
    # sm.OLS takes the dependent variable (endog) first, then the regressors (exog)
    res = sm.OLS(endogen, exogen).fit()
    m_min, m_max = res.conf_int(0.05)[0]
    return m_min, m_max

# Multiple columns
df.groupby(level=...).rolling(...).apply(lambda x: g(x['exogen'], x['endogen']))
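
As a possible workaround for the "compute once, return several columns" case, the windows can be sliced per group and the function called once per window. This is only a sketch under assumptions: `rolling_multi` is a hypothetical helper, the grouping uses a plain "key" column instead of an index level, and `g` returns a simple min/max pair as a stand-in for the OLS confidence interval.

```python
import numpy as np
import pandas as pd

def rolling_multi(df, n, func, names):
    """Hypothetical helper: call func once per full window, keep all outputs."""
    rows = {}
    for end in range(n, len(df) + 1):
        window = df.iloc[end - n:end]
        rows[df.index[end - 1]] = func(window)  # one call, several values
    return pd.DataFrame.from_dict(rows, orient="index", columns=names)

def g(window):
    # Stand-in for the confidence interval: any function returning a pair.
    spread = window["endogen"] - window["exogen"]
    return spread.min(), spread.max()

df = pd.DataFrame({
    "key": ["x", "x", "x", "y", "y", "y"],
    "exogen": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "endogen": [2.0, 2.5, 5.0, 4.0, 7.0, 6.5],
})

# One result frame per group, stacked under the group key.
parts = {
    key: rolling_multi(grp, 2, g, ["m_min", "m_max"])
    for key, grp in df.groupby("key")
}
result = pd.concat(parts)
print(result)
```

The expensive fit runs once per window, and both returned values land in their own column.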

@dbivolaru

> Would the test cases also cover df.groupby(level=...)['column'].rolling(...).apply(...), and is there a workaround for pre-0.20 versions that avoids computing the OLS twice, i.e. once for each returned column?

This is just an idea. You are welcome to submit a patch for this.

> I think it's better to follow our current model, IOW
> receive a DataFrame: df.rolling(...).apply(...)
> receive a Series: df.rolling(...).column.apply(...)
> is very natural. This would be an API change, though even now I think we pass an ndarray.

I definitely agree with this - it fits well with everything else.

So is the idea here that because apply() currently works column-wise and not dataframe-wise on dataframe.rolling.apply(), we're kinda locked in now and don't want to break backwards compat, and we need a new API? Or are we just waiting for a patch and an opportune moment to release?

> So is the idea here that because apply() currently works column-wise and not dataframe-wise on dataframe.rolling.apply(), we're kinda locked in now and don't want to break backwards compat, and we need a new API?

That's my opinion. We could maybe do this with a deprecation cycle with keywords.

2 thoughts here:

  1. I'm not sure we should stuff this feature into the axis keyword. I think we should add a new parameter, since I can see the following being a possibility (adapting Tom's example). Maybe a how=None|'table' argument, where None means 1D and 'table' means 2D:
# roll tablewise along rows
In [10]: df.rolling(2).apply(f, axis=0, how='table')
[[0, 1, 2],  # first window; 2x3 array
 [3, 4, 5]]
[[3, 4, 5],  # second window; 2x3 array
 [6, 7, 8]]

# roll tablewise along columns
In [10]: df.rolling(2).apply(f, axis=1, how='table')
[[0, 1],  # first window; 3x2 array
 [3, 4],
 [6, 7]]
[[1, 2],  # second window; 3x2 array
 [4, 5],
 [7, 8]]
  2. Implementation-wise, there might be some potential hurdles and complexities to consider:
  • Currently all windowing aggregations are calculated blockwise. This feature would probably need a dedicated code path that does the calculations over the rows/columns (easier if we eventually remove the block manager).
  • Currently, data types other than float or int are dropped. There's a consistency argument for doing the same in table-wise windowing, but that may make table-wise rolling less useful if data is dropped.
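
The two window shapes from the first point (2x3 when rolling along rows, 3x2 when rolling along columns) can be sketched on a plain array. This assumes NumPy 1.20 or later for `sliding_window_view`; it illustrates the shapes only and says nothing about how pandas would implement the code path.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(9).reshape(3, 3)

# Rolling along rows: the raw view has shape (2, 3, 2) because the window
# axis is appended last; swapping the last two axes gives 2x3 windows.
row_windows = sliding_window_view(arr, 2, axis=0).swapaxes(1, 2)

# Rolling along columns: the raw view has shape (3, 2, 2); moving the
# window index to the front gives 3x2 windows.
col_windows = sliding_window_view(arr, 2, axis=1).swapaxes(0, 1)

print(row_windows[0])
print(col_windows[0])
```

Both results are views on the original array, so no window data is copied until a function materializes it.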

A proposal for the implementation would be:

  1. Add a new keyword method='table'|'column' in the rolling/ewm/expanding method to specify whether we are rolling over a column or the entire object
  2. Require the engine='numba' keyword to be set in the aggregation function (otherwise, the existing Cython aggregation functions would need an overhaul)
  3. Table-wise rolling requires a single float dtype
  4. (Mostly important for apply) the output of table-wise rolling will need to be 1 x number of columns for axis=0 and number of rows x 1 for axis=1

e.g.

df.rolling(2, method='table').apply(f, axis=1, engine='numba')
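
The output rule in item 4 (one value per column for each window when rolling along rows) can be emulated in plain Python. The sketch below is hypothetical and covers the axis=0 case only: `table_apply` and `weighted_mean` are illustration names, and no numba is involved. As far as I know, later pandas releases do ship a method='table' option that requires engine='numba', but this snippet does not rely on it.

```python
import numpy as np
import pandas as pd

def table_apply(df, n, func):
    """Hypothetical emulation: func maps an (n, ncols) array to ncols values."""
    values = df.to_numpy(dtype="float64")  # single float dtype (item 3)
    out = np.full(values.shape, np.nan)
    for end in range(n, len(values) + 1):
        row = np.asarray(func(values[end - n:end]), dtype="float64").reshape(-1)
        if row.shape[0] != values.shape[1]:
            raise ValueError("func must return one value per column")
        out[end - 1] = row                 # 1 x ncols result per window (item 4)
    return pd.DataFrame(out, index=df.index, columns=df.columns)

def weighted_mean(window):
    # Weight columns 0 and 1 by column 2; pass the last weight through unchanged.
    weights = window[:, 2]
    means = (window[:, :2] * weights[:, None]).sum(axis=0) / weights.sum()
    return np.append(means, weights[-1])

df = pd.DataFrame([[1, 2, 0.6], [2, 3, 0.4], [3, 4, 0.2], [4, 5, 0.7]])
result = table_apply(df, 2, weighted_mean)
print(result)
```

A weighted mean is a case that column-wise apply cannot express, because each output column depends on the weight column as well as its own values.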