Pandas: R rolling join equivalent in Pandas

Created on 22 Jun 2014 · 7Comments · Source: pandas-dev/pandas

Hi,

In R, there is a way to perform a rolling join as shown below.

library("data.table")
DT <-
data.table(
x = rep(c("a", "b", "c"), each = 3),
y = c(1, 3, 6),
v = 1:9)
setkey(DT, x, y) # Necessary for following self join.
DT[J("a", 4:6)]
DT[J("a", 4:6), roll = TRUE]

The relevant output looks like this (J is shorthand for self join):

DT
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
DT[J("a", 4:6)] # v columns does not have "2".
x y v
1: a 4 NA
2: a 5 NA
3: a 6 3
DT[J("a", 4:6), roll = TRUE] # v column rolls "2" forward.
x y v
1: a 4 2
2: a 5 2
3: a 6 3

In Pandas, the closest that I could think of is to use Join then DataFrame,fillna(method='pad') . But this will not yield the above R rolling join result since the first 2 rows would be NA in the joined column (it starts with (a,3) and not (a,2))

I am just wondering whether there is an equivalent operation in Pandas for this?

Kind regards,
Kris

Reshaping

Source

krismetha

All 7 comments

see https://github.com/pydata/pandas/issues/2996 and #5401

jreback on 22 Jun 2014

Hi,

I have taken a look at both of the above mentioned issues, but it is not really what this question is about.

Below is an equivalent code for Python using Pandas, as my first post was for R

import pandas as pd, numpy as np

df = pd.DataFrame( np.hstack(([['a']]_3 + [['b']]_3 +[['c']]*3 , [[1],[3],[6]] * 3 , [ [i] for i in range(1,10)])))
df = df.set_index([0,1])

df2 = pd.DataFrame( np.hstack(([['a']]*3 , [ [i] for i in range(4,7)] )))
df2 = df2.set_index([0,1])

So my question is whether there is an operator that is equivalent to R join argument roll = True above?

eg, something like df2.join(df, roll=True), and this gives:

the 2 in a,4 and a,5 comes from the fact that the original df.ix[a,3] is 2, which is the index right before df.ix[a,6] . In other words, this is a rolling join - rolling upward.

Looking at this another way, rolling join is similar to using join, then fillna(method='pad') but the slight difference is that the first few rows would still end up being NA, like the above case.

Thanks again,
Kris

krismetha on 27 Jun 2014

closing as stale. pls reopen if still an issue.

jreback on 21 Oct 2015

😕1

Rolling join is _not_ `join` + `fillna`

A rolling join is _not_ the same as a join and a fillna! That would only be the case if the key of the table that is joined against (in terms of data.table that would be the left table and a right-join) has equivalents in the main table. A data.table rolling join does not require this.

Solution in pandas:

There is a solution though in pandas. Let's assume your right data.table is table A and your left data.table is table B.

Sort the table A and and B each by key.
Add a column tag to A which are all 0 and a column tag to B that are all 1.
Delete all columns except the key and tagfrom B (can be omitted, but it is clearer this way) and call the table B'. Keep B as an original - we are going to need it later.
Concatenate A with B' to C and ignore the fact that the rows from B' has many NAs.
Sort C by key.
Make a new cumsum column with C = C.assign(groupNr = np.cumsum(C.tag))
Using filtering (query) on tag get rid of all B'-rows.
Add a running counter column groupNr to the original B (integers from 0 to N-1 or from 1 to N, depending on whether you want forward or backward rolling join).
Join B with C on groupNr.

Make42 on 21 Apr 2017

👍1

R newbie here. @Make42 Could you please provide short python code snippet with all steps or at least clarify step 7 because I am not sure I get it.