Pandas: API? how to do a "rolling groupby" or groupby with overlapping groups?

Created on 20 Jun 2019 · 11 comments · Source: pandas-dev/pandas

Sparked by @betatim's question on twitter: https://twitter.com/betatim/status/1141321049918906368

Suppose you have the following data:

import numpy as np
import pandas as pd

repo_id = np.random.choice(np.arange(1000), 10000)
index = pd.Timestamp("2019-01-01") + pd.to_timedelta(np.random.randint(0, 24*30*6, size=10000), unit='H')
df = pd.DataFrame({'repo_id': repo_id}, index=index).sort_index()
In [111]: df.head() 
Out[111]: 
                     repo_id
2019-01-01 01:00:00      162
2019-01-01 01:00:00      850
2019-01-01 01:00:00      414
2019-01-01 02:00:00      753
2019-01-01 02:00:00      125

Data at somewhat random points in time (here hourly, but the precision doesn't matter much).

Assume now you want to calculate a rolling 30D (monthly) statistic for each day.
For example, the rolling number of unique values over a period of time (elegant but somewhat verbose Python code to do this: https://gist.github.com/betatim/c59039682d92fab89859358e8c585313)
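
For reference, a minimal sketch of that loop-based approach (this is an assumption of what the gist roughly does, not the gist itself): walk over the daily labels and, for each one, slice the trailing 30 days with .loc and count the unique repo ids.

# hedged sketch, not the gist: one nunique per daily label,
# each computed over the trailing 30 days selected with .loc
days = pd.date_range(df.index.min().normalize(), df.index.max().normalize(), freq='D')
monthly_unique = pd.Series(
    [df.loc[day - pd.Timedelta('30D'):day, 'repo_id'].nunique() for day in days],
    index=days,
)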

A rolling operation does not exactly give what you want, because df.rolling() will calculate this 30D statistic for each row, while there might be many rows for a single day, and you are only interested in 1 value for that day.
A groupby/resample operation also cannot achieve this goal: it could give the monthly number of unique values, but it cannot shift this month period one day at a time, since the groups cannot be overlapping.

With rolling you can do something like df.rolling("30D").nunique().resample('D').last(). But this is 1) very slow, as the rolling in the first step is doing way too many calculations that you afterwards throw away with resample().last(), and 2) I am not sure the binning for timestamps within the same day happens fully as desired.

I find this an interesting use case to think about: are there ways to handle this better in pandas, and what could such an API look like?

Labels: Enhancement, Usage Question, Window

All 11 comments

I might be missing the point, as I was having a hard time running the code samples as-is, but wouldn't this do it for you?

df.groupby(pd.Grouper(freq='1D')).count().rolling('30D').sum()

If you get the number of daily transactions first before doing the rolling op, it should be much more efficient.

I think something like that was suggested on twitter as well, but that does not take the unique values over a 30D period, only for a single day.
You cannot sum unique daily values to get unique monthly values (it's not just the number of "transactions", but the number of unique transactions; if it were only the count, then yes, it would have been additive).
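
A tiny illustration with toy data (assumed here just for this example) of why the daily counts are not additive:

s = pd.Series([1, 1, 2], index=pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-02']))
s.groupby(pd.Grouper(freq='D')).nunique().sum()   # 3: repo 1 is counted on both days
s.nunique()                                       # 2: the true number of unique repos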

So would #26958 solve this then? I think this will always be two steps (i.e. perform the rolling calc then resample / reshape as desired) from an API perspective

So would #26958 solve this then?

No, you can already do a "nunique" operation at the moment, but you need to type the verbose .apply(pd.Series.nunique, raw=False) for that. So I just opened that issue to make the shortcut rolling().nunique() also work.

I think this will always be two steps (i.e. perform the rolling calc then resample / reshape as desired) from an API perspective

With the current API, yes, you need two steps. But that's also why I opened this issue, because doing it in two steps is very non-performant. The possible solution I mentioned above like:

df.rolling("30D")['repo_id'].apply(lambda x: len(pd.unique(x)), raw=True).resample('D').last()
# df.rolling("30D").nunique().resample('D').last()  # after #26958 

is kind of working, but is much slower than the function written by @betatim in https://gist.github.com/betatim/c59039682d92fab89859358e8c585313 (which just uses a Python for loop over the groups that get selected with .loc and combines the results in a dataframe).
This is because the rolling step is doing way too many calculations that you afterwards throw away with resample().last().

The possible solution I mentioned above like:

df.rolling("30D")['repo_id'].apply(lambda x: len(pd.unique(x)), raw=True).resample('D').last()

is kind of working, but is _much_ slower than the function written by @betatim in

If so, wouldn't the bottleneck there just be the apply op? If that is implemented, I don't think performance would be a concern.

From an API perspective I can't imagine right now that rolling should do any kind of resampling or reduction on its own.

Another way to put it: it's a rolling operation, but instead of doing it row-by-row (calculating the statistic while moving the window one row at a time), your window shifts in bigger jumps.
Only in this case, the jumps are not a regular number of rows, but related to a datetime frequency.

Another way to put it: it's a rolling operation, but instead of doing it row-by-row (calculating the statistic while moving the window one row at a time), your window shifts in bigger jumps.
Only in this case, the jumps are not a regular number of rows, but related to a datetime frequency.

I still don't totally see how this should fit into the API design of rolling; seems like it should still be two separate ops to me to roll and group, and in that case I think your other PR regarding nunique to get good performance covers the need here.

If I'm mistaken though can you maybe provide a smaller reproducible example that highlights the expected values?

I've thought about this a bit more and I think the current API isn't just limited, it's wrong.
By my reasoning, when you ask for rolling(3), you're asking for a window of 3 _rows_ and the engine rolls one _row_ at a time, so by analogy, when you ask for rolling("3D"), since you're asking for a 3 _day_ window, the engine should roll 1 _day_ at a time.
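
To make the current behaviour concrete, a small toy example (assumed data, not from the thread): both the row-based and the time-based window emit one result per input row.

demo = pd.DataFrame(
    {'x': [1, 2, 3, 4, 5, 6]},
    index=pd.to_datetime(['2019-01-01 01:00', '2019-01-01 02:00',
                          '2019-01-02 01:00', '2019-01-02 02:00',
                          '2019-01-03 01:00', '2019-01-03 02:00']),
)
demo.rolling(3).sum()     # 6 rows out: window measured in rows, stepped one row at a time
demo.rolling('3D').sum()  # still 6 rows out: window measured in time, still stepped one row at a time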

Is there a precedent for this in other applications? Taking Postgres as an example by their definition

A window function performs a calculation across a set of table rows that are somehow related to the current row

https://www.postgresql.org/docs/current/tutorial-window.html

So curious to see what everyone else is doing.

Going back to the OP I'm still not entirely clear as to what is driving the larger discussion here. The main problem there seemed to be performance which is totally unrelated to the API design of rolling.

Why do we want to cram both the rolling and the reshaping into one operation? By definition, rolling will return something that matches the size of the calling object, so with your proposals I _think_ we'd either be breaking that assumption or essentially broadcasting the result of those rolling operations across a frequency window. In either case I don't see why we wouldn't want to leave the latter operation to a resample or a transformation.

Previous enhancement requests asking for a stride argument to rolling: #15354, #22976, https://github.com/pandas-dev/pandas/issues/27654#issue-474416717, https://github.com/dask/dask/issues/4659, https://github.com/numpy/numpy/issues/7753

Yup, I also encountered the same missing functionality. My proposed solution in #28302 also addresses this issue.

Currently, pd.Grouper ignores _loffset_; it uses _base_ for the split-boundary offset.

Ideally, there need to be 3 parameters: loffset (label_offset, default=0, the starting boundary offset of every group label), gstart (group_start, default=loffset, the starting offset of every group's data) and gspan (group_span, default=1, the span of every group's data).

For example, suppose we want to split time-series data into daily groups running from 9:00am each day to 9:00am the next day, while keeping the group label at 0am (so that the datetime object reduces to a date, i.e., 2020-05-26 00:00:00 => 2020-05-26). Suppose further that within each group we want the past 2 days of data (from 9am two days before that day until 9am on that day) in addition to that day's data (from 9am on that day until 9am the next day); in other words, there would be duplicate entries among adjacent groups and the total number of data rows would be tripled. Then we would call pd.Grouper(freq='D', loffset=0, gstart=9.0/24-2, gspan=3). This would give rise to:
[('2020-01-01T00:00:00', <data from 2019-12-30T09:00:00 to 2020-01-02T09:00:00>), ('2020-01-02T00:00:00', <data from 2019-12-31T09:00:00 to 2020-01-03T09:00:00>), ('2020-01-03T00:00:00', <data from 2020-01-01T09:00:00 to 2020-01-04T09:00:00>), ('2020-01-04T00:00:00', <data from 2020-01-02T09:00:00 to 2020-01-05T09:00:00>), ...]
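
For comparison, a hedged emulation of that grouping with today's API (the loffset/gstart/gspan parameters above are only proposed; ts and the label range here are assumptions for illustration, with ts being any frame with a sorted DatetimeIndex):

labels = pd.date_range('2020-01-01', '2020-01-04', freq='D')
overlapping_groups = {
    label: ts.loc[label + pd.Timedelta(hours=9) - pd.Timedelta(days=2):
                  label + pd.Timedelta(hours=9) + pd.Timedelta(days=1)]
    for label in labels
}
# each value spans 3 days of data (9am two days before the label until 9am the day after),
# so adjacent groups deliberately share rows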

Thanks for sharing your insights. Any ideas on how to get the workaround working for the following scenario: https://stackoverflow.com/questions/62351499/how-to-group-pandas-dataframe-by-date-with-overlapping-sliding-window
