Pandas: Center rolling window with date NotImplementedError

Created on 6 Mar 2018 · 12 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

idx = pd.date_range(start="2010-01-01", end="2010-01-30", freq="1D")
df = pd.DataFrame(index=idx, data=np.sin(np.linspace(0,2*np.pi,len(idx))))

# works well
df.rolling(5, center=True, min_periods=5)

# fails with
# NotImplementedError: center is not implemented for datetimelike and offset based windows
df.rolling("5D", center=True, min_periods=5)

Complete traceback:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-54-058f86b11afc> in <module>()
----> 1 df.rolling("5D", center=True, min_periods=5)

~/.local/lib/python3.5/site-packages/pandas/core/generic.py in rolling(self, window, min_periods, freq, center, win_type, on, axis, closed)
   7063                                    min_periods=min_periods, freq=freq,
   7064                                    center=center, win_type=win_type,
-> 7065                                    on=on, axis=axis, closed=closed)
   7066 
   7067         cls.rolling = rolling

~/.local/lib/python3.5/site-packages/pandas/core/window.py in rolling(obj, win_type, **kwds)
   2064         return Window(obj, win_type=win_type, **kwds)
   2065 
-> 2066     return Rolling(obj, **kwds)
   2067 
   2068 

~/.local/lib/python3.5/site-packages/pandas/core/window.py in __init__(self, obj, window, min_periods, freq, center, win_type, axis, on, closed, **kwargs)
     84         self.win_freq = None
     85         self.axis = obj._get_axis_number(axis) if axis is not None else None
---> 86         self.validate()
     87 
     88     @property

~/.local/lib/python3.5/site-packages/pandas/core/window.py in validate(self)
   1104             # we don't allow center
   1105             if self.center:
-> 1106                 raise NotImplementedError("center is not implemented "
   1107                                           "for datetimelike and offset "
   1108                                           "based windows")

NotImplementedError: center is not implemented for datetimelike and offset based windows

Problem description

The rolling window cannot be centered with a datetimelike index, which is a problem when dealing with time series.

I think this is a work-in-progress feature; I don't see anything blocking it.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.0.6
pip: 9.0.1
setuptools: 38.2.3
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.2.2
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Enhancement Resample

Most helpful comment

@jreback I can also state that having a centered representation is really useful and important for time series analysis. In the field of engineering it is used frequently (you usually think in terms of a centered window when using rolling windows...), resulting in two cases:

  • Case 1: most people don't really know what they are doing with their data, so they just use the rolling window and expect it to be centered.
  • Case 2: those who do know what they are doing will have to shift the data, which is quite cumbersome, especially with unevenly spaced time series.

Just one example: if you want to calculate the average power output as a rolling statistic over window='1d', it is not straightforward when the value for the day (from 00:00:00 to 23:59:59) is set at 23:59:59.

Currently shifting (and resampling for unevenly spaced time series) is the only possibility.

Edit: Another important point: when rolling with min_periods < window, the last values of the rolling data are lost when shifting, since non-centered rolling assigns the value of the last window to the last index, even though there is still more data than min_periods. This can't be intentional behaviour, since it is a loss of information.
The only workaround is replacing the datetime index with an integer index and using a count-based rolling window. But that makes it impossible to use a window length which is not a multiple of the sampling frequency.
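For evenly spaced data, the shift workaround described above can be sketched like this (a minimal sketch; the series and values are illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Illustrative evenly spaced daily series.
idx = pd.date_range("2010-01-01", "2010-01-10", freq="D")
s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4], index=idx, dtype=float)

# Trailing 5-day mean, then relabel each result two days earlier
# so the window is effectively centered on its label.
centered = s.rolling("5D").mean().shift(-2, freq="D")

# For evenly spaced data this matches the count-based centered window
# (away from the edges, where min_periods behaviour differs).
reference = s.rolling(5, center=True).mean()
```

As soon as the spacing is irregular, the fixed `shift(-2, freq="D")` relabels windows by the wrong amount, which is exactly the failure mode discussed in this thread.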

All 12 comments

what would a centered value actually do here? is this actually useful?

My usecase:

I work with time series acquired by a field sampler. The sampling rate is approximately even, with some variation (instrument failures, etc.). The signal may be noisy, so I want to apply a rolling mean in order to see the general tendency.

Averaging based on a fixed number of points is not a good idea, as those points can represent a variable time interval. Instead, I want the rolling mean to account for, let's say, the past 2.5 days to the next 2.5 days.

One solution may be to do a rolling mean without centering and then shift it, but that only works with regularly spaced data. See for instance the following DataFrame with/without the "2010-01-07" day:

In[89]: idx = pd.date_range(start="2010-01-01", end="2010-01-10", freq="1D")
In[90]: df = pd.DataFrame(index=idx, data=[1,1,1,2,2,2,3,3,3,4])
In[91]: df
Out[91]:
            0
2010-01-01  1
2010-01-02  1
2010-01-03  1
2010-01-04  2
2010-01-05  2
2010-01-06  2
2010-01-07  3
2010-01-08  3
2010-01-09  3
2010-01-10  4
In [92]: df.rolling(window=5, center=True).mean()
Out[92]: 
              0
2010-01-01  NaN
2010-01-02  NaN
2010-01-03  1.4
2010-01-04  1.6
2010-01-05  2.0
2010-01-06  2.4
2010-01-07  2.6
2010-01-08  3.0
2010-01-09  NaN
2010-01-10  NaN

In [93]: df.rolling(window="5D", center=False).mean()
Out[93]: 
               0
2010-01-01  1.00
2010-01-02  1.00
2010-01-03  1.00
2010-01-04  1.25
2010-01-05  1.40
2010-01-06  1.60
2010-01-07  2.00
2010-01-08  2.40
2010-01-09  2.60
2010-01-10  3.00
In [94]: # The rolling mean centered with number of sample works great
In [95]: # The rolling mean with date window does something ok. I just don't want the first 5 values
In [96]: # but I just need a shift, I can handle that
In [97]: # Now, let's remove the 2010-01-07, which is something I have in my dataset
In [98]: dfdrop = df.loc[idx.delete(6)]
In [99]: dfdrop
Out[99]: 
            0
2010-01-01  1
2010-01-02  1
2010-01-03  1
2010-01-04  2
2010-01-05  2
2010-01-06  2
2010-01-08  3
2010-01-09  3
2010-01-10  4
In [100]: dfdrop.rolling(window=5, center=True).mean()
Out[100]: 
              0
2010-01-01  NaN
2010-01-02  NaN
2010-01-03  1.4
2010-01-04  1.6
2010-01-05  2.0
2010-01-06  2.4
2010-01-08  2.8
2010-01-09  NaN
2010-01-10  NaN
In [101]: dfdrop.rolling(window="5D", center=False).mean()
Out[101]: 
               0
2010-01-01  1.00
2010-01-02  1.00
2010-01-03  1.00
2010-01-04  1.25
2010-01-05  1.40
2010-01-06  1.60
2010-01-08  2.25
2010-01-09  2.50
2010-01-10  3.00
In [102]: # And the shift trick is no longer an option...

From my point of view, I would expect, for each date of my dataset, a mean that takes into account the values from the 2.5 days before and after the current date.
I would like to have something like:

In [101]: dfdrop.rolling(window="5D", center=True).mean()
Out[101]: 
               0
2010-01-01  NaN
2010-01-02  NaN
2010-01-03  1.40 #mean(1,1,1,2,2)
2010-01-04  1.60 #mean(1,1,2,2,2)
2010-01-05  1.75 #mean(1,2,2,2) 
2010-01-06  2.25 #mean(2,2,2,3)
2010-01-08  3.00 #mean(2,3,3,4)
2010-01-09  NaN 
2010-01-10  NaN

With time series data, being quite explicit about shifting in time is quite important. The centering is pretty ad hoc in its implementation, so this would require quite a bit of effort for little gain. If you want to try, a PR would be ok.

I have also run across a use case for this. Essentially, it would be used for computing a rolling statistic for "n periods before and after" each observation on a rolling basis.

Note that this trick will not work if the data is of staggered frequency, i.e. if the data is minute-frequency but there are missing minutes:

data.rolling(window='15D', closed='left').statistic().shift(-1296000 // 2, freq='s')  # half of 15 days = 648,000 s; .statistic() stands in for any rolling method

A barebones implementation might look something like this. (Please excuse my Cython.)

from cython cimport Py_ssize_t

import numpy as np
cimport numpy as cnp
from numpy cimport ndarray, double_t, int64_t
cnp.import_array()


cpdef ndarray rolling_centered_offset(ndarray[double_t, ndim=1] data,
                                      ndarray[int64_t, ndim=1] index,
                                      int64_t td,
                                      f):
    """Rolling function with a time-based offset & centered window.

    Parameters
    ----------
    data: 1D ndarray of floats
        Array for which function will be computed.
    index: 1D ndarray of date-ints
        Dates that align with the data where windows are computed.
        If you have a Pandas DatetimeIndex, use `index.asi8`
    td: int64
        The half-window offset in either direction, in the same integer
        units as `index`. For a datetime64[ns] Pandas Index viewed as i8
        and a half-window of 15 days, you would need to pass
        `int(np.timedelta64(15, 'D') / np.timedelta64(1, 'ns'))`
    f:
        The function/callable statistic, i.e. np.nanmean

    Returns
    -------
    out: ndarray
    """

    cdef:
        Py_ssize_t i, j, k, n
        ndarray[double_t] out
        ndarray[int64_t] starts, ends

    n = data.shape[0]
    out = np.empty(n, dtype=float)

    starts = index.searchsorted(index - td)
    ends = index.searchsorted(index + td)

    for i, (j, k) in enumerate(zip(starts, ends)):
        out[i] = f(data[j:k])

    return out

Example:

import pandas as pd
import numpy as np

np.random.seed(444)
size = 1938

def random_dates(start, end, n):
    # Thanks https://stackoverflow.com/a/50668285/7954504
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n)).view('M8[ns]'))

start = pd.Timestamp('2014-05-16 13:53:50+0000', tz='UTC')
end = pd.Timestamp('2018-07-31 10:40:47+0000', tz='UTC')

mock = pd.Series(np.random.randint(0, 10, size=size),
                 index=random_dates(start, end, n=size)).sort_index()

from rolling import rolling_centered_offset

td = 1296000000000000

rolling_centered_offset(data=mock.values.astype(float), index=mock.index.asi8, td=td, f=np.nanmean)
array([4.75  , 4.75  , 4.7143, 4.4348, 4.2917, 4.2143, 4.2143, 4.069 ,
       4.069 , 4.2812, ..., 4.6667, 4.6897, 4.875 , 5.0435, 5.0435,
       4.8636, 5.05  , 5.    , 5.    , 4.8889])

One other note about actually reindexing: that is probably a dangerous route as it blows up RAM pretty easily.
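For reference, the same searchsorted idea can be written in plain NumPy/pandas without Cython. This is a sketch: the function name and `half_window` parameter are my own, the boundary handling (inclusive on both sides) is a choice, and min_periods is deliberately not implemented:

```python
import numpy as np
import pandas as pd

def centered_time_mean(series, half_window):
    """Mean over [t - half_window, t + half_window] for each timestamp t.

    Pure-NumPy sketch of the searchsorted approach; half_window is a
    pd.Timedelta. No min_periods handling.
    """
    t = series.index.values.view("i8")     # timestamps as int64 nanoseconds
    half = int(half_window.value)          # Timedelta as int64 nanoseconds
    starts = t.searchsorted(t - half)                 # first point >= t - half
    ends = t.searchsorted(t + half, side="right")     # one past last point <= t + half
    vals = series.to_numpy(dtype=float)
    out = [np.nanmean(vals[j:k]) for j, k in zip(starts, ends)]
    return pd.Series(out, index=series.index)

# The unevenly spaced example from earlier in the thread:
idx = pd.date_range("2010-01-01", "2010-01-10", freq="D").delete(6)
s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 4], index=idx)
result = centered_time_mean(s, pd.Timedelta("2.5D"))
```

On the dropped-day example this reproduces the desired centered values (e.g. 1.75 at 2010-01-05 and 3.0 at 2010-01-08), at the cost of a Python-level loop.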


I'm taking a look into this right now. My thought was that in the cython libwindow, the roll_<func> functions should be modified to take left_win and right_win arguments (instead of the current win argument). These numbers would be computed by the _offset function. As a bonus, this would allow most of the if center statements to be removed in window.py.

In the cython code, each of the roll functions would pass left_win and right_win on to get_window_indexer, which would then pass them on to each of the WindowIndexer subclasses. The implementation for the static window indexers would be trivial. The implementation for the variable window indexer wouldn't be too much harder (and would basically just mirror the code that's already there for the left endpoint).

Does anyone have any thoughts on this strategy? On one hand it seems inefficient to have to change the function signatures of each of the cython roll functions individually, but it's not clear to me if there's a slicker way to do that.

Needing this feature for an engineering use case.

I went looking to ensure I had a centered window and got the error message that led me here.
I can see that a centered value is not straightforward for unevenly spaced data.

It is also not clear from the documentation whether the labels are by default placed at the start or at the end of the window. I can see an argument that, since end-of-window labelling is the only one that makes sense for causal systems/filters, it is the only one to implement, but plenty of real-life systems and filters are in fact non-causal...

Any updates on this?
I'm also having an engineering use case: time-series analysis of sensor data. Sensors yield a data point per second, but a few data points will be missing (for various reasons, like bad values or transmission errors).

@RadMagnus pandas is all-volunteer;
if you want something, a PR is likely the best bet

note this is pretty straightforward nowadays by defining a custom indexer
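A sketch of what such a custom indexer could look like (assuming pandas >= 1.0, where `pandas.api.indexers.BaseIndexer` is public; the class name and the `half_ns` attribute are my own, not a pandas API):

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class CenteredTimeIndexer(BaseIndexer):
    # Extra keyword arguments to BaseIndexer become attributes, so the
    # constructor is called with index_array (int64 ns) and half_ns.
    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        t = self.index_array
        start = t.searchsorted(t - self.half_ns)                 # first point >= t - half
        end = t.searchsorted(t + self.half_ns, side="right")     # one past last point <= t + half
        return start.astype(np.int64), end.astype(np.int64)

# Unevenly spaced example: daily data with 2010-01-07 missing.
idx = pd.date_range("2010-01-01", "2010-01-10", freq="D").delete(6)
s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 4], index=idx, dtype=float)

indexer = CenteredTimeIndexer(index_array=idx.values.view("i8"),
                              half_ns=int(pd.Timedelta("2.5D").value))
out = s.rolling(indexer, min_periods=1).mean()
```

Newer pandas releases (1.3+, if I recall correctly) also accept `center=True` for offset-based windows directly, so the indexer is mainly useful on older versions or for non-standard boundary rules.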

@jreback that is very flattering, however I am by no means qualified (yet?). I'll come back here when I have the necessary skill set.
