Pandas: Methods min and max give NaN in time-aware rolling window even if min_periods=1

Created on 5 Apr 2017 · 16 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

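import pandas as pd

df = pd.DataFrame({'a': [None, 2, 3]}, index=pd.to_datetime(['20170403', '20170404', '20170405']))

# sum ignores the NaN inside the window, as expected
df.rolling('3d', min_periods=1)['a'].sum()

# min and max return NaN whenever the window contains a NaN
df.rolling('3d', min_periods=1)['a'].min()
df.rolling('3d', min_periods=1)['a'].max()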

Problem description

Even with min_periods=1, the functions min and max return NaN whenever the time-aware rolling window contains a NaN value.

However, there is no bug when the window width is fixed (not a time period):

In [397]: df.rolling(3, min_periods=1)['a'].min()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

Expected Output

The expected output, analogous to the one given by the function sum, should be a non-NaN value if there is at least one non-NaN value inside the rolling window.

In [397]: df.rolling('3d', min_periods=1)['a'].min()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

In [397]: df.rolling('3d', min_periods=1)['a'].max()
Out[397]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    3.0
Name: a, dtype: float64

Output of `pd.show_versions()`

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.29.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8.1
boto: 2.45.0
pandas_datareader: None

Bug Reshaping Timeseries Window

Source

albertvillanova

All 16 comments

min_periods : int, default None
    Minimum number of observations in window required to have a value
    (otherwise result is NA). For a window that is specified by an offset,
    this will default to 1.

You are specifying a window by an offset, so what exactly would min_periods=1 mean?

It is essentially not implemented. I guess the docs could be better.

cc @chrisaycock

jreback on 5 Apr 2017

I think you actually want something like min_count (similar to https://github.com/pandas-dev/pandas/issues/11167).

Or min_periods could take an offset (e.g. 1s), but again, what would that actually mean?

jreback on 5 Apr 2017

The meaning of min_periods, independently of the type of window (either of fixed width indicated by an integer, or temporal width indicated by an offset), is the minimum number of non-NaN values that must exist inside the window in order to perform the function evaluation ignoring the other NaNs inside the window; otherwise, return NaN.

Note that min_periods works fine with an offset for the other functions, like sum:

In [403]: df.rolling('3d', min_periods=1)['a'].sum()
Out[403]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    5.0
Name: a, dtype: float64

In [404]: df.rolling('3d', min_periods=2)['a'].sum()
Out[404]:
2017-04-03    NaN
2017-04-04    NaN
2017-04-05    5.0
Name: a, dtype: float64

In [405]: df.rolling('3d', min_periods=3)['a'].sum()
Out[405]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64
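
To make that definition concrete, here is a minimal pure-Python sketch of the semantics I expect. It is not how pandas implements rolling windows; the helper name rolling_reduce is hypothetical, and it assumes the offset window is right-closed, i.e. (t - window, t].

```python
import math
from datetime import datetime, timedelta


def rolling_reduce(times, values, window, min_periods, func):
    """Hypothetical reference: reduce the non-NaN values in (t - window, t]."""
    out = []
    for t in times:
        # Keep only observations inside the window that are not NaN.
        observed = [
            v for s, v in zip(times, values)
            if t - window < s <= t and not math.isnan(v)
        ]
        # min_periods counts non-NaN observations, not rows.
        if len(observed) >= min_periods:
            out.append(func(observed))
        else:
            out.append(math.nan)
    return out


times = [datetime(2017, 4, 3), datetime(2017, 4, 4), datetime(2017, 4, 5)]
values = [math.nan, 2.0, 3.0]
window = timedelta(days=3)

mins = rolling_reduce(times, values, window, 1, min)
maxs = rolling_reduce(times, values, window, 1, max)
sums_mp2 = rolling_reduce(times, values, window, 2, sum)

print(mins)      # [nan, 2.0, 2.0]
print(maxs)      # [nan, 2.0, 3.0]
print(sums_mp2)  # [nan, nan, 5.0]
```

Under this reading, min and max should behave like sum: the NaN on 2017-04-03 is skipped rather than propagated.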

albertvillanova on 5 Apr 2017

I have some questions of my own. pandas by default excludes NaN and numpy includes it:

In [35]: df.a.min()
Out[35]: 2.0

In [36]: df.a.values.min()
Out[36]: nan

But then for some reason, calling numpy as a stand-alone function excludes the NaN, which seems to contradict their docs:

In [37]: np.min(df.a)
Out[37]: 2.0

And if I try their version that explicitly excludes NaN, I get back a Series instead of a scalar!

In [38]: np.nanmin(df.a)
Out[38]:
2017-04-03    2.0
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

So it seems there are lots of unexpected results here.

chrisaycock on 5 Apr 2017

@chrisaycock Concerning your first question,

Forgetting offsets for a moment, why does min_periods cause this to have a different value?

In [23]: df.rolling(3)['a'].min()
Out[23]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [24]: df.rolling(3, min_periods=1)['a'].min()
Out[24]:
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

for a fixed-width rolling window (specified by an integer), the default value of the parameter min_periods is the width of the window.

These are equivalent:

In [406]: df.rolling(3)['a'].min()
Out[406]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [407]: df.rolling(3, min_periods=3)['a'].min()
Out[407]:
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

albertvillanova on 5 Apr 2017

@chrisaycock For the other questions, you are passing a Pandas Series as an argument to Numpy functions, which expect an array or an ndarray.

If you use the Pandas Series attribute .values, you get a Numpy ndarray and Numpy functions give the expected results:

In [23]: np.min(df.a.values)
Out[23]: nan

In [24]: np.nanmin(df.a.values)
Out[24]: 2.0

Nevertheless, I think this is a digression with respect to the original issue: Pandas min and max functions (contrary to sum and others) do not give the expected output when there is a NaN within a time-aware (specified by a time offset) rolling window.

albertvillanova on 6 Apr 2017

Oh, so this works for the other numeric functions, just not for min and max with an offset?

If that is the case, it is a bug.

jreback on 6 Apr 2017

@jreback This is the output for other functions:

In [30]: df.rolling('3d')['a'].sum()
Out[30]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    5.0
Name: a, dtype: float64

In [31]: df.rolling('3d')['a'].count()
Out[31]: 
2017-04-03    NaN
2017-04-04    1.0
2017-04-05    2.0
Name: a, dtype: float64

In [32]: df.rolling('3d')['a'].mean()
Out[32]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.5
Name: a, dtype: float64

In [33]: df.rolling('3d')['a'].median()
Out[33]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.5
Name: a, dtype: float64

whereas this is the output for the functions min and max:

In [34]: df.rolling('3d')['a'].min()
Out[34]: 
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

In [35]: df.rolling('3d')['a'].max()
Out[35]: 
2017-04-03   NaN
2017-04-04   NaN
2017-04-05   NaN
Name: a, dtype: float64

albertvillanova on 6 Apr 2017

This may not be caused by min_periods but rather by the min/max implementation, since doing it with apply gives the expected result:

In [14]: df.rolling('3d', min_periods=1)['a'].apply(lambda x: np.nanmin(x))
Out[14]: 
2017-04-03    NaN
2017-04-04    2.0
2017-04-05    2.0
Name: a, dtype: float64

But I agree with @albertvillanova, this is certainly a bug.
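
Until the built-in methods are fixed, the apply approach can serve as a workaround for both min and max. The following is only a sketch: the raw=True argument is an assumption that holds for newer pandas releases (0.23 and later) and does not exist in the 0.19.2 version reported above, where the lambda form shown earlier should be used instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [None, 2, 3]},
    index=pd.to_datetime(['20170403', '20170404', '20170405']),
)

roller = df.rolling('3d', min_periods=1)['a']

# raw=True passes each window to the function as a NumPy array
# (assumption: pandas >= 0.23; older versions have no `raw` argument).
fallback_min = roller.apply(np.nanmin, raw=True)
fallback_max = roller.apply(np.nanmax, raw=True)

print(fallback_min)
print(fallback_max)
```

With min_periods=1, the first window has no valid observation, so the function is never called on an all-NaN window and the first row stays NaN. Expect this to be slower than the built-in min and max, since it calls a Python function for every window.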

@chrisaycock

But then for some reason, calling numpy as a stand-alone function excludes the NaN

That is because if you do np.min(series), under the hood it will check if the series object has a min method, and use that. So that actually uses series.min(), hence the confusing result if you expected numpy nan semantics.
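
A small sketch of that difference, assuming a reasonably recent NumPy and pandas (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, 3.0])

# Passing the Series: NumPy defers to s.min(), which skips NaN.
via_series = np.min(s)

# Passing the underlying ndarray: plain NumPy reduction, NaN propagates.
via_array = np.min(s.values)

# Explicit NaN-skipping reduction on the ndarray.
via_nanmin = np.nanmin(s.values)

print(via_series)  # 2.0
print(via_array)   # nan
print(via_nanmin)  # 2.0
```

So the result depends on whether NumPy sees a Series or an ndarray, not on which NumPy function is called.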