import pandas as pd

df = pd.DataFrame({'a': [None, 2, 3]}, index=pd.to_datetime(['20170403', '20170404', '20170405']))
df.rolling('3d', min_periods=1)['a'].sum()
df.rolling('3d', min_periods=1)['a'].min()
df.rolling('3d', min_periods=1)['a'].max()
Even with min_periods=1, the functions min and max return NaN whenever the time-aware rolling window contains a NaN value.
However, there is no bug when the window width is fixed (not a time period):
In [397]: df.rolling(3, min_periods=1)['a'].min()
Out[397]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 2.0
Name: a, dtype: float64
The expected output, analogous to the one given by the function sum, should be a non-NaN value if there is at least one non-NaN value inside the rolling window.
In [397]: df.rolling('3d', min_periods=1)['a'].min()
Out[397]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 2.0
Name: a, dtype: float64
In [397]: df.rolling('3d', min_periods=1)['a'].max()
Out[397]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 3.0
Name: a, dtype: float64
pd.show_versions()
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.29.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8.1
boto: 2.45.0
pandas_datareader: None
min_periods : int, default None
Minimum number of observations in window required to have a value
(otherwise result is NA). For a window that is specified by an offset,
this will default to 1.
You are specifying a window by an offset, so what exactly would min_periods=1 mean?
It is essentially not implemented. I guess the docs could be better.
cc @chrisaycock
I think you actually want something like min_count (similar to https://github.com/pandas-dev/pandas/issues/11167).
Or min_periods could take an offset (e.g. 1s), but again, what would that actually mean?
The meaning of min_periods is the same regardless of the type of window (fixed width given by an integer, or temporal width given by an offset): it is the minimum number of non-NaN values that must exist inside the window for the function to be evaluated. When that minimum is met, the function is evaluated ignoring the NaNs inside the window; otherwise, the result is NaN.
Note that min_periods works fine with an offset for the other functions, like sum:
In [403]: df.rolling('3d', min_periods=1)['a'].sum()
Out[403]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 5.0
Name: a, dtype: float64
In [404]: df.rolling('3d', min_periods=2)['a'].sum()
Out[404]:
2017-04-03 NaN
2017-04-04 NaN
2017-04-05 5.0
Name: a, dtype: float64
In [405]: df.rolling('3d', min_periods=3)['a'].sum()
Out[405]:
2017-04-03 NaN
2017-04-04 NaN
2017-04-05 NaN
Name: a, dtype: float64
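The definition above (min_periods as the minimum number of non-NaN values required in the window) can be written out as a slow reference implementation. This is only a sketch: the helper name rolling_nan_aware is hypothetical, the window is assumed to be closed on the right only, (t - width, t], and it does not reflect how pandas implements rolling reductions internally.

```python
import numpy as np
import pandas as pd


def rolling_nan_aware(series, width, min_periods, reducer):
    """Hypothetical reference: time-aware rolling reduction that skips NaN."""
    out = []
    for t in series.index:
        # Assumed window: (t - width, t], closed on the right only.
        in_window = (series.index > t - width) & (series.index <= t)
        valid = series[in_window].dropna()
        # Evaluate only if enough non-NaN observations are present.
        if len(valid) >= min_periods:
            out.append(reducer(valid.to_numpy()))
        else:
            out.append(np.nan)
    return pd.Series(out, index=series.index, name=series.name)


index = pd.to_datetime(['2017-04-03', '2017-04-04', '2017-04-05'])
s = pd.Series([np.nan, 2.0, 3.0], index=index, name='a')
three_days = pd.Timedelta(days=3)

rolling_min = rolling_nan_aware(s, three_days, 1, np.min)
rolling_max = rolling_nan_aware(s, three_days, 1, np.max)
rolling_sum_2 = rolling_nan_aware(s, three_days, 2, np.sum)
print(rolling_min, rolling_max, rolling_sum_2, sep='\n')
```

Under these assumptions, min_periods=1 gives NaN, 2.0, 2.0 for min and NaN, 2.0, 3.0 for max, which is the output the report expects, and the min_periods=2 sum matches Out[404].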
I have some questions of my own. pandas by default excludes NaN and numpy includes it:
In [35]: df.a.min()
Out[35]: 2.0
In [36]: df.a.values.min()
Out[36]: nan
But then for some reason, calling numpy as a stand-alone function excludes the NaN, which seems to contradict their docs:
In [37]: np.min(df.a)
Out[37]: 2.0
And if I try their version that explicitly excludes NaN, I get back a Series instead of a scalar!
In [38]: np.nanmin(df.a)
Out[38]:
2017-04-03 2.0
2017-04-04 2.0
2017-04-05 2.0
Name: a, dtype: float64
So it seems there are lots of unexpected results here.
@chrisaycock Concerning your first question,
Forgetting offsets for a moment, why does min_periods cause this to have a different value?
In [23]: df.rolling(3)['a'].min()
Out[23]:
2017-04-03 NaN
2017-04-04 NaN
2017-04-05 NaN
Name: a, dtype: float64
In [24]: df.rolling(3, min_periods=1)['a'].min()
Out[24]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 2.0
Name: a, dtype: float64
for a fixed width rolling window (specified by an integer), the default value for the parameter min_periods is the width of the window.
These are equivalent:
In [406]: df.rolling(3)['a'].min()
Out[406]:
2017-04-03 NaN
2017-04-04 NaN
2017-04-05 NaN
Name: a, dtype: float64
In [407]: df.rolling(3, min_periods=3)['a'].min()
Out[407]:
2017-04-03 NaN
2017-04-04 NaN
2017-04-05 NaN
Name: a, dtype: float64
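The two defaults mentioned in this thread (window width for an integer window, 1 for an offset window) can be checked side by side. This is a small sketch, assuming a reasonably recent pandas; it uses '3D' as the spelling of the three-day offset, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

index = pd.to_datetime(['2017-04-03', '2017-04-04', '2017-04-05'])
s = pd.Series([np.nan, 2.0, 3.0], index=index, name='a')

# Integer window: omitting min_periods behaves like min_periods=window.
fixed_default = s.rolling(3).min()
fixed_explicit = s.rolling(3, min_periods=3).min()

# Offset window: omitting min_periods behaves like min_periods=1.
offset_default = s.rolling('3D').sum()
offset_explicit = s.rolling('3D', min_periods=1).sum()

# Series.equals treats NaNs in the same positions as equal.
print(fixed_default.equals(fixed_explicit))
print(offset_default.equals(offset_explicit))
```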
@chrisaycock For the other questions, you are passing a Pandas Series as an argument to Numpy functions, which expect an array or an ndarray.
If you use the Pandas Series attribute .values, you get a Numpy ndarray and Numpy functions give the expected results:
In [23]: np.min(df.a.values)
Out[23]: nan
In [24]: np.nanmin(df.a.values)
Out[24]: 2.0
Nevertheless, I think this is a digression with respect to the original issue: Pandas min and max functions (contrary to sum and others) do not give the expected output when there is a NaN within a time-aware (specified by a time offset) rolling window.
Oh, so this works for the numeric ones, just not min/max with an offset?
If that is the case, it is a bug.
@jreback This is the output for other functions:
In [30]: df.rolling('3d')['a'].sum()
Out[30]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 5.0
Name: a, dtype: float64
In [31]: df.rolling('3d')['a'].count()
Out[31]:
2017-04-03 NaN
2017-04-04 1.0
2017-04-05 2.0
Name: a, dtype: float64
In [32]: df.rolling('3d')['a'].mean()
Out[32]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 2.5
Name: a, dtype: float64
In [33]: df.rolling('3d')['a'].median()
Out[33]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 2.5
Name: a, dtype: float64
whereas this is the output for the functions min and max:
In [34]: df.rolling('3d')['a'].min()
Out[34]:
2017-04-03 NaN
2017-04-04 NaN
2017-04-05 NaN
Name: a, dtype: float64
In [35]: df.rolling('3d')['a'].max()
Out[35]:
2017-04-03 NaN
2017-04-04 NaN
2017-04-05 NaN
Name: a, dtype: float64
This may not be due to min_periods but rather to the min/max implementation, since doing it with an apply gives the expected result:
In [14]: df.rolling('3d', min_periods=1)['a'].apply(lambda x: np.nanmin(x))
Out[14]:
2017-04-03 NaN
2017-04-04 2.0
2017-04-05 2.0
Name: a, dtype: float64
But I agree with @albertvillanova, this is certainly a bug.
@chrisaycock
But then for some reason, calling numpy as a stand-alone function excludes the NaN
That is because if you do np.min(series), under the hood it will check if the series object has a min method, and use that. So that actually uses series.min(), hence the confusing result if you expected numpy nan semantics.
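The dispatch described above can be observed directly. The following is a minimal sketch, assuming a reasonably recent NumPy and pandas; it only compares results and does not inspect NumPy internals, and it uses Series.to_numpy() where the earlier comments use .values.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, 3.0], name='a')

# np.min on a Series ends up calling Series.min, which skips NaN by default.
via_numpy_on_series = np.min(s)
via_series_method = s.min()

# On a plain ndarray, NumPy semantics apply: NaN propagates.
via_numpy_on_array = np.min(s.to_numpy())

# np.nanmin is the explicit NaN-skipping reduction for ndarrays.
via_nanmin_on_array = np.nanmin(s.to_numpy())

print(via_numpy_on_series, via_series_method,
      via_numpy_on_array, via_nanmin_on_array)
```

Converting to an ndarray first is the way to get NumPy's NaN semantics when that is what is wanted.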
My point was the inconsistent nanmin, which apparently has been reported before as #8383 and numpy/numpy#5114.
Regarding this particular issue, yes, min/max should have the same behavior as sum. The min_periods is a red herring.
ok if someone wants to take a crack at this, have at it.
Was it fixed?
@jreback this is working fine in 0.24.x
@ihsansecer how is this on master?
If it is working, do we have a test for this?
thanks for checking @ihsansecer
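A regression check along the lines asked about above might look like the following sketch. The function names are hypothetical and this is not the test in the pandas test suite; it assumes a pandas version in which the behaviour is fixed (0.24.x or later, according to the comment above).

```python
import numpy as np
import pandas as pd


def make_frame():
    # Same data as in the original report.
    index = pd.to_datetime(['2017-04-03', '2017-04-04', '2017-04-05'])
    return pd.DataFrame({'a': [np.nan, 2.0, 3.0]}, index=index)


def check_offset_min_max_skip_nan():
    """Hypothetical check: offset-window min/max should skip NaN like sum."""
    df = make_frame()
    roll = df.rolling('3D', min_periods=1)['a']
    expected_min = pd.Series([np.nan, 2.0, 2.0], index=df.index, name='a')
    expected_max = pd.Series([np.nan, 2.0, 3.0], index=df.index, name='a')
    # assert_series_equal raises AssertionError on any mismatch.
    pd.testing.assert_series_equal(roll.min(), expected_min)
    pd.testing.assert_series_equal(roll.max(), expected_max)
    return True


print(check_offset_min_max_skip_nan())
```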