Pandas: Bug in df.pct_change()

Created on 17 Feb 2014  路  7Comments  路  Source: pandas-dev/pandas

df.pct_change incorrectly calculates a value when it shouldn't.

#!/usr/bin/env python

import numpy as np
import pandas as pd
from datetime import datetime


"""
Notice that the two nonmissing data points are 8 days apart, not 7.
Therefore, any attempt to calculate a 7-day (week-over-week) percent change
should result in NaN, not an actual numeric value.
The bug is that pct_change does calculate an (erroneous) actual numeric value.
(The value it calculates is 414.53/430.30 - 1 = -0.036649.)
"""
idx = pd.DatetimeIndex(start=datetime(2013, 9, 12),
                       end=datetime(2013, 9, 21),
                       freq='D')
df = pd.DataFrame(data={'Series': [430.30,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   414.53,
                                   np.NaN]}, index=idx)
pd.show_versions()
print('\n\nInput data:')
print(df)
print("\n\nThe bug is the pct_change for 09-20 should be NaN, but it isn't:\n")
print(df.pct_change(periods=7))

Here is the output:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 12.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.19.2
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.1.0
sphinx: 1.1.3
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.0.0
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: 0.8.3
lxml: 3.2.3
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None


Input data:
            Series
2013-09-12  430.30
2013-09-13     NaN
2013-09-14     NaN
2013-09-15     NaN
2013-09-16     NaN
2013-09-17     NaN
2013-09-18     NaN
2013-09-19     NaN
2013-09-20  414.53
2013-09-21     NaN

[10 rows x 1 columns]


The bug is the pct_change for 09-20 should be NaN, but it isn't:

              Series
2013-09-12       NaN
2013-09-13       NaN
2013-09-14       NaN
2013-09-15       NaN
2013-09-16       NaN
2013-09-17       NaN
2013-09-18       NaN
2013-09-19       NaN
2013-09-20 -0.036649
2013-09-21       NaN

[10 rows x 1 columns]

All 7 comments

Not a bug, it is pad filling (as the default), which is fill-forward, you can disable by passing fill_method=None)

In [14]: df.pct_change(7,fill_method=None)
Out[14]: 
            Series
2013-09-12     NaN
2013-09-13     NaN
2013-09-14     NaN
2013-09-15     NaN
2013-09-16     NaN
2013-09-17     NaN
2013-09-18     NaN
2013-09-19     NaN
2013-09-20     NaN
2013-09-21     NaN

[10 rows x 1 columns]

In [15]: df.pct_change(7,fill_method='pad')
Out[15]: 
              Series
2013-09-12       NaN
2013-09-13       NaN
2013-09-14       NaN
2013-09-15       NaN
2013-09-16       NaN
2013-09-17       NaN
2013-09-18       NaN
2013-09-19       NaN
2013-09-20 -0.036649
2013-09-21       NaN

[10 rows x 1 columns]

@jreback While pad filling is convenient, I think it's quite dangerous as the default. "Errors should never pass silently" and all that Zen. Pandas often drops NaNs for convenience, but dropping NaNs when plotting (for example) isn't nearly as easy to overlook. In fact, I just told someone a moment ago that pct_change did not fill, which was completely inaccurate. Luckily I tested it and discovered otherwise. Reading the documentation or even checking parameters didn't cross my mind until I was surprised by the result.

Backwards compatibility is important, too, but perhaps there's a chance this method could default to fill_method=None? The existence of this "bug" report suggests no-fill is more intuitive.

For comparison,

lag = series.shift(1)
change = (series - lag) / lag

seems like it would create the same results as series.pct_change(1) but the filling is a subtle difference.

I haven't done a systematic survey of use cases, but in my experience, most time-series analysis would prefer no-fill. When fill-forward is desired, I'd expect that the fill would be done before the pct_change as a more explicit step.

Hi guys, I think this issue should be re-opened. @selik is absolutely right that "Errors should never pass silently" and unless explicitly done so, this _is_ an error. Not technically, as the default is currently fill_method='pad', but that is not an answer. That is just stating the problem. Upgrading this to fill_method=None is in now way different to deprecating inplace=True, or any other relatively major change that pandas has gone or will go through.

I agree that this is definitely dangerous as a default value and that this issue should be reopened, if not solved elsewhere.

Using pandas 0.24.1 I realised this only at the end of a complicated ML pipeline.
The default behaviour is still 'pad', which is at least deceiving in my opinion.

@jreback Can we reopen this issue with the idea of making a deprecation cycle to switch the default behavior?

pls make a new issue

New issue: #28461 .

Was this page helpful?
0 / 5 - 0 ratings