pandas.DataFrame.interpolate() does not extrapolate even when asked to, depending on the interpolation method

Created on 13 Feb 2020 · 3 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

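# One interior gap (index 2) and four trailing NaNs (index 5 to 8)
a = pd.Series([0, 1, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan])
# limit_area=None puts no restriction on which NaNs may be filled
a_int = a.interpolate(method='cubic', limit_area=None)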

Problem description

Some of the offered methods (apparently all of those backed by interp1d) cannot extrapolate over np.nan. However, the limit_area parameter of df.interpolate() suggests that extrapolation can be forced. Combining limit_area=None with a method that cannot extrapolate should at least raise a warning.

There used to be a similar issue where extrapolation over trailing NaN was done unintentionally, so maybe the fix for that overdid it. https://github.com/pandas-dev/pandas/issues/8000

Expected Output

Extrapolation over the trailing NaNs in the series is expected. Using a different method, such as pchip, achieves this.
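A possible workaround for the interp1d-backed methods, offered as a sketch rather than a confirmed fix: it assumes that the installed pandas forwards the extra keyword argument fill_value to scipy.interpolate.interp1d, where the value "extrapolate" enables evaluation outside the data range. It needs SciPy, like the snippet above. The variable names default and extrapolated are illustrative only.

```python
import numpy as np
import pandas as pd

a = pd.Series([0, 1, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan])

# Behaviour reported in this issue: the interior gap is filled,
# the trailing NaNs are left untouched.
default = a.interpolate(method="cubic", limit_area=None)

# Assumption: fill_value is passed through to scipy.interpolate.interp1d,
# which extrapolates when it receives the string "extrapolate".
extrapolated = a.interpolate(method="cubic", fill_value="extrapolate")

print(default.isna().sum())           # number of NaNs still present
print(extrapolated.round(6).tolist())
```

If the keyword is not forwarded in a given pandas version, the second call should behave like the first, so comparing the two results is a quick way to check.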

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 0.25.3 (also tested with 1.0.0)
numpy : 1.15.4
pytz : 2018.9
dateutil : 2.7.5
pip : 20.0.2
setuptools : 41.0.1
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : 1.8.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.0.3
numexpr : None
odfpy : None
openpyxl : 2.5.12
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

API - Consistency Docs Missing-data

typorian

👍4

All 3 comments

I second this.

Also, even when it works, it doesn't. The implied meaning of "extrapolate" is that it will continue on the last available trend. However, the observed result is that the last value is repeated.

In:

a = pd.Series([0, 1, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan])
a.interpolate(method='linear', limit_area=None)

Out:

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    4.0
6    4.0
7    4.0
8    4.0
dtype: float64

fercook on 19 Apr 2020

👍5
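To make the difference between "repeat the last value" and "continue the last trend" concrete, here is a minimal sketch using only pandas and NumPy. The helper extend_last_trend is hypothetical (it is not part of pandas) and assumes an evenly spaced, positional index.

```python
import numpy as np
import pandas as pd


def extend_last_trend(s: pd.Series) -> pd.Series:
    """Hypothetical helper: fill interior gaps linearly, then continue the
    slope of the last two filled points across the trailing NaNs.
    Assumes evenly spaced positions (e.g. a default RangeIndex)."""
    # Fill only the NaNs that are surrounded by valid values.
    out = s.interpolate(method="linear", limit_area="inside")
    values = out.to_numpy(dtype=float, copy=True)
    valid = np.flatnonzero(~np.isnan(values))
    if len(valid) < 2:
        return out  # not enough points to define a trend
    prev, last = valid[-2], valid[-1]
    slope = (values[last] - values[prev]) / (last - prev)
    # Positions of the trailing NaNs, extended along the last slope.
    tail = np.arange(last + 1, len(values))
    values[tail] = values[last] + slope * (tail - last)
    return pd.Series(values, index=s.index, name=s.name)


a = pd.Series([0, 1, np.nan, 3, 4, np.nan, np.nan, np.nan, np.nan])
repeated = a.interpolate(method="linear", limit_area=None)  # tail repeats the last value
extended = extend_last_trend(a)                             # tail follows the last slope
print(pd.DataFrame({"repeated": repeated, "extended": extended}))
```

Leading NaNs are left alone here; handling them would mirror the same steps using the first two valid points.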

I also stumbled on this bug.

Also, the examples in the current documentation are confusing. Extrapolation is mentioned there ("fill NaNs outside valid values (extrapolate)", https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html), as in this example:

df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list('abcd'))
df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0

df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0

... but this is not a linear extrapolation: the last value of column a is 2.0 repeated, whereas continuing the trend would give 3.0.

RealJTG on 27 Apr 2020

👍2
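The same documentation example can be used to check what happens at both edges. The sketch below only adds limit_direction="both" to the call quoted in the comment; the names filled, leading_b and trailing_a are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list("abcd"))

# Fill in both directions so leading NaNs are touched as well.
filled = df.interpolate(method="linear", limit_direction="both", axis=0)

# Edge NaNs take the nearest valid value instead of following the slope.
leading_b = filled.loc[0, "b"]    # nearest valid value is 2.0; the trend would give 1.0
trailing_a = filled.loc[3, "a"]   # nearest valid value is 2.0; the trend would give 3.0
print(filled)
```

With method="linear", NaNs outside the valid range are therefore padded with the edge value, which is what the documentation labels "extrapolate".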

Based on the discussion in #8000, it seems we need an argument to request extrapolation at the beginning and end of the series. Alternatively, the docs could state that interpolate does not provide such extrapolation.