See also the discussion at StackOverflow.
Linear interpolation on a series with missing data at the end of the array will overwrite trailing missing values with the last non-missing value. In effect, the function extrapolates rather than strictly interpolating.
Example:
import pandas as pd
import numpy as np
a = pd.Series([np.nan, 1, np.nan, 3, 4, np.nan])
a.interpolate()
Yields (note the extrapolated 4):
0 NaN
1 1
2 2
3 3
4 4
5 4
dtype: float64
not
0 NaN
1 1
2 2
3 3
4 4
5 NaN
dtype: float64
I believe the fix is something along the lines of changing lines 1545:1546 in core/common.py from
result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid], yvalues[firstIndex:][valid])
to
result[firstIndex:][invalid] = np.interp(inds[invalid], inds[valid], yvalues[firstIndex:][valid], np.nan, np.nan)
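To illustrate the idea outside of pandas, here is a minimal sketch with plain NumPy. The array names are made up for the example and this is not the pandas internals; it only shows how the `left` and `right` arguments of `np.interp` change what happens to points outside the known range.

```python
import numpy as np

# Known sample points (hypothetical data for illustration).
x_known = np.array([1.0, 3.0])
y_known = np.array([1.0, 3.0])

# Query points: 0.0 lies before the known range, 4.0 lies after it.
x_query = np.array([0.0, 2.0, 4.0])

# Default behaviour: out-of-range points are clamped to the edge values.
clamped = np.interp(x_query, x_known, y_known)

# With left/right set to NaN, out-of-range points stay missing.
strict = np.interp(x_query, x_known, y_known, left=np.nan, right=np.nan)

print(clamped)
print(strict)
```

Only the in-range point (2.0) is filled in the second call, which is the "interpolate only" behaviour requested above.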
Please show a complete reproducible example (e.g. copy-pastable code).
Then if you would like to do a pull request, that would be great. These examples serve as the basis for a test, which should fail without a fix and pass after.
Traveling back today. I can take a look this weekend.
I'd like to see what the behavior was before I refactored this stuff.
@TomAugspurger can you circle back on this?
OK, so this is the same behavior as back in 0.11 before I refactored all the interpolate stuff.
>>> pd.__version__
>>> s = pd.Series([np.nan, 1, np.nan, 3, np.nan])
>>> s
0 NaN
1 1
2 NaN
3 3
4 NaN
dtype: float64
>>> s.interpolate()
0 NaN
1 1
2 2
3 3
4 3
dtype: float64
I'll look into adding an argument to handle the NaNs before and after. The default will have to stay the same for now, I think. Possibly switch to the "correct" default of not extrapolating later on.
Is there a workaround for now?
@Jezzamonn One workaround solution: http://stackoverflow.com/questions/25255496/dataframe-interpolate-extrapolates-over-trailing-missing-data/33390872#33390872
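For reference, a minimal sketch of one possible workaround, which is not necessarily the approach taken in the linked answer. The helper name `interpolate_inside` is hypothetical. It interpolates as usual and then masks out every position that does not have a valid value on both sides in the original series.

```python
import numpy as np
import pandas as pd


def interpolate_inside(s):
    """Linearly interpolate, keeping leading and trailing NaNs as NaN."""
    # A position is "inside" if the original series has a valid value
    # at or before it (ffill succeeds) and at or after it (bfill succeeds).
    inside = s.ffill().notna() & s.bfill().notna()
    # Interpolate with the default settings, then blank out everything
    # outside the span of valid observations.
    return s.interpolate().where(inside)


s = pd.Series([np.nan, 1, np.nan, 3, np.nan])
result = interpolate_inside(s)
print(result)
```

With this series, only index 2 is filled; the leading and trailing NaNs are left untouched.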
Any updates on this?
@cancan101 there is a closed (unmerged) PR, #8010 / #8013, which I believe was almost there. If you want to rebase it and see where it stands, that would be great.
Given that the filling of the trailing values does not follow the specified method, but just forward fills, I think we could consider this as a bug. However, of course, still a bug that people could rely upon, so not sure whether we should just change the behaviour.
This is definitely a bug. Every new pandas user will find this behaviour confusing and error-prone (as I just did). If there is code that relies on this bug, then that code has a bug too. Interpolate means interpolate, not extrapolate in any way.
You should fix it.
@relonger you are welcome to open a PR for this.
This PR actually does provide for this option: https://github.com/pandas-dev/pandas/pull/16513
You are welcome to have a look at it; it seems stalled.
Just curious: are there any updates on this issue? As of pandas 0.20.3 this is still puzzling. See StackOverflow.
see
https://github.com/pandas-dev/pandas/commit/35812eaaecebeeee0ddf07dee4b583c4eea0778
might be able to close this issue
@jreback Thanks for the link. But I just tried one of the test examples in commit 35812ea and I didn't get the expected result as in the test:
>>> pd.__version__
'0.20.3'
>>> s = pd.Series([np.nan, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan, np.nan])
>>> s
0 NaN
1 NaN
2 3.0
3 NaN
4 NaN
5 NaN
6 7.0
7 NaN
8 NaN
dtype: float64
>>> s.interpolate(method='linear', limit_area='inside')
0 NaN
1 NaN
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 7.0
8 7.0
dtype: float64
Any ideas? Should I try a newer version of pandas?
EDIT:
I also tried a newer version of pandas, '0.22.0', but still didn't get the expected results. The pandas documentation says "limit_area" is a new feature in version 0.21.0+. Any ideas?
>>> pd.__version__
'0.22.0'
@jreback UPDATE: limit_area works as expected in pandas 0.23.0+, but not in 0.21.0 or 0.22.0. Maybe the pandas documentation has a typo, since it marks limit_area as "New in version 0.21.0."?
yeah it looks like a typo; this change is in 0.23
would love a PR to update!
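For anyone landing here later, a short sketch of the `limit_area` usage discussed above, assuming pandas 0.23.0 or newer as reported in this thread. It reuses the series from the earlier comment.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 3, np.nan, np.nan, np.nan, 7, np.nan, np.nan])

# limit_area='inside' restricts filling to NaNs surrounded by valid values,
# so the leading and trailing NaNs are left alone.
inside_only = s.interpolate(method='linear', limit_area='inside')
print(inside_only)
```

Indexes 3 to 5 are filled with 4.0, 5.0 and 6.0, while indexes 0, 1, 7 and 8 remain NaN.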
xref #25418