I had some code running with v0.8.0, but it won't work anymore with version 0.9.0. The issue can be reproduced simply with the following data and code:
from statsmodels.stats.diagnostic import lilliefors
lilliefors([18.50117949, 18.24272814, 18.50674601])
With 0.8.0 I just got a pandas warning when importing
/usr/lib64/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
from pandas.core import datetools
But now, I get an error when actually running lilliefors on those data:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/testone/lib64/python3.6/site-packages/statsmodels/stats/_lilliefors.py", line 344, in kstest_fit
pval = lilliefors_table.prob(d_ks, nobs)
File "/home/testone/lib64/python3.6/site-packages/statsmodels/stats/tabledist.py", line 120, in prob
critv = self._critvals(n)
File "/home/testone/lib64/python3.6/site-packages/statsmodels/stats/tabledist.py", line 99, in _critvals
return np.array([p(n) for p in self.polyn])
File "/home/testone/lib64/python3.6/site-packages/statsmodels/stats/tabledist.py", line 99, in <listcomp>
return np.array([p(n) for p in self.polyn])
File "/usr/lib64/python3.6/site-packages/scipy/interpolate/polyint.py", line 79, in __call__
y = self._evaluate(x)
File "/usr/lib64/python3.6/site-packages/scipy/interpolate/interpolate.py", line 634, in _evaluate
below_bounds, above_bounds = self._check_bounds(x_new)
File "/usr/lib64/python3.6/site-packages/scipy/interpolate/interpolate.py", line 663, in _check_bounds
raise ValueError("A value in x_new is below the interpolation "
ValueError: A value in x_new is below the interpolation range.
I checked the doc, but I did not see any specific instruction on how data should be: is this a regression or am I actually providing wrong data? In the latter case, can you refer me to the relative doc?
Thanks
My guess is that this is a bug/regression. AFAIR the pvalue handling has been refactored in 0.9.
(It could be a floating point issue at the boundary.)
I will try to investigate later today.
(sorry for the delay, slow in switching computers again)
It is a regression if it worked before 0.9.
AFAICS, using the debugger, the minimum sample size for the p-value computation with lilliefors_table.prob(d_ks, nobs) is 4.
The relevant scipy.interpolate.interpolate.interp1d has x values from 4 to 900 (inclusive)
There is also a problem with nobs > 900 but not in all cases
>>> lilliefors(np.random.rand(1500))
(0.064788061921975615, 3.5366480034031793e-16)
>>> lilliefors(np.random.randn(1500))
...
raise ValueError("A value in x_new is above the interpolation "
ValueError: A value in x_new is above the interpolation range.
>>> np.random.seed(1)
>>> xx = np.random.randn(1500)
>>> lilliefors(xx[:900])
(0.014717319572921217, 0.20000000000000001)
>>> lilliefors(xx[:901])
...
raise ValueError("A value in x_new is above the interpolation "
ValueError: A value in x_new is above the interpolation range.
also lilliefors(xx[:901], pvalmethod='approx') and lilliefors(xx[:901], pvalmethod='table') produce the same exception
definitely a bug introduced during refactoring in 0.9 (assuming it worked before which I haven't checked)
PR that made the change is #3936
I guess the relevant lines are https://github.com/statsmodels/statsmodels/pull/3936/files#diff-6ad3bd17ebe1ff645184564be437e2e1R341
i.e. we switch to table even if pvalmethod='approx' if approx pval > 0.1
another issue is that tabledist should handle "out of support" points
maybe relevant #4355
general:
maybe switch default to pvalmethod=None which is essentially an 'auto' option to choose the best that we have for a given case.
then 'table' and 'approx' don't have to switch methods.
Note: table and approx are just different ways of computing the same distribution values. It is not like a substantive method like choosing between asymptotic and bootstrap pvalues.
the tables (norm and expon) go only up to pvalue=0.2,
approx formula says intended for p-value<=0.1 but bounds are not checked in pval_lf
get_lilliefors_table has an extension to larger samples, but it is not used in norm. I guess the reason was that the larger sample extension is missing one value, i.e. shape mismatch.
I think we could take just the last value of the smaller sample table, i.e for nobs=900
from f(n)
>>> np.array([0.736, 0.768, 0.805, 0.886, 1.031]) / np.sqrt(900) * 1000
array([ 24.53333333, 25.6 , 26.83333333, 29.53333333, 34.36666667])
which is close to the smaller nobs table, for nobs=900
25, 26, 28, 30, 35, 42
Instead of extending the table, we could also just use 1-D interpolation on f(n) for n is sample nobs.
This would avoid the need for any upper limit.
I'm not sure what to do for nobs=3.
any p-values are likely large, and the only information we might have is pvalue > 0.1.
I don't see any checking for nobs=2.
Based on recent scipy discussion, I guess the pvalue should be 1. (we estimate mean and var and then there is no degrees of freedom left in normal case.)
https://github.com/scipy/scipy/issues/7730#issuecomment-433062911