AIC and BIC do not appear to be calculated the same way in AutoReg and ar_select_order. Here's an example.
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.ar_model import AutoReg, ar_select_order
np.random.seed(99999)
coefs = np.array([0.5, -0.25])
y = ArmaProcess(np.r_[1, -coefs]).generate_sample(250)
Then we do model selection.
modsel = ar_select_order(y, maxlag=5, old_names=False)
modsel.aic has this:
{(1, 2): 0.038243172406288724,
(1, 2, 3): 0.04552812846926287,
(1, 2, 3, 4): 0.0536414426749059,
(1, 2, 3, 4, 5): 0.061296136921122346,
(1,): 0.0735756692249579,
0: 0.26784552200885503}
I believe this means that the AIC for the model with 2 lags is 0.03824...
Estimating the model with 2 lags, I get:
res = AutoReg(y, lags=2, old_names=False).fit()
print(res.summary())
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 250
Model: AutoReg(2) Log Likelihood -353.692
Method: Conditional MLE S.D. of innovations 1.007
Date: Tue, 03 Nov 2020 AIC 0.047
Time: 16:56:07 BIC 0.103
Sample: 2 HQIC 0.070
250
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0269 0.064 0.420 0.674 -0.099 0.152
y.L1 0.5051 0.062 8.149 0.000 0.384 0.627
y.L2 -0.2020 0.062 -3.259 0.001 -0.323 -0.080
Roots
The AIC for this model doesn't match what we got above. BIC is also different. I tried all the trend options and get closest with trend n, but it's still not quite the same.
Shouldn't these be the same? Or is the model selection using different options?
I don't know the code here, so this is a generic answer.
We want to use the same data for every candidate when doing the lag search. So I guess that in ar_select_order the data is truncated to allow for 5 lags, even for models that use fewer lags.
You could check res = AutoReg(y[3:], lags=2, ...
or something like this.
Another possible difference in some models is that we want to use a fast approximate method in the specification search, and use the best method for the final model.
What @josef-pkt said: Selection requires the LHS variable to be the same, and so the maximum lag length affects the data used to select the model. When you use AutoReg it defaults to the maximum available data. If you want them to be the same you can set hold_back in AutoReg.
Thanks, that makes sense.
It looks like there's an error in the documentation for BIC. It says
ln(σ) + df_model * ln(nobs) / nobs
but it is currently calculated as
ln(σ^2) + (1 + df_model) * ln(nobs) / nobs
Also, there are a few places where it says "using Lutkephol’s definition" instead of Lutkepohl.
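The BIC point can be checked by hand against the summary table earlier in the thread. This is a rough sketch using only the rounded numbers printed there. Two things are assumptions on my part, not taken from the docs: nobs = 248 (250 observations minus the 2 held back) and df_model = 3 (constant plus two lag coefficients).

```python
import math

sigma = 1.007   # "S.D. of innovations" from the summary (rounded)
nobs = 248      # assumption: 250 observations minus the 2 held back
df_model = 3    # assumption: const + 2 lag coefficients

log_nobs = math.log(nobs)

# Formula as written in the documentation.
bic_documented = math.log(sigma) + df_model * log_nobs / nobs

# Formula as described in the comment above (what the code appears to do).
bic_computed = math.log(sigma ** 2) + (1 + df_model) * log_nobs / nobs

print(f"documented formula: {bic_documented:.3f}")  # 0.074
print(f"computed formula:   {bic_computed:.3f}")    # 0.103
```

Under those assumptions the second formula reproduces the BIC of 0.103 shown in the summary, while the documented one does not.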