Pymc3: DOCS: Model selection and Model averaging

Created on 18 Jan 2016  路  35Comments  路  Source: pymc-devs/pymc3

Someone (it'll probably be me) should read this and write up some notes http://arxiv.org/abs/1503.08650

This will give a good explanation of the strengths and weaknesses of the various model-selection metrics we have, and what we also need to add to PyMC3.

Adding here as a placeholder.

This is for you @jonsedar and @twiecki

beginner friendly docs

Most helpful comment

First, it's great to have demos!

I recommend dropping DIC and BPIC as you can always recommend to use WAIC instead, and having too many options causes just confusion.

I also recommend using PSIS-LOO instead of WAIC, because it's more reliable and has better diagnostics as discussed in http://link.springer.com/article/10.1007/s11222-016-9696-4 (preprint https://arxiv.org/abs/1507.04544), but if you insist to have one information criterion then leave WAIC.

For model averaging see https://arxiv.org/abs/1704.02030
Bayesian stacking introduced in that paper doesn't require much more computation compared to information criterion based weighting, but is much better.

Aki

All 35 comments

Good find, that looks really useful. If you have time to update the model selection example, go for it - otherwise I'll find time over the next couple of weeks.

You think it should all be included in the same documentation/ notebook? Just a few notes summarising the math or do we have other things to add in?

I'd link to the paper, perhaps add in some math where appropriate.

There's also a TODO on crossval - which I'm planning to add. It's made fairly straightforward by using sklearn.cross_validation.ShuffleSplit to select and theano.shared variables to easily substitute train / test sets, then use pm.sample_ppc to make predictions on the test set. I've used it on my own work and can add soon, but am a little pressed for time this week. If you're looking for something to do please feel free!

For cross-validation see also https://github.com/avehtari/PSIS. There's Python reference code for Pareto smoothed importance sampling cross-validation, too.

Guys, I'm a bit pressed this week. If I have time I'll have a stab at one of those contributions. :)

I've not actually gotten around to doing this but @fonnesbeck do we have cross-validation now?

We have leave-one-out in loo

We have some of this cross-validation stuff now, and I think it's a lot easier to add examples/ documentation.

I've not really got the time or the capacity so I'll leave this as an exercise for the reader :)

I just want to mention that we already have 3 notebooks-examples related to model selection.

Certainly there is room for improvement (_unification?_) of these examples.

Thanks @aloctavodia What I am doing now is
1, extending the GLM-model-selection (add bpic and loo),
2, maybe combine model_comparison and model_averaging together
or
extend model_comparison with math and explanation from @avehtari 's survey paper.
or both above (input welcome).

I think adding some theory to model_comparison is something we should do. Not sure about merging it with mode_averaging, but my only concern is to get a too long notebook.

Should we remove bpic? given that we have DIC, WAIC and LOO

I am not too familiar with Bpic, but BPIC and DIC is more or less the same with different weight:

DIC: 2 * mean_deviance - deviance_at_mean
BPIC: 3 * mean_deviance - 2 * deviance_at_mean

First, it's great to have demos!

I recommend dropping DIC and BPIC as you can always recommend to use WAIC instead, and having too many options causes just confusion.

I also recommend using PSIS-LOO instead of WAIC, because it's more reliable and has better diagnostics as discussed in http://link.springer.com/article/10.1007/s11222-016-9696-4 (preprint https://arxiv.org/abs/1507.04544), but if you insist to have one information criterion then leave WAIC.

For model averaging see https://arxiv.org/abs/1704.02030
Bayesian stacking introduced in that paper doesn't require much more computation compared to information criterion based weighting, but is much better.

Aki

@avehtari thanks for chiming in. Those suggestions make sense to me.

@avehtari thanks for the great suggestions, I will take that into account.

@aloctavodia Is the weight output from pm.compare support for Bayesian Stacking as suggested by @avehtari in the comments above?

@junpenglao The paper suggested by @avehtari is in my To-Read list, so ideas in that paper are not incorporated in pm.compare yet. At this point pm.compare performs a very simple AIC-type weighting (but uses WAIC or LOO instead of AIC). From a quick look at the paper I think we can improve what pm.compare does by doing a Bayesian Bootstrap (but I still have to read the details). Additionally we can have the Bayesian Stacking as another option, apparently is a more reliable method but also a more expensive one.

Sounds good to me. But maybe we should have a separate function for model averaging?

@aloctavodia Something that came up in the Bound doc: Currently Bound doesn't fix the density so that it integrates to 1, wouldn't this break the model comparison code? It does require the actual logp of the observations, not just something proportional to it, right? If so, that should be state clearly in the docs.

hmm good point @aseyboldt , both WAIC and LOO computes the logp of each element in model (https://github.com/pymc-devs/pymc3/blob/master/pymc3/stats.py#L127), it might be problematic with Bound

@junpenglao at this point we can use compare to compute weights and then sample_ppc_w, that accepts a set of weights, to compute weighted posterior predictive samples. We can also have a compareplot that can be used together with compare, to analyze the result of the comparison.

@aseyboldt I think you are right and that can be problematic. I rememberer that I was able to reproduce the values of WAIC in the book Statistical Rethinking, I should probably check this again...

@aloctavodia cool, in that case we can just add a few more columns to the output of pm.compare function to have Bayesian Stacking weight etc.

@junpenglao yeah that right, or alternatively we can ask compare to compute the weights in one way or the other. I guess the decision will depends on the computational cost of getting those weights

FYI @aloctavodia @aseyboldt

x = np.random.randn(100,2)+1
x[x<0]=0

with pm.Model() as model1:
    mu= pm.Normal('mu', 0., 100., shape=2)
    sd= pm.Gamma('sd', alpha=1.,beta=1.)
    obs=pm.Normal('obs', mu=mu, sd=sd, observed=x)
    trace1=pm.sample()

with pm.Model() as model2:
    mu= pm.Bound(pm.Normal, lower=0.)('mu', mu=0., sd=100., shape=2)
    sd= pm.Gamma('sd', alpha=1.,beta=1.)
    obs=pm.Normal('obs', mu=mu, sd=sd, observed=x)
    trace2=pm.sample()
pm.compare([trace1,trace2],[model1,model2])
      WAIC    pWAIC     dWAIC    weight       SE        dSE warning
1  515.267  2.38154         0  0.617567  16.4203          0       0
0  516.225  2.90537  0.958468  0.382433  16.5625  0.0030158       0

pm.compare([trace1,trace2],[model1,model2], ic='LOO')
       LOO     pLOO     dLOO    weight       SE         dSE warning
1  515.451  2.47385        0  0.637353  16.4317           0       1
0  516.579  3.08234  1.12778  0.362647  16.5911  0.00699399       1

Great! @junpenglao. It seems that differences are due to sample size, increasing sample size should reduce them.

@aloctavodia Why sample size? The only difference between the models is how the HalfNormals prior on mu is defined. They should be identical. But the logp for mu in the second version is always half that of the the logp in the fist model.
I'm not sure how to deal with that other than just putting a couple of fat warnings in the doc and docstring.
I guess proper support for truncated distributions (using the cdf of the distributions) would solve it.

@junpenglao I mean the samples from the posterior. I am not able to check this at the moment, but I think both traces are not exactly equal so the differences are explained just by sampling. Could you compare the traces and also increase the number of steps?

I tried it with 1e5 samples and the second weight is still smaller than the first. Also, the standard error of the difference if LOO and WAIC is much smaller than the difference. I reran it a couple of times and got this result consistently.

Sorry, I did it again and can't reproduce it now. Maybe I messed up the sorting order earlier one or two times. However, the standard error seems to be off. If I use the same model twice, the dWAIC values are consistently much larger than dSE:

      WAIC    pWAIC      dWAIC    weight       SE          dSE warning
1  46.5785  2.26211          0  0.502283  3.90641            0       1
0  46.5967  2.27682  0.0182618  0.497717  3.90201  0.000133433       1

Or am I interpreting the values incorrectly?

Is it possible there is a missing square root in the SE and dSE somewhere?

@aseyboldt you are right! There is a mistake. The square root is misplaced the line d_se = len(diff) * 0.5 * np.var(diff) should be d_se = (len(diff) * np.var(diff))* 0.5

@aloctavodia are you sending a PR?

Also isnt Standard error usually compute as SD/sqrt(n) -> np.std(diff)/np.sqrt(len(diff))

@junpenglao I will not be able to send a PR anytime soon. Today I am begining my vacations and I did not bring my notebook with me. If you or someone else wants to work on this stuff there are a couple of references you can check for related functions. https://github.com/stan-dev/loo and https://github.com/rmcelreath/rethinking

Also there are some examples of WAIC computations that can be compared to the ones in the book Statistical rethinking book here https://github.com/aloctavodia/Statistical-Rethinking-with-Python-and-PyMC3

After vacations I have a couple of things to do and then come back to Argentina, so I will not be able to work on this until the last days of July. Sorry for that.

@aloctavodia no problem! enjoy your holiday ;-)

I think all this is basically covered now

Was this page helpful?
0 / 5 - 0 ratings

Related issues

fonnesbeck picture fonnesbeck  路  49Comments

aloctavodia picture aloctavodia  路  19Comments

HH-1 picture HH-1  路  20Comments

fonnesbeck picture fonnesbeck  路  88Comments

MisterRedactus picture MisterRedactus  路  22Comments