Pymc3: DOCS: Model selection and Model averaging

Created on 18 Jan 2016 · 35Comments · Source: pymc-devs/pymc3

Someone (it'll probably be me) should read this and write up some notes http://arxiv.org/abs/1503.08650

This will give a good explanation of the strengths and weaknesses of the various model-selection metrics we have, and what we also need to add to PyMC3.

Adding here as a placeholder.

This is for you @jonsedar and @twiecki

beginner friendly docs

Source

springcoil

Most helpful comment

First, it's great to have demos!

I recommend dropping DIC and BPIC as you can always recommend to use WAIC instead, and having too many options causes just confusion.

I also recommend using PSIS-LOO instead of WAIC, because it's more reliable and has better diagnostics as discussed in http://link.springer.com/article/10.1007/s11222-016-9696-4 (preprint https://arxiv.org/abs/1507.04544), but if you insist to have one information criterion then leave WAIC.

For model averaging see https://arxiv.org/abs/1704.02030
Bayesian stacking introduced in that paper doesn't require much more computation compared to information criterion based weighting, but is much better.

Aki

avehtari on 6 Jul 2017

👍2

All 35 comments

Good find, that looks really useful. If you have time to update the model selection example, go for it - otherwise I'll find time over the next couple of weeks.

jonsedar on 18 Jan 2016

You think it should all be included in the same documentation/ notebook? Just a few notes summarising the math or do we have other things to add in?

springcoil on 18 Jan 2016

I'd link to the paper, perhaps add in some math where appropriate.

There's also a TODO on crossval - which I'm planning to add. It's made fairly straightforward by using sklearn.cross_validation.ShuffleSplit to select and theano.shared variables to easily substitute train / test sets, then use pm.sample_ppc to make predictions on the test set. I've used it on my own work and can add soon, but am a little pressed for time this week. If you're looking for something to do please feel free!

jonsedar on 18 Jan 2016

For cross-validation see also https://github.com/avehtari/PSIS. There's Python reference code for Pareto smoothed importance sampling cross-validation, too.

avehtari on 18 Jan 2016

Guys, I'm a bit pressed this week. If I have time I'll have a stab at one of those contributions. :)

springcoil on 18 Jan 2016

I've not actually gotten around to doing this but @fonnesbeck do we have cross-validation now?

springcoil on 3 Sep 2016

We have leave-one-out in loo

fonnesbeck on 3 Sep 2016

We have some of this cross-validation stuff now, and I think it's a lot easier to add examples/ documentation.

I've not really got the time or the capacity so I'll leave this as an exercise for the reader :)

springcoil on 13 Mar 2017

😄1

I just want to mention that we already have 3 notebooks-examples related to model selection.

model selection
model averaging
model selection in the context of the GLM module.

Certainly there is room for improvement (_unification?_) of these examples.

aloctavodia on 6 Jul 2017

Thanks @aloctavodia What I am doing now is
1, extending the GLM-model-selection (add bpic and loo),
2, maybe combine model_comparison and model_averaging together
or
extend model_comparison with math and explanation from @avehtari 's survey paper.
or both above (input welcome).

junpenglao on 6 Jul 2017

I think adding some theory to model_comparison is something we should do. Not sure about merging it with mode_averaging, but my only concern is to get a too long notebook.

Should we remove bpic? given that we have DIC, WAIC and LOO

aloctavodia on 6 Jul 2017

I am not too familiar with Bpic, but BPIC and DIC is more or less the same with different weight:

DIC: 2 * mean_deviance - deviance_at_mean
BPIC: 3 * mean_deviance - 2 * deviance_at_mean

junpenglao on 6 Jul 2017

First, it's great to have demos!

I recommend dropping DIC and BPIC as you can always recommend to use WAIC instead, and having too many options causes just confusion.

Aki

avehtari on 6 Jul 2017

👍2

@avehtari thanks for chiming in. Those suggestions make sense to me.

twiecki on 6 Jul 2017

@avehtari thanks for the great suggestions, I will take that into account.

@aloctavodia Is the weight output from pm.compare support for Bayesian Stacking as suggested by @avehtari in the comments above?

junpenglao on 7 Jul 2017

@junpenglao The paper suggested by @avehtari is in my To-Read list, so ideas in that paper are not incorporated in pm.compare yet. At this point pm.compare performs a very simple AIC-type weighting (but uses WAIC or LOO instead of AIC). From a quick look at the paper I think we can improve what pm.compare does by doing a Bayesian Bootstrap (but I still have to read the details). Additionally we can have the Bayesian Stacking as another option, apparently is a more reliable method but also a more expensive one.

aloctavodia on 7 Jul 2017

Sounds good to me. But maybe we should have a separate function for model averaging?

junpenglao on 7 Jul 2017

@aloctavodia Something that came up in the Bound doc: Currently Bound doesn't fix the density so that it integrates to 1, wouldn't this break the model comparison code? It does require the actual logp of the observations, not just something proportional to it, right? If so, that should be state clearly in the docs.

aseyboldt on 7 Jul 2017

hmm good point @aseyboldt , both WAIC and LOO computes the logp of each element in model (https://github.com/pymc-devs/pymc3/blob/master/pymc3/stats.py#L127), it might be problematic with Bound

junpenglao on 7 Jul 2017

@junpenglao at this point we can use compare to compute weights and then sample_ppc_w, that accepts a set of weights, to compute weighted posterior predictive samples. We can also have a compareplot that can be used together with compare, to analyze the result of the comparison.

@aseyboldt I think you are right and that can be problematic. I rememberer that I was able to reproduce the values of WAIC in the book Statistical Rethinking, I should probably check this again...

aloctavodia on 7 Jul 2017

@aloctavodia cool, in that case we can just add a few more columns to the output of pm.compare function to have Bayesian Stacking weight etc.

junpenglao on 7 Jul 2017

@junpenglao yeah that right, or alternatively we can ask compare to compute the weights in one way or the other. I guess the decision will depends on the computational cost of getting those weights

aloctavodia on 7 Jul 2017

FYI @aloctavodia @aseyboldt

x = np.random.randn(100,2)+1
x[x<0]=0

with pm.Model() as model1:
    mu= pm.Normal('mu', 0., 100., shape=2)
    sd= pm.Gamma('sd', alpha=1.,beta=1.)
    obs=pm.Normal('obs', mu=mu, sd=sd, observed=x)
    trace1=pm.sample()

with pm.Model() as model2:
    mu= pm.Bound(pm.Normal, lower=0.)('mu', mu=0., sd=100., shape=2)
    sd= pm.Gamma('sd', alpha=1.,beta=1.)
    obs=pm.Normal('obs', mu=mu, sd=sd, observed=x)
    trace2=pm.sample()

pm.compare([trace1,trace2],[model1,model2])
      WAIC    pWAIC     dWAIC    weight       SE        dSE warning
1  515.267  2.38154         0  0.617567  16.4203          0       0
0  516.225  2.90537  0.958468  0.382433  16.5625  0.0030158       0

pm.compare([trace1,trace2],[model1,model2], ic='LOO')
       LOO     pLOO     dLOO    weight       SE         dSE warning
1  515.451  2.47385        0  0.637353  16.4317           0       1
0  516.579  3.08234  1.12778  0.362647  16.5911  0.00699399       1

junpenglao on 7 Jul 2017

Great! @junpenglao. It seems that differences are due to sample size, increasing sample size should reduce them.

aloctavodia on 7 Jul 2017

@aloctavodia Why sample size? The only difference between the models is how the HalfNormals prior on mu is defined. They should be identical. But the logp for mu in the second version is always half that of the the logp in the fist model.
I'm not sure how to deal with that other than just putting a couple of fat warnings in the doc and docstring.
I guess proper support for truncated distributions (using the cdf of the distributions) would solve it.

aseyboldt on 7 Jul 2017

@junpenglao I mean the samples from the posterior. I am not able to check this at the moment, but I think both traces are not exactly equal so the differences are explained just by sampling. Could you compare the traces and also increase the number of steps?

aloctavodia on 7 Jul 2017

I tried it with 1e5 samples and the second weight is still smaller than the first. Also, the standard error of the difference if LOO and WAIC is much smaller than the difference. I reran it a couple of times and got this result consistently.

aseyboldt on 7 Jul 2017

👍1

Sorry, I did it again and can't reproduce it now. Maybe I messed up the sorting order earlier one or two times. However, the standard error seems to be off. If I use the same model twice, the dWAIC values are consistently much larger than dSE:

      WAIC    pWAIC      dWAIC    weight       SE          dSE warning
1  46.5785  2.26211          0  0.502283  3.90641            0       1
0  46.5967  2.27682  0.0182618  0.497717  3.90201  0.000133433       1

Or am I interpreting the values incorrectly?

aseyboldt on 7 Jul 2017

Is it possible there is a missing square root in the SE and dSE somewhere?

aseyboldt on 7 Jul 2017

@aseyboldt you are right! There is a mistake. The square root is misplaced the line d_se = len(diff) * 0.5 * np.var(diff) should be d_se = (len(diff) * np.var(diff))* 0.5

aloctavodia on 8 Jul 2017

@aloctavodia are you sending a PR?

junpenglao on 8 Jul 2017

Also isnt Standard error usually compute as SD/sqrt(n) -> np.std(diff)/np.sqrt(len(diff))

junpenglao on 8 Jul 2017

@junpenglao I will not be able to send a PR anytime soon. Today I am begining my vacations and I did not bring my notebook with me. If you or someone else wants to work on this stuff there are a couple of references you can check for related functions. https://github.com/stan-dev/loo and https://github.com/rmcelreath/rethinking

Also there are some examples of WAIC computations that can be compared to the ones in the book Statistical rethinking book here https://github.com/aloctavodia/Statistical-Rethinking-with-Python-and-PyMC3

After vacations I have a couple of things to do and then come back to Argentina, so I will not be able to work on this until the last days of July. Sorry for that.

aloctavodia on 8 Jul 2017

@aloctavodia no problem! enjoy your holiday ;-)

junpenglao on 8 Jul 2017

I think all this is basically covered now

aloctavodia on 19 Aug 2017

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Multiprocessing failure

fonnesbeck · 49Comments

Memory Error: OrderedLogistic

aloctavodia · 19Comments

Sampling in parallel bugs under Windows OS

HH-1 · 20Comments

Vectors of multivariate variables

fonnesbeck · 88Comments

NUTS Sampler Fails with pygpu.gpuarray.GpuArrayException Error

MisterRedactus · 22Comments