Pymc3: Implementation of path derivative gradient estimator (NIPS 2016)

Created on 20 Dec 2016 · 9 Comments · Source: pymc-devs/pymc3

At NIPS 2016, the article "Sticking the Landing: A Simple Reduced-Variance Gradient for ADVI" was presented. The authors propose an ELBO gradient estimator that has lower variance in some cases. They call it the "path derivative gradient estimator". This may be a good fit for PyMC3.

This is the abstract of the article:

Compared to the REINFORCE gradient estimator, the reparameterization trick
usually gives lower-variance estimators. We propose a simple variant of the
standard reparameterized gradient estimator for the evidence lower bound that
has even lower variance under certain circumstances. Specifically, we decompose
the derivative with respect to the variational parameters into two parts: a path
derivative and the score function. Removing the second term produces an unbiased
gradient estimator whose variance approaches zero as the approximate posterior
approaches the exact posterior. We propose that the removed term has arbitrarily
high variance when the variational posterior has a complex form, as when using
adaptive posteriors such as given by normalizing flows or stochastic Hamiltonian
inference.
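
The decomposition the abstract describes can be written out as below. This is only a sketch: the notation is assumed here rather than taken from the paper. $z = t(\epsilon, \phi)$ denotes the reparameterized sample, $\phi$ the variational parameters, and $q_\phi(z \mid x)$ the approximate posterior.

```latex
\begin{aligned}
\mathcal{L}(\phi)
  &= \mathbb{E}_{\epsilon \sim p(\epsilon)}
     \big[\log p(x, z) - \log q_\phi(z \mid x)\big],
     \qquad z = t(\epsilon, \phi), \\
\nabla_\phi \big[\log p(x, z) - \log q_\phi(z \mid x)\big]
  &= \underbrace{\nabla_z \big[\log p(x, z) - \log q_\phi(z \mid x)\big]\,
       \nabla_\phi t(\epsilon, \phi)}_{\text{path derivative}}
   \;-\;
     \underbrace{\nabla_\phi \log q_\phi(z \mid x)\Big|_{z\ \text{fixed}}}_{\text{score function}}, \\
\mathbb{E}_{z \sim q_\phi}\big[\nabla_\phi \log q_\phi(z \mid x)\big]
  &= \int \nabla_\phi q_\phi(z \mid x)\, dz
   = \nabla_\phi \int q_\phi(z \mid x)\, dz
   = 0 .
\end{aligned}
```

The last line shows why dropping the score-function term leaves the estimator unbiased. If $q_\phi(z \mid x)$ equals the exact posterior, then $\log p(x, z) - \log q_\phi(z \mid x) = \log p(x)$ does not depend on $z$, so the path-derivative term is zero for every $\epsilon$. That is the zero-variance limit the abstract mentions.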

enhancements


noe

All 9 comments

@noe Thanks for sharing!

CC @taku-y @ferrine

twiecki on 20 Dec 2016

Sounds really interesting, thank you!
Now it's time to figure out how to implement it in Theano out of the box. I have no ideas yet, let's brainstorm :)

ferrine on 20 Dec 2016

Useful link: https://gist.github.com/benanne/9212037

ferrine on 20 Dec 2016

that's done

ferrine on 21 Dec 2016

👍6

Wow @ferrine just wow :)

springcoil on 21 Dec 2016

That's the whole implementation?

fonnesbeck on 21 Dec 2016

Does anyone know how to derive the last term of eq. (11) in the paper? Without knowing that, I'm not sure the implementation is correct.

taku-y on 21 Dec 2016

@taku-y When we take the gradient of log_q_W(posterior(\phi, \epsilon)), we apply the chain rule, because log_q_W is itself parametrized by \phi. The first term is backpropagated through posterior; that is the path part. The second term comes directly from the parameters of the log_q_W function. What I do is simply set the second part to zero with theano.gradient.zero_grad.
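
To make the two chain-rule terms concrete, here is a minimal sketch in plain Python. It is not the PyMC3 implementation: the derivatives are written out by hand for a 1-D Gaussian q = N(mu, sigma^2) and a hypothetical Gaussian target p = N(m, s^2), and dropping the score term by hand stands in for what `theano.gradient.zero_grad` does in the comment above. The function name `grad_mu_estimates` and all parameter names are assumptions made for illustration.

```python
import random
import statistics


def grad_mu_estimates(mu, sigma, m, s, n_samples=2000, seed=0):
    """Single-sample estimates of d ELBO / d mu.

    q = N(mu, sigma^2) is the approximation, p = N(m, s^2) is a
    hypothetical target. Returns (full, path): the standard
    reparameterized estimates and the path-derivative estimates.
    """
    rng = random.Random(seed)
    full, path = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps               # reparameterized sample, dz/dmu = 1
        dlogp_dz = -(z - m) / s**2         # d log p(z) / dz
        dlogq_dz = -(z - mu) / sigma**2    # d log q / dz, parameters held fixed
        dlogq_dmu = (z - mu) / sigma**2    # score term: d log q / d mu, z held fixed
        path_term = (dlogp_dz - dlogq_dz) * 1.0  # backpropagated through z only
        path.append(path_term)             # score term dropped
        full.append(path_term - dlogq_dmu) # score term kept
    return full, path


# When q matches the target exactly, every path estimate is zero,
# while the full estimator still fluctuates from sample to sample.
full, path = grad_mu_estimates(mu=1.0, sigma=2.0, m=1.0, s=2.0)
print("variance (full):", statistics.pvariance(full))
print("variance (path):", statistics.pvariance(path))
```

In this toy setting both estimators have the same expectation, but the path version is not always the lower-variance one when q is far from the target, which matches the abstract's "under certain circumstances".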

ferrine on 21 Dec 2016

@ferrine Thanks for your quick response. I will check it.

taku-y on 21 Dec 2016
