Pyro: Mass Matrix Adaptation for HMC and NUTS

Created on 11 May 2018 · 27Comments · Source: pyro-ppl/pyro

I'd like to contribute the mass matrix adaptation mentioned in #1093 . I actually have some preliminary work on it already, but I've run into an issue and I'm not sure how you'd like to proceed:

Currently the MCMC kernels have no access to the number of warmup iterations (since transition between warmup and sampling is handled by the MCMC wrapper), but if we want to adapt the mass matrix during warmup using a block scheme similar to stan (see section 34.2 in the stan reference manual) then the kernel needs access to the number of warmup iterations to divide the warmup into blocks.

I thought about putting this scheduling logic into the MCMC wrapper, but since adapting a mass matrix in this way is specific to certain kernels (HMC and NUTS) it feels like it belongs in the kernels themselves.

However, giving the kernels access to the number of warmup iterations is not quite straightforward: since they are initialized by the user (not the MCMC wrapper) we can't pass it to the constructor, and currently all of the arguments to HMC.setup() are passed directly into get_trace().

Some options that come to mind are:

Add the number of warmup iterations as a parameter in HMC.setup() (Seems reasonably clean to me, but maybe I'm missing something?)
Implementing something like kernel.configure_adaptation(warmup_duration=200) which would be called by the wrapper. (This approach feels less clean because maybe not every kernel will need parameter adaptation, but it won't modify the existing kernel.setup() signature).
In the MCMC wrapper directly set a property on the kernel: self.kernel.warmup_duration = 200. (Simple, but feels more like a hack).

Any thoughts on which of these options (or others?) would be the best approach?

discussion enhancement

Source

LoganWalls

👍2

Most helpful comment

Unrolling feels like it would be more computationally efficient, but...

In case it helps, we're doing similar unrolling in pyro.contrib.autoguide and there are strategies for all of: diagonal normal, multivariate normal, and low-rank multivariate normal (really low-rank + diagonal). You might consider similarly implementing multiple strategies. See AutoGuideContinuous._unpack_latent() for the unrolling implementation.

fritzo on 17 May 2018

👍2

All 27 comments

Thanks for offering to do this! Option 2 seems fine for now; you can just add a base implementation of configure_adaptation to TraceKernel.

@neerajprad we should probably clean up the MCMC code a little bit to make it easier to customize this sort of thing in the future.

eb8680 on 11 May 2018

👍1

I would suggest going with option (1), so that we do not have two places that specify the warmup_steps - the kernel themselves and the MCMC constructor (since burnin or warmup is a more general concept).

So you can just modify the MCMC class as follows and the corresponding setup methods in HMC and NUTS. I think that should cover your use case?

self.kernel.setup(self.warmup_steps, *args, **kwargs)

neerajprad on 11 May 2018

👍1

we should probably clean up the MCMC code a little bit to make it easier to customize this sort of thing in the future.

Once we make the above change, I don't foresee any major changes needed within the MCMC class, at least immediately. Let me know if you have any suggestions on that. I have a couple of things I would like to see addressed in #1093 pertaining to restructuring the code - specially those related to being able to specify / add default warmup steps when the user chooses step size adaptation, and also a way of controlling some parameters like target acceptance probability. Please do add to that if you have any specific suggestions in mind.

neerajprad on 11 May 2018

I would suggest going with option (1), so that we do not have two places that specify the warmup_steps - the kernel themselves and the MCMC constructor (since burnin or warmup is a more general concept).

@neerajprad I agree with the idea that we shouldn't need to specify warmup_steps in multiple places, but my original thinking behind option 2 was actually to have MCMC call something like self.kernel.configure_adaptation(self.warmup_steps) so really it would only need to be specified once (in the MCMC constructor only). (But it does open up the possibility of a user accidentally calling configure_adaptation separately). That said, I have nothing against option 1 either.

Aside: Whichever way we decide, maybe the target acceptance rate (another feature from #1093 ) can be treated in a similar way in the future (either we can pass it to kernel.setup or to kernel.configure_adaptation).

With those comments in mind does either option sound better?

LoganWalls on 12 May 2018

I'm actually leaning toward option 2 now: I'm thinking down the road once other features like automatically setting the number of warmup steps, target acceptance rates, etc. are implemented, then the kernels will probably need some equivalent of a configure_adaptation method anyway (in my mind the question is more whether it should be public or private).

LoganWalls on 12 May 2018

I'm thinking down the road once other features like automatically setting the number of warmup steps, target acceptance rates, etc. are implemented, then the kernels will probably need some equivalent of a configure_adaptation method anyway (in my mind the question is more whether it should be public or private).

Thanks for sharing your thoughts. With the current structure, the way to do that would be to pass the target acceptance probability or other adaptation parameters or dictionary into the kernel constructor, and then the setup() method would just pick them up when called from MCMC. This will be simpler than introducing a separate configure_adaptation method. If the configure_adaptation method is only going to be called from inside the MCMC class, then it will just be doing what setup does now (we can certainly abstract the adaptation specific parts into a private _configure_adaptation method which is called by setup). And if its going to be called from both inside MCMC as well as by the user, that seems like a more complicated API, that we should justify.

neerajprad on 12 May 2018

Option 1 seems good to me. If the kernel knows the number of warmup step, then we don't have to use the end_warmup method in MCMC. I think that option 2 just does what option 1 does, so unnecessary.

fehiepsi on 12 May 2018

Alright, I'm sold on option 1. I'll proceed with that one if no one else has any concerns?

LoganWalls on 12 May 2018

👍1

If the kernel knows the number of warmup step, then we don't have to use the end_warmup method in MCMC.

That seems reasonable, should we make end_warmup a private method then? It doesn't appear to be called anywhere besides MCMC, and if we're going to have the kernels themselves manage the warmup then I don't think it makes as much sense to have an outward-facing method to control it.

LoganWalls on 12 May 2018

👍1

That seems reasonable, should we make end_warmup a private method then?

That sounds good!

neerajprad on 12 May 2018

👍1

Progress is coming along pretty well. I'm starting with a (non-identity) diagonal mass matrix, and I'll try to implement the dense version if I have time later. I have a quick question though (please forgive my ignorance):

When applying the mass matrix during the potential energy calculation, where does it need to be applied (before / after scaling and adjustments)?

def _potential_energy(self, z):
        # Since the model is specified in the constrained space, transform the
        # unconstrained R.V.s `z` to the constrained space.
        z_constrained = z.copy()
        for name, transform in self.transforms.items():
            z_constrained[name] = transform.inv(z_constrained[name])
        trace = self._get_trace(z_constrained)
        potential_energy = -trace.log_prob_sum()
        # adjust by the jacobian for this transformation.
        for name, transform in self.transforms.items():
            potential_energy += transform.log_abs_det_jacobian(z_constrained[name], z[name]).sum()
        return potential_energy

My current understanding is that I should apply it in the final loop (after the jacobian adjustment?), something like:

potential_energy += apply_mass(transform.log_abs_det_jacobian(z_constrained[name],z[name])).sum()

Am I on the right track, or have I misunderstood?

LoganWalls on 14 May 2018

👍1

When applying the mass matrix during the potential energy calculation, where does it need to be applied (before / after scaling and adjustments)?

I probably am misunderstanding something - why do we need to modify the potential energy computation? I would have thought that this will only affect the kinetic energy (and hence also the distribution from which we are sampling the momenta).

neerajprad on 14 May 2018

👍2

I think you're right actually, it should only be applied to the kinetic energy. I was referring to Betancourt's paper but after reading your comment and looking back at Neal's I think I was misreading Betancourt's.

Apologies!

LoganWalls on 14 May 2018

I think that though two methods are equivalent, modifying kinetic energy is easier because we don't have to work on transformed position and don't have to deal with inverses of a large matrix (in case of using full covariance matrix and high dimensional). All we need is to use Welford's online scheme to update the inverse of the mass matrix (which is covariance/variance of positions), then use it to calculate the kinetic energy.

By the way, I am sure about why Stan devs modified variance this way. Maybe the estimated variance of positions in early steps is not stable so they have to scale it a bit to 0.001.

fehiepsi on 15 May 2018

That's great to hear! I've implemented Welford's scheme already actually, but I'm working on regularizing the estimate. I think I found stan's regularization term, but I'm not sure exactly how they've derived those values.

I did some reading on estimating shrinkage for covariance matrices and based on my (cursory) understanding Ledoit-Wolf and Oracle Approximating Shrinkage appear to be the most popular choices (though the latter assumes the data is gaussian).

Any thoughts on the best solution here (using stan's numbers, implementing one of the above, or another solution)?

LoganWalls on 15 May 2018

👍1

To me, go for Stan because its devs are well-known researchers working in this field. ;)

fehiepsi on 15 May 2018

😄1

On Mon, May 14, 2018 at 9:57 PM, Du Phan notifications@github.com wrote:

To me, go for Stan because its devs are well-known researchers working in
this field. ;)

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/uber/pyro/issues/1137#issuecomment-389042787, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACWGWGVipe9_NHz12vbT37CssxkkUWSAks5tymAygaJpZM4T71sc
.

hubayirp on 16 May 2018

Almost done! Two final issues:

Dense Mass Multiplication

I'm not sure about the most efficient way to implement the multiplication of the dense mass matrix:
Following the existing kinetic energy implementation I implemented the diagonal version using a comprehension like so:

0.5 * sum(x.pow(2).mul(variance[site_name]).sum() for site_name, x in iteritems(r))

Where variance is a dictionary: {'site_name': site_variance_estimate}.

For the dense version is it still preferable to multiply each site by a covariance matrix in a comprehension, or should the sites be unrolled into one big matrix and multiplied by a single covariance matrix? Unrolling feels like it would be more computationally efficient, but the dimensionality of that matrix will grow pretty fast. (Also I don't know about how much overhead the unrolling will add vs. a comprehension?)

Interaction with Step-Size Adaptation

While testing the diagonal mass matrix adaptation (using the baseball example) I'm not observing very good performance. As soon as the first mass estimate is obtained, step-size adaptation pushes the step-size to very small values (~1e-5), so everything becomes (unreasonably?) slow. I also observe this same behavior if I use the HMC kernel rather than NUTS. Any thoughts on this?

LoganWalls on 17 May 2018

Unrolling feels like it would be more computationally efficient, but...

fritzo on 17 May 2018

👍2

@fehiepsi @neerajprad
Given the discussion in #1203 , should the next PR be the changes to r and z calculations to incorporate the mass matrix? I'm happy to take an initial swing at it myself, but I think at some point I'd appreciate some help from one of you with that part.

LoganWalls on 10 Jul 2018

Given the discussion in #1203 , should the next PR be the changes to r and z calculations to incorporate the mass matrix? I'm happy to take an initial swing at it myself, but I think at some point I'd appreciate some help from one of you with that part.

Thanks, @LoganWalls. That sounds reasonable. I'll be happy to jump in anytime (if you could add me as a contributor to your fork, we can pair code too). You could also start with the other smaller interface changes you proposed in your original PR like changes to MCMC interface with regard to end_warmup, or fixed ordering for the sample sites so that the unrolled tensors have a deterministic ordering. @fehiepsi - what do you think?

neerajprad on 11 Jul 2018

@LoganWalls @neerajprad I will give detailed comments on your PR. To me, it is the remaining missing part (in HMC theory) which we have not implemented. It would be great to have!

fehiepsi on 11 Jul 2018

@LoganWalls @neerajprad Is there any new effort on this? If not, then I will try to implement this feature next week. :)

fehiepsi on 13 Oct 2018

@fehiepsi - That sounds great! Please feel free to start work on this, and let me know if you need help with any kind of refactoring to support this. I am working on parallel chains, but that shouldn't conflict with this.

neerajprad on 13 Oct 2018

👍1

@fehiepsi @neerajprad - Apologies that there hasn't been much movement here. I haven't had much time for personal projects lately. I'm happy to add you both as collaborators on my repository if you'd like to use any of the work that's been done there already.

LoganWalls on 13 Oct 2018

@LoganWalls You have done most of the work! I'll just revise your pull request #1178 a bit to make it better fit with our discussion in your implementation of Welford scheme. ;)

fehiepsi on 13 Oct 2018

Closed via #1442.

fehiepsi on 16 Oct 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

JIT for differentiable_loss

fehiepsi · 3Comments

multi-chain mcmc fails for PriorKernel

fehiepsi · 4Comments

Test failure of enum-parallel gradients after PyTorch #5776

neerajprad · 5Comments

Cannot run the example code due to unknown parameter

lundlab-kaltinel · 3Comments

Autoguide inference is incorrect for VonMises distribution

fritzo · 4Comments