The results of MultivariateNormal.batch_log_pdf can not be differentiated with respect to the covariance matrix yet if normalized=False. I am opening this issue to keep the discussion regrading this issue alive. See #651 for the details on this.
cc @fritzo
I have started to implement this using a autograd.Function as you suggested. I think I will be able to finish it rather soon. I am posting this code as WIP but it seems to be working already:
class _NormalizationConstant(Function):
@staticmethod
def forward(ctx, sigma, sigma_cholesky, inverse, dimension, return_zero):
ctx.save_for_backward(inverse)
results = torch.Tensor(
[torch.log(sigma_cholesky.diag()).sum() + (dimension / 2) * np.log(2 * np.pi)]).type_as(sigma_cholesky)
return results if not return_zero else torch.zeros(1).type_as(sigma_cholesky)
@staticmethod
def backward(ctx, grad_output):
inverse, = ctx.saved_variables
grad = 2 * inverse - torch.diag(torch.diag(inverse))
return grad_output * grad, None, None, None, None
I figured that passing the inverse would be a good idea since it is always being computed right now anyway and the procedure is rather expensive, so doing it twice would really be a waste of time.
However differentiation wrt. to the Cholesky decomposition does not work yet, but adding that gradient should not really be much of a problem (or so I think at least).
If you agree with this approach I will open a pull request as soon as I have implemented a gradient wrt. to sigma_cholesky.
Your implementation seems reasonable to me. However I would recommend branching on return_zero outside of _NormalizationConstant, since I would guess that torch.log(sigma_cholesky.diag()).sum() ... leads to a more numerically stable gradient than the on you compute in _NormalizationConstant.
I don鈥檛 exactly follow, why should there be any difference in results? If I see it correctly the two expressions are exactly the same.
What I was imagining was
class MultivariateNormal(Distribution):
def batch_log_pdf(self):
...
if self.normalized:
# Gradient is computed automatically by PyTorch.
normalization_factor = (torch.log(self.sigma_cholesky.diag()).sum() +
(self.mu.size()[0] / 2) * np.log(2 * np.pi))
else:
# Gradient is computed manually via _NormalizationConstant.backward()
normalization_factor = _NormalizationConstant(self.sigma, self.sigma_cholesky, inverse)
...
It seems cleaner to let PyTorch compute the gradient if it already knows how, and only override PyTorch behavior if you're doing something especially tricky.
I see your point. For the moment however I have found a more serious issue with the gradients. There are large differences between the gradients that autograd comes up with and the ones from the analytic expressions. This seems to be because of the additional structure of the covariance matrix, namely that it is symmetric. So obviously Sigma_ij and Sigma_ji are not independent variables, but autograd knows nothing about that (and I do not know of any way to pass this information to autograd). For example if one evaluates e.g.
torch.log(torch.potrf(sigma).diag()).sum().backward() autograd returns
Variable containing:
3.2004 0.0000 0.0000 0.0000
-4.9622 2.8139 0.0000 0.0000
1.1925 -1.1044 0.9342 0.0000
0.0089 -0.5959 -0.9779 0.6011
[torch.FloatTensor of size 4x4]
as the gradient of sigma, which is true but does not take the structure of sigma into account. Using sigma.inverse() - 0.5 * torch.diag(torch.diag(sigma.inverse())) ones gets
Variable containing:
3.2004 -4.9622 1.1925 0.0089
-4.9622 2.8139 -1.1044 -0.5959
1.1925 -1.1044 0.9342 -0.9779
0.0089 -0.5959 -0.9779 0.6011
[torch.FloatTensor of size 4x4].
In my opinion the latter answer is better for we know that a covariance matrix must be symmetric. One possibility would be to always compute the gradients using the analytic expression, the other to use an expression that ignores the symmetry of sigma for the normalized=False case. I would now be interested to hear your opinion on this matter before I proceed in either way.
That's a great analysis, and a convincing argument to use your original version. One more thing we should check is the sigma_tril.grad after running .backward(), in case the user is parameterizing the distribution via sigma_tril rather than sigma. My guess is that it is the same in both versions and thus we should use your computed gradient.
Thanks, in that case I will use the analytic versions. For the gradient wrt. the cholesky decomposition you are right, they are already equal out of the box.
Would you have any suggestions as to where I should put a test that ensures that the gradients are the same regardless of the value of normalized? Or do you think that such a test is unnecessary?
I would recommend a new file test/distributions/tests_multivariate_normal.py. There is already precedent for custom tests e.g. test/distributions/test_categorical.py
BTW @tbrx is working on a torch.distributions.MultivateNormal in https://github.com/probtorch/pytorch/pull/52 and we'd eventually like Pyro to simply wrap the PyTorch version. Would you mind if we move some of your code upstream to the PyTorch version (e.g. the nice *_compat helpers) and merge these two implementations?
I wouldn't mind that at all, it would be great to have multivariate normal distribution as part of standrd PyTorch. I did already look at that implementation, but I thought that that would still take some time because of the missing batched Cholesky decomposition etc., so I figured at least for the moment it is still worth working on this implementation here.
So as you can see I just opened a pull request, but the lint failed. That seems to be due to some Python 2 compatibility stuff, I will fix that tomorrow (as it rather late in my current timezone). I would like to keep this issue open none the less because that problem with the gradients made me realize that it might be worth it to also make the gradient of the rest of log_pdf aware of the symmetry of the covariance matrix. And also of course of sigma_cholesky being an upper-triangular matrix. I am going to change the title to reflect that new purpose.