Pyro: Implement variational GP where pseudo inputs are learned using SVI

Created on 24 Jan 2018  Â·  19Comments  Â·  Source: pyro-ppl/pyro

@fehiepsi 's first version of Gaussian Processes uses the full dataset. For large scale problems it is cheaper to use s small set of inducing points. Let's try to learn these points this using Pyro.

Tasks

enhancement

Most helpful comment

After reading a few references in the blog post suggested by @eb8680, I find that in most non-Gaussian likelihood cases, it is unable to obtain the exact formula for the posterior as in the current implementation of GPR and SGPR. Hence, I will follow the approach that @eb8680 suggested in https://github.com/uber/pyro/issues/702#issuecomment-361780381. It is called Sparse Variational GP. My plan is:

  • In the upcoming weeks, I will implement both Variational GP and its sparse version. On the way, I will introduce the Likelihood module.
  • After that, I think that we already have enough models to start refactoring our implementation. For example, there are some problems with the current implementation: how to put constraints on parameters of models; how to define priors for parameters in case the kernel is constructed from various sub-kernels,...
  • Add more kernels, likelihoods, mean functions.
  • Test HMC inference method.
  • Implement other models such as GPLVM / deep GP?

All 19 comments

cc @ysaatchi @fehiepsi

I'm searching for simple tasks to begin contributing to Pyro. I'd be happy
to add more kernels; should I start with Matern 3/2 and 5/2/

Best,
Shannon Sequeira

On Tue, Jan 23, 2018 at 7:08 PM, Fritz Obermeyer notifications@github.com
wrote:

cc @ysaatchi https://github.com/ysaatchi @fehiepsi
https://github.com/fehiepsi

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/uber/pyro/issues/702#issuecomment-359986399, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AJspM7R-hkqlIY82u2BK7m_nCS6xldnEks5tNoKRgaJpZM4RqkTs
.

Yes, that's a good place to start! Other ones could include linear, periodic, neural network and spline.
Another family which is super useful in practice are the "meta kernels", e.g. additive (kernel is sum of constituent kernels), product (kernel is element-wise product of constituent kernels). Thanks!

I will take care of implement the spare version of GPR.

@fehiepsi your setup is very conducive to VGP, you can add pseudo-inputs and targets to your hyperpriors dict and use SVI to learn them via pyro machinery (I hope!)

@ysaatchi The spare models require using "Woodbury matrix identity" to reduce cost to inverse matrices and calculate logdet. I have written down all calculations for SpareGPR with DTC and FITC assumptions. Implement them is quite straightforward and similar to the current GPR implementation.

However, for SpareGPR with VFE, there is an additional regularizer term Trace(K_theta) (formula (2.33) in Damianou's thesis). One option is to add a callback function to SVI to add this additional loss to ELBO. A better option is to write guide in a suitable way (I am getting stuck at this point) such that ELBO will have this additional term.

I am starting to implement them and hope to have a preliminary version this weekend so we can start to discuss more.

Interesting, I think your intuition RE finding a good guide makes sense. Let me think about it and get back to you.

One option is to add a callback function to SVI to add this additional loss to ELBO. A better option is to write guide in a suitable way (I am getting stuck at this point) such that ELBO will have this additional term.

You can hack terms into the ELBO via the model by adding an auxiliary Bernoulli variable with logits corresponding to the term and observing it to be 1; cf #568.

For what it's worth, I think this blog post and its accompanying references provide a much clearer and simpler exposition of variational sparse GPs than the section in that thesis. You can translate it directly into Pyro code for arbitrary likelihoods, and once #688 is implemented it should work pretty well. Here's a now-obsolete GP classification example I wrote in #166 which could easily be refactored to analytically integrate out f for a Gaussian likelihood:

def model(x):
    cov = gaussian_kernel(x, x)
    f = pyro.sample("f", dist.normalchol,
                    Variable(torch.zeros(x.size(0))),
                    torch.potrf(cov, False))
    y = pyro.sample("y", dist.bernoulli, torch.sigmoid(f))
    return y


def sparse_guide(x, n_z=10, lam=1e-5):
    z = pyro.param("z", Variable(torch.randn(n_z, x.size(1))*x.data.std() + x.data.mean(),
                                 requires_grad=True))  # inducing points
    m = pyro.param("m", Variable(torch.randn(n_z, 1), requires_grad=True))
    sigma = pyro.param("sigma", Variable(torch.randn(n_z),
                                         requires_grad=True))
    kxz = gaussian_kernel(x, z)
    kzx = gaussian_kernel(z, x)
    kxx = gaussian_kernel(x, x)
    kzz = gaussian_kernel(z, z) + Variable(torch.eye(n_z) * lam)
    kzz_inv = torch.inverse(kzz)  # torch.potri(torch.potrf(kzz), False)
    mean = torch.mm(torch.mm(kxz, kzz_inv), m).squeeze()
    cov = kxx - torch.mm(torch.mm(kxz,
                                  kzz_inv - torch.mm(kzz_inv,
                                                     torch.mm(F.softplus(torch.diag(sigma)),
                                                              kzz_inv))),
                         kzx)
    cov = cov + Variable(torch.eye(cov.size(0)) * lam)
    f = pyro.sample("f", dist.normalchol, mean, torch.potrf(cov, False))
    return f

@eb8680 In sparse approximation, from what I understand, we should avoid working with matrices of sizes num_data x num_data as much as possible. The target is to reduce the cost of learning and inference to O(num_data x num_inducing^2). Even in 'VFE' method, the theoretical part is to learn the posterior distribution q(u) ~ p(u | y). However, we can derive the optimal solution for q (using calculus of variation) and thanks to it, we can use this 'exact' optimal form in the model. Computation of loglikelihood is similar to the traditional 'DTC' method plus a trace term. I guess that for classification problem, we can also follow the same approach with additional care for the likelihood (the reference is Hensman's.

Thank you very much for your suggestion on modifying ELBO. I have come to the solution to wrap Multivariate Normal Distribution and add an additional trace term to the log_prob. The wrapper is also useful for dealing with covariance matrices of forms D + Wt.W where D is diagonal and W is a low-rank matrix of size num_inducing x num_data.

Regarding KL, #688 appears to be easy to implement via #729 and some new math for MVN

@torch.distributions.kl.register_kl(MultivariateNormal, MultivariateNormal)
def _kl_mvn_mvn(p, q):
    ...some math here...

@eb8680 @fritzo For the next step, I will try to implement Classification models. The current implementation for regression might not be ported easily. Implementing them will help me understand your comments better (e.g. how to use KL,...).

@ssequeira Adding more kernels will definitely be very helpful. Matern 3/2, 5/2, and RBF all belong to a family of stationary covariance functions, so they are good starting points. This notebook might be helpful for you to get an overview of kernels: http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_kernels.ipynb

After reading a few references in the blog post suggested by @eb8680, I find that in most non-Gaussian likelihood cases, it is unable to obtain the exact formula for the posterior as in the current implementation of GPR and SGPR. Hence, I will follow the approach that @eb8680 suggested in https://github.com/uber/pyro/issues/702#issuecomment-361780381. It is called Sparse Variational GP. My plan is:

  • In the upcoming weeks, I will implement both Variational GP and its sparse version. On the way, I will introduce the Likelihood module.
  • After that, I think that we already have enough models to start refactoring our implementation. For example, there are some problems with the current implementation: how to put constraints on parameters of models; how to define priors for parameters in case the kernel is constructed from various sub-kernels,...
  • Add more kernels, likelihoods, mean functions.
  • Test HMC inference method.
  • Implement other models such as GPLVM / deep GP?

how to put constraints on parameters of models

The new torch.distributions library handles constraints and knows how to transform constrained parameters to unconstrained spaces and back. Pyro will use this in HMC #740 and eventually also in SVI. Let me know if you have any questions or feature requests.

@fritzo : In current implementation of kernel module, variance, lengthscale are Parameters. I want to put constraints on them so they will not get non-positive values when learning. Can I achieve it easily using constraints from Pytorch?

By the way, I have another question. I want to define a Parameter which corresponds to a lower triangular matrix. There are two ways come to my head. The first way is to define a 1D vector then converts it to a lower triangular matrix. I see one of your comments in a relevant topic here. The second way is to define a 2D tensor x then call x.tril(). Pytorch has LowerCholeskyTransform but I cannot figure out how to use it. Could you please help me solve this problem?

@fehiepsi The usual way to do constrained optimization in Pyro is to transform unconstrained parameters, e.g.

variance = torch.exp(
    pyro.param("log_variance", Variable(torch.zeros(10), requires_grad=True)))
lengthscale = torch.exp(
    pyro.param("log_lengthscale", Variable(torch.zeros(10), requires_grad=True)))
x = LowerCholeskyTransform()(
    pyro.param("unconstrained_x", Variable(torch.zeros(10, 10), requires_grad=True)))

Note that LowerCholeskyTransform first calls torch.tril and then exponentiates the diagonal to avoid singular matrices (so that x is a valid Cholesky decomposition).

PyTorch 0.4 introduces a generic way to do this using transform_to():

from torch.distributions import constraints, transform_to

variance = transform_to(constraints.positive)(
    pyro.param("unconstrained_variance", ...))
lengthscale = transform_to(constraints.positive)(
    pyro.param("unconstrained_lengthscale", ...))
x = transform_to(constraints.lower_cholesky)(
    pyro.param("unconstrained_x", ...))

In your case it's probably simpler to name the transforms. In the future I'd like to use transform_to() to provide more first-class support for constraints in pyro.param, something like

# possible future syntax:
variance = pyro.param("variance", Variable(...), constrain_to=constraints.positive)

@fritzo The third version is exactly what I want. While waiting for it, when starting to refactor GP module, I will use the second version. Thanks a lot!

Just want to comment that with https://github.com/pytorch/pytorch/pull/6648, we have completed the requirements of this issue. So we can close this when that pull request is merged and we update things to pytorch master.

Closed because we have gpu torch.trtrs in pytorch master now and we don't use torch.potrs in GP module.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

neerajprad picture neerajprad  Â·  4Comments

tristandeleu picture tristandeleu  Â·  3Comments

martinjankowiak picture martinjankowiak  Â·  3Comments

neerajprad picture neerajprad  Â·  4Comments

robsalomone picture robsalomone  Â·  4Comments