Ax: [Question] Odd Behavior for simple GP/BO

Created on 5 Nov 2019  路  5Comments  路  Source: facebook/Ax

I'm trying to fit a Guassian Process to a simple polynomial without noise using Ax's get_botorch function, but I'm seeing some unexpected behavior. In certain cases, the GP fails to fit to the points, even though I am using an ExactGP with a noiseless Expected Improvement acquisition function. I've modified the Using a custom botorch model with Ax tutorial to evaluate the function y = x1**2 instead of the Branin function, and to use the (analytic) ExpectedImprovement acquisition function instead of the qNoisyExpectedImprovement acquisition function, since I thought that might be what's causing the issue. I've attached plots of several past outcomes (generated by calling render(plot_slice(...)), and I'd be happy to post my code if necessary.
newplot
newplot(1)

All 5 comments

cc @Balandat, @eytan, @bletham

@kwh5484, can you please provide your code for this? This could be a number of issues (possibly related to normalization/standardization of data), but it's hard to tell without a reproducible example.

Sure thing. Here's the code I used.

from botorch.models.gpytorch import GPyTorchModel
from gpytorch.distributions import MultivariateNormal
from gpytorch.means import ConstantMean, ZeroMean
from gpytorch.models import ExactGP
from gpytorch.kernels import ScaleKernel, MaternKernel
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.mlls import ExactMarginalLogLikelihood
from ax import ParameterType, RangeParameter, SearchSpace
from ax import SimpleExperiment
from ax.modelbridge import get_sobol
from ax.modelbridge.factory import get_botorch
import torch
from ax.utils.notebook.plotting import render
from ax.plot.slice import plot_slice
from botorch.fit import fit_gpytorch_model
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from botorch.models.model import Model
from torch import Tensor
from typing import Any, Callable, Dict, List, Optional, Tuple
from botorch.acquisition.acquisition import AcquisitionFunction
import gpytorch

class SimpleCustomGP(ExactGP, GPyTorchModel):

    def __init__(self, train_X, train_Y):

        #squeeze output dim before passing train_Y to ExactGP
        super().__init__(train_X, train_Y.squeeze(-1), GaussianLikelihood())
        self.mean_module = ConstantMean()
        self.covar_module = ScaleKernel(base_kernel=MaternKernel())
        self.to(train_X)    #make sure we're on the right device/dtype

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

#does parameter fitting for SimpleCustomGP
def _get_and_fit_simple_custom_gp(Xs, Ys, **kwargs):                
    model = SimpleCustomGP(Xs[0], Ys[0])
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_model(mll)
    return model

def simple_eval(parameterization, *args):
    x1 = parameterization["x1"]
    y = x1**2
    return {"result": (y, 0.0)}

def get_EI(
    model: Model,
    objective_weights: Tensor,
    outcome_constraints: Optional[Tuple[Tensor, Tensor]] = None,
    X_observed: Optional[Tensor] = None,
    X_pending: Optional[Tensor] = None,
    **kwargs: Any,
) -> AcquisitionFunction:
    return ExpectedImprovement(model=model, best_f=-1.0, maximize=False)

def get_optimizer(
    acq_function: AcquisitionFunction,
    bounds: Tensor,
    n: int,
    inequality_constraints: Optional[List[Tuple[Tensor, Tensor, float]]] = None,
    fixed_features: Optional[Dict[int, float]] = None,
    rounding_func: Optional[Callable[[Tensor], Tensor]] = None,
    **kwargs: Any,
) -> Tensor:
    num_restarts: int = kwargs.get("num_restarts", 20)
    raw_samples: int = kwargs.get("num_raw_samples", 50 * num_restarts)
    fn, _ = optimize_acqf(acq_function=acq_function, bounds=bounds, q=n, num_restarts=num_restarts, raw_samples=raw_samples, options={})
    return fn

simple_search_space = SearchSpace(parameters=[RangeParameter(name="x1", parameter_type=ParameterType.INT, lower=-5, upper=15),])
simple_exp = SimpleExperiment(name="test_1d", search_space=simple_search_space, evaluation_function=simple_eval, objective_name="result", minimize=True)

simple_sobol = get_sobol(simple_exp.search_space)
simple_exp.new_batch_trial(generator_run=simple_sobol.gen(1))


for i in range(10):
    print(f"Running optimization batch {i+1}/10...")

    model = get_botorch(experiment=simple_exp, 
                        data=simple_exp.eval(), 
                        search_space=simple_exp._search_space, 
                        model_constructor=_get_and_fit_simple_custom_gp, 
                        acqf_constructor=get_EI, 
                        acqf_optimizer=get_optimizer)

    batch = simple_exp.new_trial(generator_run=model.gen(1))

    render(plot_slice(model, "x1", "result"))
    print("batch: ", batch)

Thanks.

So I ran this a few times in a notebook and was able to reproduce this. If there are only 2-3 data points, the behavior that the model produces a constant fit with high variance is to be expected -- there is just not enough information in 3 data points for a non-parametric model with a Matern Kernel to make much sense of it.

The fact that in your example this occurs also for > 3 data points is most likely due to the fact that you do not put a prior on the lengthscales of your Kernel. As a result, a priori any lengthscale is equally likely to the model. However, Ax by default normalizes the inputs to the unit cube [0, 1]^d, so lengthscales > 1 are pretty meaningless.

That's why the SingleTaskGP BoTorch model Ax uses by default puts a prior on the lengthscales with very little probability mass greater than 1. If you do the same thing, that is, add a lengthscale_prior=GammaPrior(3.0, 6.0) arg to your MaternKernel, I don't see this behavior anymore for n > 3 data points in your example.

newplot

More generally speaking, if you have prior knowledge that your function has a particular structure (e.g. is a polynomial of some degree, monotonic, periodic, ...), using a different, more parametric Kernel (e.g. a polynomial Kernel) would be the way to go, and would result in more interpretable fits (assuming that the function indeed satisfies the structural assumptions).

I hope this helps.

Great, thanks for your help!

Was this page helpful?
0 / 5 - 0 ratings