Ax: Debugging RuntimeError: cholesky_cpu: For batch 2: U(77,77) is zero, singular U.

Created on 4 Sep 2020 · 5 Comments · Source: facebook/Ax

This issue is related to https://github.com/facebook/Ax/issues/228, https://github.com/facebook/Ax/issues/99, https://github.com/facebook/Ax/issues/308 and maybe some others as well, but I'm more interested in two things:

1. Is there a way to debug what is causing the failure? I think it is related to ill-conditioning of the underlying GP model, but I'm not sure how to confirm this.
2. Given a set of parameters for a trial generated from the model, is there a way to sample repeatedly in the neighborhood and return a mean and variance for the trial?

Using the service API, my experiment is set up as follows:

ax_client.create_experiment(
    name="test_dct_slpEnrg",
    parameters=[
        {
            "name": "w1",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0e-1, 1.0e2],
        },
        {
            "name": "w2",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0, 1.0e2],
        },
        {
            "name": "w3",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0e-3, 1.0],
        },
        {
            "name": "w4",
            "type": "range",
            "value_type": "int",
            "bounds": [10, 20],
        },
        {
            "name": "w5",
            "type": "range",
            "value_type": "int",
            "bounds": [2, 20],
        },
    ],
    objective_name="Tc2_slpEnrg",
    minimize=True,
    parameter_constraints=["w4 >= w5", "w2 - w1 >= 0"],
    outcome_constraints=["slp_speed <= 3", "engn_trq >= 0.001"],
    choose_generation_strategy_kwargs={
        "num_initialization_trials": num_init,
        "winsorize_botorch_model": True,
        "winsorization_limits": (0.0, 0.3),
    },
)

The sampled parameters are passed to an evaluation function that internally runs an optimization routine, which either converges and outputs valid Tc2_slpEnrg, slp_speed, and engn_trq values or fails to converge. A valid result is indicated by slp_speed <= 3, which I have also set as an outcome_constraint. I was unsure how to deal with parameter values that are 'invalid' (non-convergent), as discussed in https://github.com/facebook/Ax/issues/372.

Currently, my approach is as follows. During the initial Sobol steps, I use abandon_trial for values which do not converge. After the Sobol steps, to discourage the model from sampling near parameters that turned out to be invalid, I set the objective to 3000, a value that is not excessively high but is very unlikely to occur normally.

I think this is the main reason why the instability is occurring: nearby values can be very noisy, and the objective can jump between roughly 1000 and 3000 despite very small changes in the parameters. This is why I'd like to sample from a small neighborhood around the generated trial parameters and compute a mean to return as the value. I'm unsure if Ax supports this feature or if it's something I would need to set up through BoTorch.

However, I have also tried to abandon these parameter values (during the GPEI step) and I would still run into these errors, so I am unsure what the actual issue is and how to resolve it.

Here is a snippet of the trace when the error occurs. Note that I am periodically outputting the best parameter values so far, since the run fails completely when this RuntimeError occurs:

[INFO 09-04 11:32:32] ax.service.ax_client: Generated new trial 630 with parameters {'w1': 19.57, 'w2': 72.48, 'w3': 0.77, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:33] ax.service.ax_client: Completed trial 630 with data: {'Tc2_slpEnrg': (1087.14, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.51, 0.0), 'engn_trq': (12.7, 0.0)}.
Completed 125 of 500 trials
[INFO 09-04 11:32:36] ax.service.ax_client: Generated new trial 631 with parameters {'w1': 49.14, 'w2': 59.68, 'w3': 0.27, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:37] ax.service.ax_client: Completed trial 631 with data: {'Tc2_slpEnrg': (1082.28, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.46, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 126 of 500 trials
[INFO 09-04 11:32:40] ax.service.ax_client: Generated new trial 632 with parameters {'w1': 37.7, 'w2': 59.59, 'w3': 0.09, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:32:42] ax.service.ax_client: Completed trial 632 with data: {'Tc2_slpEnrg': (1084.8, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.39, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 127 of 500 trials
[INFO 09-04 11:32:45] ax.service.ax_client: Generated new trial 633 with parameters {'w1': 71.5, 'w2': 82.59, 'w3': 0.27, 'w4': 20, 'w5': 15}.
Did not converge: (3.869655369315524, 0.0). Setting value to 3000
[INFO 09-04 11:32:48] ax.service.ax_client: Completed trial 633 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.87, 0.0), 'engn_trq': (12.63, 0.0)}.
Completed 128 of 500 trials
[INFO 09-04 11:32:51] ax.service.ax_client: Generated new trial 634 with parameters {'w1': 45.01, 'w2': 66.33, 'w3': 0.15, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:52] ax.service.ax_client: Completed trial 634 with data: {'Tc2_slpEnrg': (1072.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.86, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 129 of 500 trials
[INFO 09-04 11:32:56] ax.service.ax_client: Generated new trial 635 with parameters {'w1': 53.86, 'w2': 58.84, 'w3': 0.06, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:57] ax.service.ax_client: Completed trial 635 with data: {'Tc2_slpEnrg': (1087.49, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (1.98, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 130 of 500 trials
[INFO 09-04 11:33:00] ax.service.ax_client: Generated new trial 636 with parameters {'w1': 43.28, 'w2': 67.07, 'w3': 0.29, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:33:01] ax.service.ax_client: Completed trial 636 with data: {'Tc2_slpEnrg': (1083.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.41, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 131 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15}  {'slp_speed': 2.3737070108366964, 'engn_trq': 12.243898660289364, 'Tc2_slpEnrg': 1030.4849595920462, 'max_abs_Jerk': 4.059934646446776}
Completed 131 of 500 trials
[INFO 09-04 11:33:04] ax.service.ax_client: Generated new trial 637 with parameters {'w1': 27.6, 'w2': 64.84, 'w3': 0.2, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:05] ax.service.ax_client: Completed trial 637 with data: {'Tc2_slpEnrg': (1075.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.97, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 132 of 500 trials
[INFO 09-04 11:33:09] ax.service.ax_client: Generated new trial 638 with parameters {'w1': 44.71, 'w2': 62.99, 'w3': 0.25, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:10] ax.service.ax_client: Completed trial 638 with data: {'Tc2_slpEnrg': (1089.75, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.05, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 133 of 500 trials
[INFO 09-04 11:33:13] ax.service.ax_client: Generated new trial 639 with parameters {'w1': 20.52, 'w2': 70.79, 'w3': 0.8, 'w4': 17, 'w5': 9}.
Did not converge: (3.0800594621335904, 0.0). Setting value to 3000
[INFO 09-04 11:33:14] ax.service.ax_client: Completed trial 639 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.08, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 134 of 500 trials
[INFO 09-04 11:33:18] ax.service.ax_client: Generated new trial 640 with parameters {'w1': 36.74, 'w2': 60.21, 'w3': 0.43, 'w4': 15, 'w5': 9}.
Did not converge: (86.95642479778826, 0.0). Setting value to 3000
[INFO 09-04 11:33:18] ax.service.ax_client: Completed trial 640 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (2.26, 0.0), 'slp_speed': (86.96, 0.0), 'engn_trq': (70.0, 0.0)}.
Completed 135 of 500 trials
[INFO 09-04 11:33:22] ax.service.ax_client: Generated new trial 641 with parameters {'w1': 13.41, 'w2': 66.27, 'w3': 0.18, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:33:23] ax.service.ax_client: Completed trial 641 with data: {'Tc2_slpEnrg': (1073.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.88, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 136 of 500 trials
[INFO 09-04 11:33:27] ax.service.ax_client: Generated new trial 642 with parameters {'w1': 28.77, 'w2': 66.15, 'w3': 0.22, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:28] ax.service.ax_client: Completed trial 642 with data: {'Tc2_slpEnrg': (1074.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.98, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 137 of 500 trials
[INFO 09-04 11:33:32] ax.service.ax_client: Generated new trial 643 with parameters {'w1': 25.92, 'w2': 67.46, 'w3': 0.79, 'w4': 17, 'w5': 9}.
Did not converge: (3.105967542260089, 0.0). Setting value to 3000
[INFO 09-04 11:33:33] ax.service.ax_client: Completed trial 643 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.11, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 138 of 500 trials
[INFO 09-04 11:33:37] ax.service.ax_client: Generated new trial 644 with parameters {'w1': 37.94, 'w2': 66.57, 'w3': 0.32, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:38] ax.service.ax_client: Completed trial 644 with data: {'Tc2_slpEnrg': (1086.47, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.48, 0.0), 'engn_trq': (12.61, 0.0)}.
Completed 139 of 500 trials
[INFO 09-04 11:33:41] ax.service.ax_client: Generated new trial 645 with parameters {'w1': 38.19, 'w2': 65.75, 'w3': 0.19, 'w4': 18, 'w5': 9}.
[INFO 09-04 11:33:43] ax.service.ax_client: Completed trial 645 with data: {'Tc2_slpEnrg': (1072.85, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.8, 0.0), 'engn_trq': (12.71, 0.0)}.
Completed 140 of 500 trials
[INFO 09-04 11:33:47] ax.service.ax_client: Generated new trial 646 with parameters {'w1': 37.22, 'w2': 65.46, 'w3': 0.23, 'w4': 17, 'w5': 8}.
[INFO 09-04 11:33:47] ax.service.ax_client: Completed trial 646 with data: {'Tc2_slpEnrg': (1085.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.56, 0.0), 'engn_trq': (12.53, 0.0)}.
Completed 141 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15}  {'slp_speed': 2.373707491338672, 'engn_trq': 12.243896324181325, 'Tc2_slpEnrg': 1030.4849537960213, 'max_abs_Jerk': 4.059934784356625}
Completed 141 of 500 trials


Traceback (most recent call last):
  File "/home/mlab/gitRepo/cvt_opt/cvt_bayes_opt/dct_service_debug.py", line 170, in <module>
    trial_params, trial_index = ax_client.get_next_trial()
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 275, in get_next_trial
    trial = self.experiment.new_trial(generator_run=self._gen_new_generator_run())
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 865, in _gen_new_generator_run
    experiment=self.experiment
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/generation_strategy.py", line 376, in gen
    keywords=get_function_argument_names(model.gen),
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/base.py", line 626, in gen
    model_gen_options=model_gen_options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/array.py", line 238, in _gen
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/torch.py", line 260, in _model_best_point
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 458, in best_point
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch_defaults.py", line 353, in recommend_best_observed_point
    options=model_gen_options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 296, in best_observed_point
    options=options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 399, in best_in_sample_point
    f, cov = as_array(model.predict(X_obs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 314, in predict
    return self.model_predictor(model=self.model, X=X)  # pyre-ignore [28]
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/utils.py", line 454, in predict_from_model
    posterior = model.posterior(X)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/botorch/models/gpytorch.py", line 301, in posterior
    mvn = self(X)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_gp.py", line 328, in __call__
    predictive_mean, predictive_covar = self.prediction_strategy.exact_prediction(full_mean, full_covar)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 302, in exact_prediction
    self.exact_predictive_mean(test_mean, test_train_covar),
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 320, in exact_predictive_mean
    res = (test_train_covar @ self.mean_cache.unsqueeze(-1)).squeeze(-1)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
    add_to_cache(self, cache_name, method(self, *args, **kwargs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 269, in mean_cache
    mean_cache = train_train_covar.inv_matmul(train_labels_offset).squeeze(-1)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 934, in inv_matmul
    return func.apply(self.representation_tree(), False, right_tensor, *self.representation())
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 47, in forward
    solves = _solve(lazy_tsr, right_tensor)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 11, in _solve
    return lazy_tsr._cholesky()._cholesky_solve(rhs)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
    add_to_cache(self, cache_name, method(self, *args, **kwargs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 414, in _cholesky
    cholesky = psd_safe_cholesky(evaluated_mat).contiguous()
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 48, in psd_safe_cholesky
    raise e
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 25, in psd_safe_cholesky
    L = torch.cholesky(A, upper=upper, out=out)
RuntimeError: cholesky_cpu: For batch 2: U(99,99) is zero, singular U.
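
A zero pivot in the Cholesky factorization means the GP's training covariance matrix is numerically singular. One common cause (not necessarily the only one here) is training points that sit almost on top of each other while the reported noise is 0.0, as in the log above. The following is a minimal sketch of a check for such near-duplicates. The helper names and the tolerance are illustrative assumptions and not part of Ax; the bounds and the three trials are copied from this post.

```python
import itertools
import math

# Bounds copied from the create_experiment call in this post.
BOUNDS = {
    "w1": (1.0e-1, 1.0e2),
    "w2": (1.0, 1.0e2),
    "w3": (1.0e-3, 1.0),
    "w4": (10, 20),
    "w5": (2, 20),
}


def to_unit_cube(params):
    """Scale each parameter to [0, 1] so distances are comparable across axes."""
    return [(params[k] - lo) / (hi - lo) for k, (lo, hi) in BOUNDS.items()]


def near_duplicates(trials, tol):
    """Return (index_a, index_b, distance) for trial pairs closer than tol."""
    pairs = []
    for (i, a), (j, b) in itertools.combinations(trials.items(), 2):
        ua, ub = to_unit_cube(a), to_unit_cube(b)
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(ua, ub)))
        if dist < tol:
            pairs.append((i, j, dist))
    return sorted(pairs, key=lambda p: p[2])


# Three trials taken from the log above.
trials = {
    633: {"w1": 71.5, "w2": 82.59, "w3": 0.27, "w4": 20, "w5": 15},
    637: {"w1": 27.6, "w2": 64.84, "w3": 0.2, "w4": 16, "w5": 9},
    642: {"w1": 28.77, "w2": 66.15, "w3": 0.22, "w4": 16, "w5": 9},
}

# tol=0.05 is an arbitrary illustrative threshold in unit-cube distance.
close = near_duplicates(trials, tol=0.05)
for i, j, dist in close:
    print(f"trials {i} and {j} are {dist:.4f} apart in the unit cube")
```

Run over the full trial history, a long list of close pairs, especially pairs where one trial reports about 1080 and its neighbor reports 3000, would support the ill-conditioning hypothesis.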

Please let me know what your thoughts are about my problem and how I should proceed. Thanks!

question

jangkj09

All 5 comments

I have 3 suggestions to start, but I'll need to follow up after more research:

1. Definitely do not use this "3000" dummy value. The GP will have a very hard time fitting it. Unfortunately, we don't have a systematic way to deal with failed trials like this, except for...
2. Add robustness, retries, repeated sampling, or other strategies to the evaluation function directly. If you followed the service API tutorial, https://ax.dev/tutorials/gpei_hartmann_service.html#3.-Define-how-to-evaluate-trials defines the "evaluate" function, which can accommodate any process you want for retrying evaluations and reducing variance (but we don't offer any utilities to help with this directly).
3. If #1 + #2 reduce the number of trials you have to run and the number of non-convergent evaluations, you should be less likely to encounter Cholesky errors.
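
As a rough illustration of the first two suggestions, here is a minimal sketch of a retry wrapper that returns None when every attempt fails, so no dummy objective value ever reaches the model. The names evaluate_with_retries and flaky_routine are hypothetical, the metric values are placeholders, and the wrapper assumes the underlying routine can behave differently between attempts (for example with a different initial guess).

```python
from typing import Callable, Dict, Optional

Params = Dict[str, float]
Metrics = Dict[str, float]


def evaluate_with_retries(
    params: Params,
    run_once: Callable[[Params, int], Optional[Metrics]],
    max_attempts: int = 3,
) -> Optional[Metrics]:
    """Call run_once up to max_attempts times and return the first success.

    run_once receives the attempt index so it can vary its seed or initial
    guess. None is returned when every attempt fails, leaving it to the
    caller to mark the trial as failed instead of reporting a dummy value.
    """
    for attempt in range(max_attempts):
        result = run_once(params, attempt)
        if result is not None:
            return result
    return None


def flaky_routine(params: Params, attempt: int) -> Optional[Metrics]:
    # Hypothetical stand-in for the real optimization routine:
    # it "converges" only from the second attempt onward.
    if attempt < 1:
        return None
    # Placeholder metric values, not real results.
    return {"Tc2_slpEnrg": 1080.0, "slp_speed": 2.5, "engn_trq": 12.6}


result = evaluate_with_retries({"w1": 19.57, "w2": 72.48}, flaky_routine)
print(result)
```

When the wrapper returns None, the calling loop could abandon the trial, as the original post already does during the Sobol steps, rather than completing it with made-up data.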

2timesjay on 8 Sep 2020

👍2

Ok, thank you for the suggestions.
What exactly do you mean by "add robustness"?

Also, for repeated sampling: if I am using GP+EI to generate a sample with the get_best_parameter() function and I don't add any additional trial data, will it simply return the same sample? Or is there some randomness/noise in the sample that is returned?

Does Ax provide a facility for getting a "noisy" sample, which would help with repeated sampling?

Thanks.

jangkj09 on 9 Sep 2020

@jangkj09

When I say add robustness or repeated sampling, I mean you'll have to add them to the evaluate function manually. Your evaluation function can add a different small amount of noise to the parameters on each run, calculate the result, and then return the average of all successful results. This technically changes the function you're evaluating slightly, but our algorithms will handle the change just fine.
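
A minimal sketch of that idea, under stated assumptions: evaluate_neighborhood and toy_routine are hypothetical names, the Gaussian jitter scale and sample count are arbitrary, and the returned (mean, SEM) pairs assume that the second element of the tuples in the logged trial data is the standard error.

```python
import random
import statistics

# Bounds copied from the experiment definition in the original post.
BOUNDS = {
    "w1": (1.0e-1, 1.0e2),
    "w2": (1.0, 1.0e2),
    "w3": (1.0e-3, 1.0),
    "w4": (10, 20),
    "w5": (2, 20),
}


def evaluate_neighborhood(params, run_once, bounds, rel_noise=0.01, n_samples=5, seed=0):
    """Average run_once over jittered copies of params.

    Each parameter gets Gaussian noise with standard deviation
    rel_noise * (upper - lower) and is clipped to its bounds; integer
    parameters are rounded. Failed runs (None) are skipped. Returns
    {metric: (mean, sem)}, or None if no run succeeded.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        jittered = {}
        for name, value in params.items():
            lo, hi = bounds[name]
            moved = min(max(value + rng.gauss(0.0, rel_noise * (hi - lo)), lo), hi)
            jittered[name] = round(moved) if isinstance(value, int) else moved
        out = run_once(jittered)
        if out is not None:
            results.append(out)
    if not results:
        return None
    summary = {}
    for metric in results[0]:
        vals = [r[metric] for r in results]
        # With a single success the SEM is unknown, so report None.
        sem = statistics.stdev(vals) / len(vals) ** 0.5 if len(vals) > 1 else None
        summary[metric] = (statistics.mean(vals), sem)
    return summary


def toy_routine(p):
    # Hypothetical stand-in for the real simulation: a smooth objective
    # that "fails to converge" (returns None) when w3 is above 0.75.
    if p["w3"] > 0.75:
        return None
    return {"Tc2_slpEnrg": 1000.0 + p["w1"], "slp_speed": 2.0 + p["w3"]}


point = {"w1": 30.0, "w2": 60.0, "w3": 0.25, "w4": 18, "w5": 9}
summary = evaluate_neighborhood(point, toy_routine, BOUNDS)
print(summary)
```

Note that the jitter can push a point outside the parameter constraints ("w4 >= w5", "w2 - w1 >= 0"), so a real version may need to re-check them before each run. Reporting a non-zero SEM, rather than the 0.0 seen in the log, also tells the model that the observations are noisy.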

We don't have any utilities that help with this right now and have no near-term plans to add them.

2timesjay on 22 Sep 2020

👍1

Closing this and moving discussion to #228, since the ways to address the issue are mostly the same in the two cases.