This is an issue related to https://github.com/facebook/Ax/issues/228, https://github.com/facebook/Ax/issues/99, https://github.com/facebook/Ax/issues/308 and maybe some others as well, but I'm more interested in two things:
Using the service API, my experiment is set up as follows:
ax_client.create_experiment(
    name="test_dct_slpEnrg",
    parameters=[
        {
            "name": "w1",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0e-1, 1.0e2],
        },
        {
            "name": "w2",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0, 1.0e2],
        },
        {
            "name": "w3",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0e-3, 1.0],
        },
        {
            "name": "w4",
            "type": "range",
            "value_type": "int",
            "bounds": [10, 20],
        },
        {
            "name": "w5",
            "type": "range",
            "value_type": "int",
            "bounds": [2, 20],
        },
    ],
    objective_name="Tc2_slpEnrg",
    minimize=True,
    parameter_constraints=["w4 >= w5", "w2 - w1 >= 0"],
    outcome_constraints=["slp_speed <= 3", "engn_trq >= 0.001"],
    choose_generation_strategy_kwargs={
        "num_initialization_trials": num_init,
        "winsorize_botorch_model": True,
        "winsorization_limits": (0.0, 0.3),
    },
)
The sampled parameters are passed to an evaluation function that internally runs an optimization routine. That routine either converges and outputs valid Tc2_slpEnrg, slp_speed, and engn_trq values, or fails to converge. A valid result is indicated by slp_speed <= 3, which I have also added as an outcome_constraint. I was unsure how to deal with parameter values that are 'invalid (non-convergence)', as discussed in https://github.com/facebook/Ax/issues/372.
Currently, my approach is as follows. During the initial Sobol steps, I use abandon_trial for parameter values that do not converge. After the Sobol steps, to discourage the model from sampling near parameters that turned out to be invalid, I set the objective to 3000, a value that is not extreme but is very unlikely to occur normally.
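The handling described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `report_result`, `PENALTY`, and the `in_sobol_phase` flag are hypothetical names, the caller is assumed to know which phase a trial belongs to, and `ax_client` is assumed to be an `AxClient` whose `abandon_trial` and `complete_trial` methods accept the keyword arguments shown.

```python
PENALTY = 3000.0  # penalty objective for non-converged runs (value from the post)


def report_result(ax_client, trial_index, results, in_sobol_phase):
    """Report one evaluation to Ax (hypothetical helper, not part of Ax).

    `results` maps metric names to floats. A run is treated as converged
    when slp_speed <= 3, mirroring the outcome constraint.
    """
    converged = results["slp_speed"] <= 3.0
    if not converged and in_sobol_phase:
        # Sobol phase: drop the trial so it never reaches the model.
        ax_client.abandon_trial(trial_index=trial_index)
        return "abandoned"
    if not converged:
        # Model-based phase: keep the trial but penalize the objective.
        results = dict(results, Tc2_slpEnrg=PENALTY)
    # Each metric is reported as (mean, SEM); an SEM of 0.0 marks it as noiseless.
    raw_data = {name: (value, 0.0) for name, value in results.items()}
    ax_client.complete_trial(trial_index=trial_index, raw_data=raw_data)
    return "completed"
```

Keeping the decision in one helper makes it easy to switch between abandoning and penalizing when comparing the two strategies.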
I think this is the main reason the instability occurs: nearby values can be very noisy, and the objective can jump between roughly 1000 and 3000 despite very small changes in the parameters. This is why I'd like to sample from a small neighborhood around the generated trial parameters and return the mean as the value. I'm unsure whether Ax supports this or whether I would need to set it up through BoTorch.
However, I have also tried abandoning these parameter values during the GPEI step, and I still ran into these errors, so I am unsure what the actual issue is or how to resolve it.
Here is a snippet of the trace when the error occurs. Note that I periodically output the best parameter values so far, since the run fails completely when this RuntimeError occurs:
[INFO 09-04 11:32:32] ax.service.ax_client: Generated new trial 630 with parameters {'w1': 19.57, 'w2': 72.48, 'w3': 0.77, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:33] ax.service.ax_client: Completed trial 630 with data: {'Tc2_slpEnrg': (1087.14, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.51, 0.0), 'engn_trq': (12.7, 0.0)}.
Completed 125 of 500 trials
[INFO 09-04 11:32:36] ax.service.ax_client: Generated new trial 631 with parameters {'w1': 49.14, 'w2': 59.68, 'w3': 0.27, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:37] ax.service.ax_client: Completed trial 631 with data: {'Tc2_slpEnrg': (1082.28, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.46, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 126 of 500 trials
[INFO 09-04 11:32:40] ax.service.ax_client: Generated new trial 632 with parameters {'w1': 37.7, 'w2': 59.59, 'w3': 0.09, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:32:42] ax.service.ax_client: Completed trial 632 with data: {'Tc2_slpEnrg': (1084.8, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.39, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 127 of 500 trials
[INFO 09-04 11:32:45] ax.service.ax_client: Generated new trial 633 with parameters {'w1': 71.5, 'w2': 82.59, 'w3': 0.27, 'w4': 20, 'w5': 15}.
Did not converge: (3.869655369315524, 0.0). Setting value to 3000
[INFO 09-04 11:32:48] ax.service.ax_client: Completed trial 633 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.87, 0.0), 'engn_trq': (12.63, 0.0)}.
Completed 128 of 500 trials
[INFO 09-04 11:32:51] ax.service.ax_client: Generated new trial 634 with parameters {'w1': 45.01, 'w2': 66.33, 'w3': 0.15, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:52] ax.service.ax_client: Completed trial 634 with data: {'Tc2_slpEnrg': (1072.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.86, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 129 of 500 trials
[INFO 09-04 11:32:56] ax.service.ax_client: Generated new trial 635 with parameters {'w1': 53.86, 'w2': 58.84, 'w3': 0.06, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:57] ax.service.ax_client: Completed trial 635 with data: {'Tc2_slpEnrg': (1087.49, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (1.98, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 130 of 500 trials
[INFO 09-04 11:33:00] ax.service.ax_client: Generated new trial 636 with parameters {'w1': 43.28, 'w2': 67.07, 'w3': 0.29, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:33:01] ax.service.ax_client: Completed trial 636 with data: {'Tc2_slpEnrg': (1083.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.41, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 131 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15} {'slp_speed': 2.3737070108366964, 'engn_trq': 12.243898660289364, 'Tc2_slpEnrg': 1030.4849595920462, 'max_abs_Jerk': 4.059934646446776}
Completed 131 of 500 trials
[INFO 09-04 11:33:04] ax.service.ax_client: Generated new trial 637 with parameters {'w1': 27.6, 'w2': 64.84, 'w3': 0.2, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:05] ax.service.ax_client: Completed trial 637 with data: {'Tc2_slpEnrg': (1075.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.97, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 132 of 500 trials
[INFO 09-04 11:33:09] ax.service.ax_client: Generated new trial 638 with parameters {'w1': 44.71, 'w2': 62.99, 'w3': 0.25, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:10] ax.service.ax_client: Completed trial 638 with data: {'Tc2_slpEnrg': (1089.75, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.05, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 133 of 500 trials
[INFO 09-04 11:33:13] ax.service.ax_client: Generated new trial 639 with parameters {'w1': 20.52, 'w2': 70.79, 'w3': 0.8, 'w4': 17, 'w5': 9}.
Did not converge: (3.0800594621335904, 0.0). Setting value to 3000
[INFO 09-04 11:33:14] ax.service.ax_client: Completed trial 639 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.08, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 134 of 500 trials
[INFO 09-04 11:33:18] ax.service.ax_client: Generated new trial 640 with parameters {'w1': 36.74, 'w2': 60.21, 'w3': 0.43, 'w4': 15, 'w5': 9}.
Did not converge: (86.95642479778826, 0.0). Setting value to 3000
[INFO 09-04 11:33:18] ax.service.ax_client: Completed trial 640 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (2.26, 0.0), 'slp_speed': (86.96, 0.0), 'engn_trq': (70.0, 0.0)}.
Completed 135 of 500 trials
[INFO 09-04 11:33:22] ax.service.ax_client: Generated new trial 641 with parameters {'w1': 13.41, 'w2': 66.27, 'w3': 0.18, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:33:23] ax.service.ax_client: Completed trial 641 with data: {'Tc2_slpEnrg': (1073.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.88, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 136 of 500 trials
[INFO 09-04 11:33:27] ax.service.ax_client: Generated new trial 642 with parameters {'w1': 28.77, 'w2': 66.15, 'w3': 0.22, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:28] ax.service.ax_client: Completed trial 642 with data: {'Tc2_slpEnrg': (1074.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.98, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 137 of 500 trials
[INFO 09-04 11:33:32] ax.service.ax_client: Generated new trial 643 with parameters {'w1': 25.92, 'w2': 67.46, 'w3': 0.79, 'w4': 17, 'w5': 9}.
Did not converge: (3.105967542260089, 0.0). Setting value to 3000
[INFO 09-04 11:33:33] ax.service.ax_client: Completed trial 643 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.11, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 138 of 500 trials
[INFO 09-04 11:33:37] ax.service.ax_client: Generated new trial 644 with parameters {'w1': 37.94, 'w2': 66.57, 'w3': 0.32, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:38] ax.service.ax_client: Completed trial 644 with data: {'Tc2_slpEnrg': (1086.47, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.48, 0.0), 'engn_trq': (12.61, 0.0)}.
Completed 139 of 500 trials
[INFO 09-04 11:33:41] ax.service.ax_client: Generated new trial 645 with parameters {'w1': 38.19, 'w2': 65.75, 'w3': 0.19, 'w4': 18, 'w5': 9}.
[INFO 09-04 11:33:43] ax.service.ax_client: Completed trial 645 with data: {'Tc2_slpEnrg': (1072.85, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.8, 0.0), 'engn_trq': (12.71, 0.0)}.
Completed 140 of 500 trials
[INFO 09-04 11:33:47] ax.service.ax_client: Generated new trial 646 with parameters {'w1': 37.22, 'w2': 65.46, 'w3': 0.23, 'w4': 17, 'w5': 8}.
[INFO 09-04 11:33:47] ax.service.ax_client: Completed trial 646 with data: {'Tc2_slpEnrg': (1085.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.56, 0.0), 'engn_trq': (12.53, 0.0)}.
Completed 141 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15} {'slp_speed': 2.373707491338672, 'engn_trq': 12.243896324181325, 'Tc2_slpEnrg': 1030.4849537960213, 'max_abs_Jerk': 4.059934784356625}
Completed 141 of 500 trials
Traceback (most recent call last):
File "/home/mlab/gitRepo/cvt_opt/cvt_bayes_opt/dct_service_debug.py", line 170, in <module>
trial_params, trial_index = ax_client.get_next_trial()
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 275, in get_next_trial
trial = self.experiment.new_trial(generator_run=self._gen_new_generator_run())
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 865, in _gen_new_generator_run
experiment=self.experiment
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/generation_strategy.py", line 376, in gen
keywords=get_function_argument_names(model.gen),
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/base.py", line 626, in gen
model_gen_options=model_gen_options,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/array.py", line 238, in _gen
target_fidelities=target_fidelities,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/torch.py", line 260, in _model_best_point
target_fidelities=target_fidelities,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 458, in best_point
target_fidelities=target_fidelities,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch_defaults.py", line 353, in recommend_best_observed_point
options=model_gen_options,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 296, in best_observed_point
options=options,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 399, in best_in_sample_point
f, cov = as_array(model.predict(X_obs))
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 314, in predict
return self.model_predictor(model=self.model, X=X) # pyre-ignore [28]
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/utils.py", line 454, in predict_from_model
posterior = model.posterior(X)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/botorch/models/gpytorch.py", line 301, in posterior
mvn = self(X)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_gp.py", line 328, in __call__
predictive_mean, predictive_covar = self.prediction_strategy.exact_prediction(full_mean, full_covar)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 302, in exact_prediction
self.exact_predictive_mean(test_mean, test_train_covar),
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 320, in exact_predictive_mean
res = (test_train_covar @ self.mean_cache.unsqueeze(-1)).squeeze(-1)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
add_to_cache(self, cache_name, method(self, *args, **kwargs))
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 269, in mean_cache
mean_cache = train_train_covar.inv_matmul(train_labels_offset).squeeze(-1)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 934, in inv_matmul
return func.apply(self.representation_tree(), False, right_tensor, *self.representation())
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 47, in forward
solves = _solve(lazy_tsr, right_tensor)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 11, in _solve
return lazy_tsr._cholesky()._cholesky_solve(rhs)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
add_to_cache(self, cache_name, method(self, *args, **kwargs))
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 414, in _cholesky
cholesky = psd_safe_cholesky(evaluated_mat).contiguous()
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 48, in psd_safe_cholesky
raise e
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 25, in psd_safe_cholesky
L = torch.cholesky(A, upper=upper, out=out)
RuntimeError: cholesky_cpu: For batch 2: U(99,99) is zero, singular U.
Please let me know what your thoughts are about my problem and how I should proceed. Thanks!
I have 3 suggestions to start, but I'll need to follow up after more research:
Ok, thank you for the suggestions.
What exactly do you mean by "add robustness"?
Also, for repeated sampling: if I am using GP+EI to generate a sample with the get_best_parameter() function and I don't add any additional trial data, will it simply return the same sample? Or is there some randomness/noise in the sample that is returned?
Does Ax provide facilities for getting a "noisy" sample, which would help with repeated sampling?
Thanks.
@jangkj09
When I say add robustness or repeated sampling, I mean that you'll have to add them to the evaluation function manually. Your evaluation function can add a different small amount of noise to the parameters each time, calculate the result, and then return the average of all successful results. This technically changes the function you're evaluating slightly, but our algorithms will handle the change just fine.
We don't have any utilities that help with this right now, and we have no near-term plans to add them.
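A minimal sketch of that kind of wrapper, written in plain Python since Ax has no utility for it. Everything here is an assumption for illustration: `smoothed_evaluate`, `n_samples`, and `rel_noise` are hypothetical names, the wrapped `evaluate` is assumed to return a dict of metrics or `None` on non-convergence, and integer parameters are left unperturbed.

```python
import random


def smoothed_evaluate(evaluate, params, bounds, n_samples=5, rel_noise=0.01, seed=None):
    """Average `evaluate` over small random perturbations of `params`.

    evaluate: callable taking a parameter dict and returning a dict of
        metric values, or None when the run does not converge
    bounds: {name: (low, high)} used to scale and clip the perturbations
    Returns the per-metric mean over successful runs, or None if none succeed.
    """
    rng = random.Random(seed)
    successes = []
    for _ in range(n_samples):
        perturbed = {}
        for name, value in params.items():
            low, high = bounds[name]
            if isinstance(value, int):
                # Leave integer parameters (e.g. w4, w5) untouched.
                perturbed[name] = value
            else:
                # Shift by up to rel_noise of the range, then clip to the bounds.
                step = rng.uniform(-rel_noise, rel_noise) * (high - low)
                perturbed[name] = min(high, max(low, value + step))
        result = evaluate(perturbed)
        if result is not None:
            successes.append(result)
    if not successes:
        return None
    return {key: sum(r[key] for r in successes) / len(successes) for key in successes[0]}
```

Note that the perturbed points are only clipped to the box bounds; a point near a parameter constraint such as "w2 - w1 >= 0" could step outside it, so the wrapper may need an extra check for that case.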
Closing this and moving discussion to #228, since the ways to address the issue are mostly the same in the two cases.
@lena-kashtelyan @2timesjay Thanks for all your help! I will open other issues if I run into them. Thanks!