Ax: How to handle MaxParallelismReachedException?

Created on 30 Mar 2020 · 2 Comments · Source: facebook/Ax

Hi, I got the error below using Ax-platform version 0.1.10.

Traceback (most recent call last):
File "ray_run.py", line 222, in <module>
main()
File "ray_run.py", line 212, in main
scheduler=scheduler,
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 324, in run
runner.step()
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 330, in step
next_trial = self._get_next_trial() # blocking
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 409, in _get_next_trial
self._update_trial_queue(blocking=wait_for_trial)
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 674, in _update_trial_queue
trials = self._search_alg.next_trials()
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ray/tune/suggest/suggestion.py", line 62, in next_trials
for trial in self._trial_generator:
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ray/tune/suggest/suggestion.py", line 83, in _generate_trials
suggested_config = self._suggest(trial_id)
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ray/tune/suggest/ax.py", line 70, in _suggest
parameters, trial_index = self._ax.get_next_trial()
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ax/service/ax_client.py", line 281, in get_next_trial
trial = self.experiment.new_trial(generator_run=self._gen_new_generator_run())
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ax/service/ax_client.py", line 870, in _gen_new_generator_run
experiment=self.experiment
File "/home/youngmin/anaconda3/lib/python3.7/site-packages/ax/modelbridge/generation_strategy.py", line 293, in gen
step=self._curr, num_running=num_running
ax.modelbridge.generation_strategy.MaxParallelismReachedException: Maximum parallelism for generation step #1 (GPEI) has been reached: 3 trials are currently 'running'. Some trials need to be completed before more trials can be generated. See https://ax.dev/docs/bayesopt.html to understand why limited parallelism improves performance of Bayesian optimization.

With Ax-platform version 0.1.9, I get no error. My code is below. What should I do?

```python
import ray
import torch
from ax.service.ax_client import AxClient
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.ax import AxSearch

# ray_search_spaces and TrainIncome are not shown in this snippet.


def main():
    from income.config import data_config, simulator_config, training_config, model_config

    num_gpus = torch.cuda.device_count()

    # test
    max_concurrent = 8
    gpu_share = 0.25
    cpu_share = 1
    num_total_trials = 250

    train_max_epochs = training_config['train_max_epochs']
    auto_save_least_interval = training_config['auto_save_least_interval']
    #ray.init(address='192.168.0.2:6389', redis_password='5241590000000000')
    ray.init(log_to_driver=False)

    assert ray.is_initialized() is True

    # 1. Define hyperparameter search algorithm
    client = AxClient(enforce_sequential_optimization=False)
    client.create_experiment(parameters=ray_search_spaces, objective_name=training_config['val_metric'], minimize=True)
    search_algo = AxSearch(client, max_concurrent=max_concurrent)

    # 2. Define scheduler
    scheduler = AsyncHyperBandScheduler(time_attr='epoch',
                                        metric=training_config['val_metric'],
                                        mode='min',
                                        max_t=train_max_epochs,
                                        grace_period=auto_save_least_interval,
                                        reduction_factor=3)
    # The scheduler is disabled for this run; this overrides the assignment above.
    scheduler = None

    # 3. Run Ray tune
    resources = {'gpu': gpu_share, 'cpu': cpu_share}
    analysis = tune.run(TrainIncome,
                        name='ax',
                        stop={"epoch": train_max_epochs, "early_stopping": 1},
                        resources_per_trial=resources,
                        num_samples=num_total_trials,  # number of total trials
                        checkpoint_freq=1,
                        checkpoint_at_end=True,
                        search_alg=search_algo,
                        scheduler=scheduler,
                        resume=True,
                        )

    print("Best config is:", analysis.get_best_config(metric=training_config['val_metric']))

    ray.shutdown()
    assert ray.is_initialized() is False


if __name__ == "__main__":
    main()
```

Labels: documentation · fixready · question

joyoungmin712

Most helpful comment

Hi @joyoungmin712, great question! The release notes for our latest version introduce this exception and mention, at a high level, how to handle it; I'll elaborate on those options here. The right one depends on how important higher concurrency and a large number of trials are in a given use case (versus lower concurrency and fewer trials).

That exception exists because limiting the number of running trials during the Bayesian optimization phase of the experiment benefits the performance of our modeling stack and allows the optimization to find optimal results in fewer trials. So the best way to handle the exception is to limit the max_concurrent value to 3, as requested in the MaxParallelismReachedException message.
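To make the check concrete, here is a toy model of it in plain Python. It is only a sketch: `ToyStep`, `launch`, and the stand-in exception class are hypothetical names invented for illustration and are not Ax classes. The "raise once the number of running trials reaches the cap" rule is inferred from the error message above, not taken from the Ax source.

```python
class MaxParallelismReached(Exception):
    """Stand-in for Ax's MaxParallelismReachedException (illustration only)."""


class ToyStep:
    """Hypothetical model of a generation step with a parallelism cap."""

    def __init__(self, max_parallelism: int) -> None:
        self.max_parallelism = max_parallelism

    def gen(self, num_running: int) -> str:
        # Refuse to generate once the cap is reached, as the error message describes.
        if num_running >= self.max_parallelism:
            raise MaxParallelismReached(
                f"{num_running} trials are currently running "
                f"(limit {self.max_parallelism})"
            )
        return f"trial-{num_running}"


def launch(step: ToyStep, max_concurrent: int) -> int:
    """Try to start max_concurrent trials at once; return how many were started."""
    running = 0
    for _ in range(max_concurrent):
        step.gen(num_running=running)
        running += 1
    return running


step = ToyStep(max_parallelism=3)
print(launch(step, max_concurrent=3))  # stays within the cap

try:
    launch(step, max_concurrent=8)  # asks for more than the cap allows
except MaxParallelismReached as err:
    print("raised:", err)
```

In this model, requesting 3 concurrent trials never trips the cap, while requesting 8 fails as soon as a fourth trial is asked for. That mirrors why lowering max_concurrent to 3 makes the error go away.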

If there is a good reason to run 8 trials concurrently and you are running a lot of trials (which it seems like you are, with 250), then you can prevent the exception by passing a manual GenerationStrategy to AxClient on instantiation. You would need to mimic the output of choose_generation_strategy, but with a different max_parallelism value for the Bayesian optimization stage, like so:

client = AxClient(
    generation_strategy=GenerationStrategy(
        steps=[
            # kwargs as here: https://github.com/facebook/Ax/blob/master/ax/modelbridge/dispatch_utils.py#L181-L195
            _make_sobol_step(...),
            _make_botorch_step(..., max_parallelism=max_concurrent),
        ],
    )
)

With all this said, we do need to think about how to allow higher max parallelism in a more convenient way for use cases like Ax+Ray, for instance by applying enforce_sequential_optimization=False to max parallelism as well as to the cases where more data is needed to proceed to the next model. Some more convenient handling for this will be included in the next version!

lena-kashtelyan on 30 Mar 2020

👍2

All 2 comments

Added a section on how to handle this exception to the bottom of the Service API tutorial in the freshly released stable version of Ax: https://ax.dev/tutorials/gpei_hartmann_service.html. There is more general information about parallelism in Service API in the tutorial now, too.
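For a hand-written trial loop (as opposed to one driven by Ray Tune's AxSearch), one way to react to the exception is to wait for a running trial to finish and then ask again, since the message says some trials must complete before more can be generated. The sketch below is illustrative only and is not taken from the tutorial. `next_trial_with_retry`, the fake callbacks, and the stand-in exception class are hypothetical. In real code you would catch Ax's own MaxParallelismReachedException and pass the client's get_next_trial, which the traceback above shows returning a (parameters, trial_index) pair.

```python
from typing import Callable, Tuple


class MaxParallelismReached(Exception):
    """Stand-in for Ax's MaxParallelismReachedException (illustration only)."""


def next_trial_with_retry(
    get_next_trial: Callable[[], Tuple[dict, int]],
    wait_for_completion: Callable[[], None],
    max_attempts: int = 5,
) -> Tuple[dict, int]:
    """Call get_next_trial, waiting for running trials whenever the cap is hit."""
    for _ in range(max_attempts):
        try:
            return get_next_trial()
        except MaxParallelismReached:
            # Block until at least one running trial has been completed.
            wait_for_completion()
    raise RuntimeError(f"no trial could be generated after {max_attempts} attempts")


# Fake client state for demonstration: three trials are already running.
running = [0, 1, 2]


def fake_get_next_trial() -> Tuple[dict, int]:
    if len(running) >= 3:
        raise MaxParallelismReached("3 trials are currently running")
    index = 3
    running.append(index)
    return {"lr": 0.01}, index


def fake_wait_for_completion() -> None:
    running.pop(0)  # pretend the oldest running trial has finished


print(next_trial_with_retry(fake_get_next_trial, fake_wait_for_completion))
```

The retry count is bounded so that a loop whose trials never complete fails loudly instead of spinning forever.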

lena-kashtelyan on 17 Apr 2020

👍1
