Optuna: keras parallel optimization (n_jobs >1)

Created on 17 Jun 2019 · 4 Comments · Source: optuna/optuna

Conditions

Optuna version: 0.12.0
Python version: 3.6.4
OS: Mac 10.14.5
Machine Learning library to be optimized: keras

Code to reproduce
I tried to run this keras example https://github.com/pfnet/optuna/blob/master/examples/pruning/keras_integration.py
with n_jobs > 1

study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=100, n_jobs=4)

Error messages, stack traces, or logs

tensorflow.python.framework.errors_impl.InvalidArgumentError: Tensor dense_2_target:0, specified in either feed_devices or fetch_devices was not found in the Graph
Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x137fcb860>>
Traceback (most recent call last):
  File "/Users/fw/anaconda3/envs/catalyst/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1455, in __del__
    self._session._session, self._handle, status)
  File "/Users/fw/anaconda3/envs/catalyst/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No such callable handle: 140672063075296

What is the proper way to optimize Keras with n_jobs > 1? Many thanks.

Source

superluminance

All 4 comments

I think the issue is mainly caused by TensorFlow, which is the default backend of Keras. It requires special care when used with multi-threading because it keeps its session as a global variable.

In this case, a single TensorFlow session is shared by multiple trials. The trials all try to update the same computational graph in that session and end up corrupting it. To avoid this, we need to create a separate session for each trial.

I have two workarounds.

Use K.set_session()

We can specify the TensorFlow session with the K.set_session() function of Keras. To create a separate session for each trial, we can wrap the objective function as follows:

# Imports required by this snippet (tf.Session is a TensorFlow 1.x API).
import keras
from keras import backend as K
from keras.datasets import mnist
import optuna
import tensorflow as tf


def objective(trial):
    # Clear clutter from previous session graphs.
    # keras.backend.clear_session()  # comment out this line

    # The data is split between train and test sets.
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    ...

def objective_tf(trial):
    # Clear clutter from previous session graphs.
    keras.backend.clear_session()

    # Give each trial its own graph and session so that trials do not share state.
    with tf.Graph().as_default():
        with tf.Session() as sess:
            K.set_session(sess)
            return objective(trial)

if __name__ == '__main__':
    study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner())
    study.optimize(objective_tf, n_trials=4, n_jobs=4)
    ...

This workaround still has a minor problem. Although all trials completed successfully, I sometimes encountered the following exception:

tensorflow.python.framework.errors_impl.CancelledError: Session has been closed.

Create multiple studies

Instead of using the n_jobs option of Study.optimize, you can use the distributed optimization feature of Optuna to run trials in parallel.

All you need to do is share the storage and study_name of the Study object among multiple processes.
First, specify storage and study_name as follows:

if __name__ == '__main__':
    study = optuna.create_study(storage='sqlite:///foo.db', study_name='keras-parallel', direction='maximize', pruner=optuna.pruners.MedianPruner(), load_if_exists=True)
    study.optimize(objective, n_trials=100, n_jobs=1)

And then, please run studies from multiple processes as follows:

Process 1

$ python examples/pruning/keras_integration.py

Process 2

$ python examples/pruning/keras_integration.py

Trials in different processes do not share TensorFlow sessions, but they do share the study history through the storage.
We recommend this approach because it is simple and straightforward.

Reference

Graphs and Sessions

toshihikoyanase on 17 Jun 2019

@toshihikoyanase Thank you for your reply. I tried both methods and they work great, except for the CancelledError you mentioned. I've been running the db version with multiprocessing.Pool().
That solved my problem, so I am closing this issue.

superluminance on 18 Jun 2019

👍1

This worked for me!

# hide all deprecation warnings from tensorflow
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

import optuna
import gym
import numpy as np

from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.common.cmd_util import make_vec_env

def optimize_ppo2(trial):
    """ Learning hyperparameters we want to optimise"""
    return {
        'n_steps': int(trial.suggest_loguniform('n_steps', 16, 2048)),
        'gamma': trial.suggest_loguniform('gamma', 0.9, 0.9999),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1.),
        'ent_coef': trial.suggest_loguniform('ent_coef', 1e-8, 1e-1),
        'cliprange': trial.suggest_uniform('cliprange', 0.1, 0.4),
        'noptepochs': int(trial.suggest_loguniform('noptepochs', 1, 48)),
        'lam': trial.suggest_uniform('lam', 0.8, 1.)
    }


def optimize_agent(trial):
    """ Train the model and optimize
        Optuna minimises the objective by default, so we
        need to negate the reward here
    """
    model_params = optimize_ppo2(trial)
    # env = DummyVecEnv([lambda: gym.make('CartPole-v1') for i in range(n_cpu)])
    env = make_vec_env(lambda: gym.make('CartPole-v1'), n_envs=16, seed=0)
    model = PPO2('MlpPolicy', env, verbose=0, nminibatches=1, **model_params)
    model.learn(10000)
    mean_reward, _ = evaluate_policy(model, gym.make('CartPole-v1'), n_eval_episodes=10)

    return -1 * mean_reward

if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(optimize_agent, n_trials=100, n_jobs=4)

josiahcoad on 20 Jun 2020

👍2

This no longer seems to work consistently with TF 2.x. Is there support for parallel trials with TF 2.x, or is there another workaround?