RLlib converges slowly on a simple environment compared to the same algorithms in other libraries under the same conditions (see the results below). Is this expected, or is there something I overlooked when running the program?
The environment itself is trivial (the reward is simply the sum of all the continuous actions), yet the RLlib algorithms take a fairly long time to converge. I have tried different things, such as adjusting the learning rate and other parameters, but none of them meaningfully changed the very gradual and slow learning process.

System information:
Ubuntu 18.04
Python == 3.7.7
Ray version == 0.8.4
Tensorflow version == 1.15.0
PyTorch version == 1.4.0
i7-8750H (6 physical cores)
Nvidia GeForce 1050 Ti (CUDA 10.1)
Steps to reproduce the issue:
import numpy as np
import pandas as pd
from csv import writer
import itertools

import gym
from gym import spaces

import ray
from ray.tune.registry import register_env
import ray.rllib.agents.ppo as ppo
import ray.rllib.agents.impala as impala
from ray.tune.logger import pretty_print


class TestEnv(gym.Env):

    def __init__(self, dimensions=1000, observations=100):
        self.action_space = spaces.Box(low=0, high=1, shape=(dimensions, ))
        self.observation_space = spaces.Box(low=0, high=1, shape=(observations, ))
        self.obs = np.random.rand(observations)

    def step(self, action):
        reward = np.sum(action)
        observation = self.obs
        done = True
        return observation, reward, done, {}

    def reset(self):
        observation = self.obs
        return observation


# Base settings
num_steps = 100
save_paths = {"impala": "./logs/impala_evaluations",
              "ppo": "./logs/ppo_evaluations"}

# Initialisation
ray.init()
register_env("TestEnv", lambda config: TestEnv())  # register custom environment

for algorithm in ["ppo", "impala"]:

    # Initialise algos
    if algorithm == "impala":
        config = impala.DEFAULT_CONFIG.copy()
    elif algorithm == "ppo":
        config = ppo.DEFAULT_CONFIG.copy()

    # Setup config to be similar to the stable baselines
    config["num_gpus"] = 1
    config["num_workers"] = 6
    config["model"]["fcnet_hiddens"] = [64, 64]

    if algorithm == "impala":
        trainer = impala.ImpalaTrainer(config=config, env="TestEnv")
    elif algorithm == "ppo":
        config["use_pytorch"] = True  # Cannot align GPU
        trainer = ppo.PPOTrainer(config=config, env="TestEnv")

    first_save = True
    log_path = save_paths[algorithm]

    for i in range(num_steps):
        # Perform one iteration of training the policy
        result = trainer.train()

        # Output results
        print("Step: {}, Reward: {}, Max reward: {}, Timesteps: {}, Time since start: {}".format(
            i,
            result["episode_reward_mean"],
            result["episode_reward_max"],
            result["timesteps_total"],
            result["time_total_s"]))

        # Write csv
        csv_rows = [result["timesteps_total"], result["episode_reward_mean"], result["time_total_s"]]
        if first_save:
            first_save = False
            with open("{}.csv".format(log_path), "w") as fp:
                csv_writer = writer(fp)
                csv_writer.writerow(["Timesteps", "Mean_reward", "Time"])
                csv_writer.writerow(csv_rows)
        else:
            with open("{}.csv".format(log_path), "a+") as fp:
                csv_writer = writer(fp)
                csv_writer.writerow(csv_rows)

        if i % 100 == 0:
            checkpoint = trainer.save()
            print("checkpoint saved at", checkpoint)
For comparison, here is the equivalent stable-baselines (PPO2) script run under the same conditions:
import gym
from gym import spaces

from stable_baselines import DQN, PPO2
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines.common import set_global_seeds

# EvalCallback requirements
import numpy as np
import os
import warnings
from stable_baselines.common.callbacks import EventCallback, BaseCallback
from typing import Union, List, Dict, Any, Optional
from stable_baselines.common.vec_env import VecEnv, sync_envs_normalization
import time
from csv import writer


class EvalCallback(EventCallback):
    """
    Callback for evaluating an agent.

    :param eval_env: (Union[gym.Env, VecEnv]) The environment used for evaluation
    :param callback_on_new_best: (Optional[BaseCallback]) Callback to trigger
        when there is a new best model according to the `mean_reward`
    :param n_eval_episodes: (int) The number of episodes to test the agent
    :param eval_freq: (int) Evaluate the agent every eval_freq call of the callback.
    :param log_path: (str) Path to a folder where the evaluations (`evaluations.npz`)
        will be saved. It will be updated at each evaluation.
    :param best_model_save_path: (str) Path to a folder where the best model
        according to performance on the eval env will be saved.
    :param deterministic: (bool) Whether the evaluation should
        use stochastic or deterministic actions.
    :param render: (bool) Whether to render the environment during evaluation
    :param verbose: (int)
    """

    def __init__(self, eval_env: Union[gym.Env, VecEnv],
                 callback_on_new_best: Optional[BaseCallback] = None,
                 n_eval_episodes: int = 5,
                 eval_freq: int = 10000,
                 log_path: str = None,
                 best_model_save_path: str = None,
                 deterministic: bool = True,
                 render: bool = False,
                 verbose: int = 1):
        super(EvalCallback, self).__init__(callback_on_new_best, verbose=verbose)
        self.n_eval_episodes = n_eval_episodes
        self.eval_freq = eval_freq
        self.best_mean_reward = -np.inf
        self.last_mean_reward = -np.inf
        self.deterministic = deterministic
        self.render = render
        self.start_time = time.time()
        self.results = []
        self.first_save = True

        # Convert to VecEnv for consistency
        if not isinstance(eval_env, VecEnv):
            eval_env = DummyVecEnv([lambda: eval_env])

        assert eval_env.num_envs == 1, "You must pass only one environment for evaluation"

        self.eval_env = eval_env
        self.best_model_save_path = best_model_save_path
        # Logs will be written in `evaluations.npz`
        if log_path is not None:
            log_path = os.path.join(log_path, 'evaluations')
        self.log_path = log_path
        self.evaluations_results = []
        self.evaluations_timesteps = []
        self.evaluations_length = []

    def _init_callback(self):
        # Does not work in some corner cases, where the wrapper is not the same
        if not type(self.training_env) is type(self.eval_env):
            warnings.warn("Training and eval env are not of the same type"
                          "{} != {}".format(self.training_env, self.eval_env))

        # Create folders if needed
        if self.best_model_save_path is not None:
            os.makedirs(self.best_model_save_path, exist_ok=True)
        if self.log_path is not None:
            os.makedirs(os.path.dirname(self.log_path), exist_ok=True)

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Sync training and eval env if there is VecNormalize
            sync_envs_normalization(self.training_env, self.eval_env)

            episode_rewards, episode_lengths = evaluate_policy(self.model, self.eval_env,
                                                               n_eval_episodes=self.n_eval_episodes,
                                                               render=self.render,
                                                               deterministic=self.deterministic,
                                                               return_episode_rewards=True)

            if self.log_path is not None:
                self.evaluations_timesteps.append(self.num_timesteps)
                self.evaluations_results.append(episode_rewards)
                self.evaluations_length.append(episode_lengths)
                np.savez(self.log_path, timesteps=self.evaluations_timesteps,
                         results=self.evaluations_results, ep_lengths=self.evaluations_length)

            mean_reward, std_reward = np.mean(episode_rewards), np.std(episode_rewards)
            mean_ep_length, std_ep_length = np.mean(episode_lengths), np.std(episode_lengths)
            # Keep track of the last evaluation, useful for classes that derive from this callback
            self.last_mean_reward = mean_reward

            # Write to csv file
            time_since_start = time.time() - self.start_time
            csv_rows = [self.num_timesteps, mean_reward, std_reward, time_since_start]
            if self.first_save:
                self.first_save = False
                with open("{}.csv".format(self.log_path), "w") as fp:
                    csv_writer = writer(fp)
                    csv_writer.writerow(["Timesteps", "Mean_reward", "Std_reward", "Time"])
                    csv_writer.writerow(csv_rows)
            else:
                with open("{}.csv".format(self.log_path), "a+") as fp:
                    csv_writer = writer(fp)
                    csv_writer.writerow(csv_rows)

            if self.verbose > 0:
                print("Eval num_timesteps={}, "
                      "episode_reward={:.2f} +/- {:.2f}, "
                      "Episode length: {:.2f} +/- {:.2f}, "
                      "Time since start: {:.0f}".format(self.num_timesteps, mean_reward, std_reward,
                                                        mean_ep_length, std_ep_length,
                                                        time.time() - self.start_time))

            if mean_reward > self.best_mean_reward:
                if self.verbose > 0:
                    print("New best mean reward!")
                if self.best_model_save_path is not None:
                    self.model.save(os.path.join(self.best_model_save_path, 'best_model'))
                self.best_mean_reward = mean_reward
                # Trigger callback if needed
                if self.callback is not None:
                    return self._on_event()

        return True


def make_env(base_env):
    """
    Utility function for multiprocessed env.

    :param base_env: (callable) the environment class to instantiate in each subprocess
    """
    def _init():
        env = base_env()
        return env
    return _init


class TestEnv(gym.Env):

    def __init__(self, dimensions=1000, observations=100):
        self.action_space = spaces.Box(low=0, high=1, shape=(dimensions, ))
        self.observation_space = spaces.Box(low=0, high=1, shape=(observations, ))
        self.obs = np.random.rand(observations)

    def step(self, action):
        reward = np.sum(action)
        observation = self.obs
        done = True
        return observation, reward, done, {}

    def reset(self):
        observation = self.obs
        return observation


if __name__ == "__main__":
    num_cpu = 6

    # Create environment
    env = SubprocVecEnv([make_env(TestEnv) for i in range(num_cpu)])

    # Instantiate the agent
    model = PPO2('MlpPolicy', env, verbose=0)

    # Evaluate the agent
    eval_env = TestEnv()
    eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/',
                                 log_path='./logs/', eval_freq=1000,
                                 deterministic=True, render=False, verbose=1)
    model.learn(total_timesteps=int(1e8), callback=eval_callback)
I can't answer your question, but could you try plotting with the number of timesteps on the x-axis? Maybe sampling in ray is slower than it is in other libraries.
Also, try reducing "num_sgd_iter" for PPO (the default is 30, which seems high), and try PPO with Tensorflow.
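For the timesteps plot, the CSVs your script already writes make this easy; a minimal sketch (assuming matplotlib is installed and the log paths from your script):
import pandas as pd
import matplotlib.pyplot as plt

# Plot mean reward against timesteps rather than wall time.
df = pd.read_csv("./logs/ppo_evaluations.csv")
plt.plot(df["Timesteps"], df["Mean_reward"], label="RLlib PPO")
plt.xlabel("Timesteps")
plt.ylabel("Mean reward")
plt.legend()
plt.show()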
@matej-macak thanks for filing this! It seems to simply be related to the default hyperparameters.
1) Could you try setting the kl_coeff in your config to 0.0?
2) Also, it's probably better to use no discounting at all since it is a context-less env (set gamma to 0.0).
Please let us know whether this works. A sketch of both overrides follows below.
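Applied to the PPO config from the reproduction script, the suggested overrides would look something like this (illustrative):
config = ppo.DEFAULT_CONFIG.copy()
config["kl_coeff"] = 0.0  # disable the KL penalty term
config["gamma"] = 0.0     # no discounting, since each episode is a single step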
Hi all,
thank you very much - a really amazing community to give such a quick set of answers. Here are the tests that I have run:
In this test, I set the kl_coeff to 0.0 and gamma to 0.0. Agreed that this is a context-less environment, but I wanted to keep the comparison like for like (as with stable-baselines). It definitely helped with convergence, although it is still a bit slower. This got me going on a good trajectory, though; I could probably use tune to find a set of coefficients that improves this even further. Is there any reason why the
I set num_sgd_iter to 15 (I am not sure what the right value should be in this case).
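A rough sketch of what such a tune sweep could look like (values are illustrative, not recommendations; assumes TestEnv has been registered with register_env as in the script above):
from ray import tune

tune.run(
    "PPO",
    stop={"training_iteration": 50},
    config={
        "env": "TestEnv",
        "kl_coeff": 0.0,
        "gamma": 0.0,
        "num_sgd_iter": tune.grid_search([5, 15, 30]),
        "train_batch_size": tune.grid_search([500, 1000, 4000]),
    },
)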

@regproj - Yes, I have noticed that the timesteps are lower than in the stable-baselines case, but what I am interested in is the wall time, not the timestep efficiency (i.e. for the same level of resources, what speed can I get).
Overall, I find ray amazing and it definitely outperforms stable-baselines on many Atari benchmarks I have tested. As the problem I am trying to solve is more similar to the TestEnv class, I wanted to solve this toy example before deploying a cluster for it.
Hey @matej-macak, actually, yeah, I have also noticed very slow learning convergence having to do with the (continuous) action space being bounded for PPO. In the case of bounded continuous actions, we simply clip the output actions before sending them to the env. I'll look into this further.
Does the action distribution for PPO (or DDPG, SAC, etc.) start off as a unit Gaussian? If so, would it be better for convergence to set the environment action bounds to something like -1, 1 and rescale the actions inside the environment?
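By rescaling inside the environment I mean something along these lines (a hypothetical wrapper sketch, not an RLlib built-in):
import numpy as np
import gym
from gym import spaces

class RescaleActionWrapper(gym.ActionWrapper):
    """Expose a [-1, 1] action space and map actions back to the wrapped env's bounds."""
    def __init__(self, env):
        super(RescaleActionWrapper, self).__init__(env)
        self.orig_low = env.action_space.low
        self.orig_high = env.action_space.high
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=env.action_space.shape)

    def action(self, action):
        # Linear map from [-1, 1] to [orig_low, orig_high], clipped for safety.
        rescaled = self.orig_low + (np.asarray(action) + 1.0) * 0.5 * (self.orig_high - self.orig_low)
        return np.clip(rescaled, self.orig_low, self.orig_high)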
I have noticed that part of the slowness can be explained by parameter tuning (i.e. the system is very sensitive to train_batch_size, num_sgd_iter, rollout_fragment_length and sgd_minibatch_size). I am assuming that, given these are deterministic, one-step environments, it is better not to have a final update batch size much larger than the number of workers, as the batch probably contains a high number of repetitive actions and steps, which leads to slower training.
Lowering these parameters, however, leads to a quicker memory leak and crash, which I reported in #8473.
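For reference, the kind of lowered settings I am talking about look roughly like this (illustrative values only, not the exact ones I used; see also the config in the next comment):
config = ppo.DEFAULT_CONFIG.copy()
config["train_batch_size"] = 64
config["sgd_minibatch_size"] = 16
config["rollout_fragment_length"] = 8  # steps each worker collects per sample batch
config["num_sgd_iter"] = 4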
I think I found a quite satisfying hyperparameter solution, which now converges quite fast (within <1 min). There will be a PR today that also gets rid of the Box-limit problem in our PPO (Box(0.0, 1.0) learns ok, but e.g. Box(1.0, 3.0) doesn't).
...
Here is the config that works quite well on my end now. It basically simplifies everything a lot. But then again, it's also a very simple env.
config = {
    "num_workers": 0,
    "entropy_coeff": 0.00001,
    "num_sgd_iter": 4,
    "vf_loss_coeff": 0.0,
    # "vf_clip_param": 100.0,
    # "grad_clip": 1.0,
    "lr": 0.0005,
    # State doesn't matter -> Set both gamma and lambda to 0.0.
    "lambda": 0.0,
    "gamma": 0.0,
    "clip_param": 0.1,
    "kl_coeff": 0.0,
    "train_batch_size": 64,
    "sgd_minibatch_size": 16,
    "normalize_actions": True,
    "clip_actions": False,
    # Use a very simple model for faster convergence.
    "model": {
        "fcnet_hiddens": [8],
    },
    "use_pytorch": True,  # or False
}
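For reference, this config plugs into the trainer just like in the reproduction script (sketch; assumes ray.init() and register_env("TestEnv", ...) have already been called):
trainer = ppo.PPOTrainer(config=config, env="TestEnv")
for i in range(20):
    result = trainer.train()
    print(i, result["episode_reward_mean"])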
Closing this issue.
On another note: it could also be that baselines treats the action space as type int, in which case it would be easier to reach the 1000 reward, because you would only have two choices (0 and 1) for each action component.
Hi @sven1977, the action_space is of type float. I have checked the action vector in step() and it is not producing binary results. I think the hyperparameter search definitely helps. I have been working on a similar, more complex problem that inspired this question, and my finding is that RLlib seems to have more stable but slower convergence than stable-baselines, which is faster but can get stuck in local minima a bit more easily.
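A quick check along these lines confirms the space is continuous (illustrative):
env = TestEnv()
print(env.action_space.dtype)         # float32, i.e. a continuous Box, not ints
print(env.action_space.sample()[:5])  # continuous values in [0, 1]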