As an example, I am using PPO2 to solve CartPole.
I use evaluate_policy to evaluate my trained policy before visualizing it, and I noticed that the mean reward reported by evaluate_policy is consistently and considerably higher than what the same trained agent achieves when I run it for visualization.
See the example here:
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.bench import Monitor
# repeat 3 times to validate
for rep in range(3):
    print(f"\nRepetition {rep}")
    env = gym.make('CartPole-v1')
    model = PPO2(MlpPolicy, Monitor(env, filename='logs/CartPole-v1/PPO2/'), verbose=0).learn(10000)
    # evaluate
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"Eval reward: {mean_reward} (+/-{std_reward})")
    # test and visualize
    obs = env.reset()
    for i in range(500):
        action, _states = model.predict(obs)
        obs, rewards, done, info = env.step(action)
        if done:
            print(f"Failed after {i} steps.")
            break
        # env.render()
Running this produced:
Repetition 0
Eval reward: 500.0 (+/-0.0)
Failed after 232 steps.
Repetition 1
Eval reward: 378.6 (+/-122.41829928568686)
Failed after 152 steps.
Repetition 2
Eval reward: 457.0 (+/-60.59372904847498)
Failed after 27 steps.
As you can see, there is a huge gap between the evaluation reward and the achieved reward when manually testing it afterwards.
Of course, I don't expect it to be equal to the evaluation mean reward. But I repeated this multiple times, and the evaluation reward is always much higher (never lower).
Did I miss something in the documentation? What is the reason for this big gap?
I'm using Python 3.6, stable-baselines 2.10.0, Windows 10.
Hello,
As mentioned in the documentation, try to use deterministic=True (this is the default for evaluate_policy).
Ah, so using model.predict(obs, deterministic=True)?
Indeed, now the results are what I expected. Thanks!
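For anyone landing here later, here is a minimal sketch of the manual test loop with the fix applied. It assumes the stable-baselines 2.x `model.predict(obs, deterministic=...)` signature and the old gym `step` API returning a 4-tuple, as in the script above. The helper name `run_episode` and its `max_steps` parameter are made up for illustration and are not part of the library.

```python
def run_episode(model, env, deterministic=True, max_steps=500):
    """Run one episode and return (total_reward, steps_taken).

    `model` needs a predict(obs, deterministic=...) method and `env`
    needs reset() and a 4-tuple step(), as in stable-baselines 2.x
    with the classic gym API (an assumption of this sketch).
    """
    obs = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        # deterministic=True matches the default used by evaluate_policy
        action, _states = model.predict(obs, deterministic=deterministic)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps
```

With the `model` and `env` from the script above, `run_episode(model, env)` should give episode rewards comparable to what evaluate_policy reports, while `run_episode(model, env, deterministic=False)` reproduces the lower numbers from the original loop.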
I was trying to follow the docs and used the example here, testing it with PPO2: https://stable-baselines.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading
I now see that DQN uses deterministic=True by default, whereas PPO2 uses deterministic=False.
I don't quite get what deterministic=True does. The docs say "Whether or not to return deterministic actions."
But even model.predict(obs, deterministic=False) seems to return deterministic actions (either 0 or 1). Still, the achieved reward is lower with deterministic=False. Could you explain or point to the corresponding place in the docs? Thanks!
When/why would I set deterministic=False?
For continuous actions, it returns the mean of the Gaussian from which actions are sampled in the stochastic case. That is, rather than sampling actions, and possibly getting different actions for the same observation, we always take the same action for the same observation.
In your case, it is probably sampling actions outside the [0, 1] interval, which are then clipped to [0, 1] so that they work with the environment; hence you see actions like that even with deterministic=False.
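For the continuous case, the difference can be sketched roughly as follows. This is an illustration only, not the library's code: the function name `gaussian_action`, the bounds, and the numbers are all made up.

```python
import random

def gaussian_action(mean, std, low, high, deterministic, rng=random):
    """Pick a continuous action from a Gaussian policy output (illustrative)."""
    if deterministic:
        # Deterministic: always the mean, so same observation -> same action
        action = mean
    else:
        # Stochastic: draw from N(mean, std), so repeated calls can differ
        action = rng.gauss(mean, std)
    # Clip to the action-space bounds before passing it to the environment
    return min(max(action, low), high)

rng = random.Random(0)  # seeded only to make the sketch reproducible
print(gaussian_action(0.3, 1.0, -1.0, 1.0, deterministic=True))
print([gaussian_action(0.3, 1.0, -1.0, 1.0, False, rng) for _ in range(3)])
```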
@Miffyli it seems he is using CartPole, so discrete actions (0 or 1).
EDIT: in that case the probability distribution is a Categorical one.
@araffin Ah, somehow I thought it would have been continuous, my bad!
In the case of discrete actions, deterministic=True returns the action with the highest probability.
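A small sketch of what this means for a two-action environment like CartPole. The probabilities below are invented for illustration, and `select_action` is a hypothetical helper, not a stable-baselines function. Both modes return 0 or 1, which is why the output of deterministic=False also looks "deterministic" at first glance; the difference is how often the less likely action is picked.

```python
import random

def select_action(probs, deterministic, rng=random):
    """Pick an action index from a categorical distribution (list of probabilities)."""
    if deterministic:
        # Greedy: index of the largest probability
        return max(range(len(probs)), key=lambda a: probs[a])
    # Stochastic: draw an index in proportion to its probability
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.7, 0.3]      # made-up policy output for one observation
rng = random.Random(0)  # seeded only to make the sketch reproducible

greedy = [select_action(probs, True) for _ in range(1000)]
sampled = [select_action(probs, False, rng) for _ in range(1000)]

# Fraction of steps on which action 1 (the less likely one) was taken
print("greedy :", sum(greedy) / len(greedy))
print("sampled:", sum(sampled) / len(sampled))
```

With sampling, the lower-probability action is still taken a fair share of the time, and in a balancing task each such action can push the pole closer to falling, which would explain the shorter episodes.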
OK, so greedily taking the action with the highest probability (deterministic=True) apparently yields better results in my CartPole env than sampling from the action distribution (deterministic=False). Thanks!