Hello,
First of all, I would like to thank you for this incredibly polished package. Seeing good documentation just makes me feel happy.
While training PPO2 on Pong-ram-v0 (and several other envs), I realized that the episode reward isn't printed correctly to the console. The output looks like this:

I tried looking into it but couldn't figure out why just yet. If you give me some pointers I would like to work on fixing it.
I am using Python 3.6.5 on Ubuntu 16.04.
Hi,
Looking at the code, this seems to be expected behavior.
The episode info comes from this line of code: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo2/ppo2.py#L404
By default, no episode info is returned. A simple solution is to wrap the environment with Monitor.
This is the class that adds the episode info, here:
https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/bench/monitor.py#L92
So, a workaround for now would look like this (cf. Monitoring Training in the docs):
import os
import time
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.bench import Monitor
# Create unique log dir
log_dir = "/tmp/gym/{}".format(int(time.time()))
os.makedirs(log_dir, exist_ok=True)
# Create and wrap the environment
env = gym.make('CartPole-v1')
env = Monitor(env, log_dir, allow_early_resets=True)
env = DummyVecEnv([lambda: env])
# Train the agent; Monitor now supplies the episode info that PPO2 prints
PPO2('MlpPolicy', env, verbose=1).learn(1000)
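To make the mechanism above concrete, here is a minimal sketch of how per-episode info attached by a Monitor-like wrapper could be collected and averaged for console output. It does not use stable-baselines itself: the helper names (`collect_episode_infos`, `safe_mean`), the buffer size of 100, and the sample `infos` are assumptions made for illustration, and the keys `"r"` (reward), `"l"` (length) and `"t"` (elapsed time) are assumed to match what Monitor writes.

```python
from collections import deque


def collect_episode_infos(infos, buffer):
    """Append the 'episode' dict of every finished episode to buffer.

    infos holds one info dict per environment for a single step. Without a
    Monitor-like wrapper, no dict has an 'episode' key and nothing is added.
    """
    for info in infos:
        episode = info.get("episode")
        if episode is not None:
            buffer.append(episode)


def safe_mean(values):
    """Return the mean of values, or NaN if there are none yet."""
    values = list(values)
    return sum(values) / len(values) if values else float("nan")


# Keep only the most recent episodes (the size is an assumption)
ep_info_buf = deque(maxlen=100)

# Hypothetical infos for three steps of two environments
step_infos = [
    [{}, {}],
    [{"episode": {"r": 21.0, "l": 200, "t": 1.5}}, {}],
    [{}, {"episode": {"r": 19.0, "l": 180, "t": 2.0}}],
]
for infos in step_infos:
    collect_episode_infos(infos, ep_info_buf)

ep_reward_mean = safe_mean(ep["r"] for ep in ep_info_buf)
ep_len_mean = safe_mean(ep["l"] for ep in ep_info_buf)
print(ep_reward_mean, ep_len_mean)
```

With an unwrapped environment the buffer stays empty and both means are NaN, which is consistent with the missing values reported at the top of this thread.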
A better solution would be to use the episode_reward logger that is already used for TensorBoard. @hill-a, what do you think?
The episode_reward is computed here: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo2/ppo2.py#L313
At the same time, it could be a good idea to add episode_reward to the console logger for all algorithms.
@batu feel free to submit a PR ;) (and don't hesitate to ask more questions if you need clarification)
Hey,
Using the episode_reward variable should not be too hard: it accumulates the reward of each environment over the current episode. To get the actual episode reward, I would recommend reusing a lot of the code from stable_baselines.a2c.utils.total_episode_reward_logger.
Afterwards, you would just need to create a list of the episode rewards, accumulate said rewards, and return the average reward and the average episode length.
As @araffin said, don't hesitate to ask if you have any questions :).
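The accumulation described above could be sketched roughly as follows. This is not the code of total_episode_reward_logger; the function name `finished_episode_rewards` and the data layout are hypothetical, and it assumes that `dones[t][e]` is True when step `t` ended the episode in environment `e`.

```python
def finished_episode_rewards(running, rewards, dones):
    """Accumulate per-environment rewards and return completed episode totals.

    running: per-environment reward accumulated so far (mutated in place,
             so partial episodes carry over to the next rollout)
    rewards: rewards[t][e] is the reward at step t in environment e
    dones:   dones[t][e] is True if step t ended the episode in environment e
             (this convention is an assumption of the sketch)
    """
    finished = []
    for step_rewards, step_dones in zip(rewards, dones):
        for env_idx, (reward, done) in enumerate(zip(step_rewards, step_dones)):
            running[env_idx] += reward
            if done:
                finished.append(running[env_idx])
                running[env_idx] = 0.0
    return finished


# Two environments, first rollout of three steps
running = [0.0, 0.0]
first = finished_episode_rewards(
    running,
    rewards=[[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
    dones=[[False, False], [True, False], [False, False]],
)

# Second rollout of one step: environment 1 finishes its first episode
second = finished_episode_rewards(
    running,
    rewards=[[1.0, 1.0]],
    dones=[[False, True]],
)

all_rewards = first + second
mean_reward = sum(all_rewards) / len(all_rewards)
print(all_rewards, mean_reward)
```

Keeping `running` outside the function is what lets an episode that spans several rollouts be counted once, with its full reward.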
Thank you! This is definitely good enough to get me started.
I will look into it when I have the time, which should be within a week.
I managed to get the average reward over multiple environments, however I am having some difficulty getting the episode length.
From what I can tell, there is no easy way of getting the length of every episode?
I worked on taking the difference between consecutive "true"s in the masks. However, for that to work I need to know the maximum episode length (which I assume isn't available?) to handle the overflows, which I think is different from n. (The steps between the last "true" in mask[i] and the first "true" in mask[i+1] belong to another episode.) My abomination of an implementation currently seems to work as long as the maximum episode length is less than 2 * n_steps.
That being said, I am almost sure I am over-complicating things and there is just a nice variable for episode length, somewhere.
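One way to avoid needing a maximum episode length, sketched here under assumptions, is to keep a per-environment step counter that persists across rollouts instead of differencing positions inside a single mask array. The class name `EpisodeLengthTracker` is hypothetical, and the sketch assumes `dones[t][e]` is True when step `t` ended the episode in environment `e`, which may not match the mask convention of a particular runner.

```python
class EpisodeLengthTracker:
    """Track episode lengths across rollouts without a maximum length."""

    def __init__(self, n_envs):
        # Steps taken so far in the current episode of each environment
        self.steps = [0] * n_envs

    def update(self, dones):
        """Consume one rollout of done flags and return completed lengths."""
        lengths = []
        for step_dones in dones:
            for env_idx, done in enumerate(step_dones):
                self.steps[env_idx] += 1
                if done:
                    lengths.append(self.steps[env_idx])
                    self.steps[env_idx] = 0
        return lengths


# One environment, rollouts of n_steps = 2, one episode lasting 5 steps
tracker = EpisodeLengthTracker(n_envs=1)
lengths = []
lengths += tracker.update([[False], [False]])
lengths += tracker.update([[False], [False]])
lengths += tracker.update([[True], [False]])
print(lengths)
```

Because the counter is only reset on a done flag, this handles episodes longer than 2 * n_steps as well as environments with no step limit.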
Looking at the code again, I found that there was an EpisodeStats class in a2c utils (https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/a2c/utils.py#L428) that apparently can do the job.
It is used in ACER only. @hill-a it seems that this class can be reused, no?
Otherwise, I recommend taking a look at the new vec env monitor (only in openai baselines for now: https://github.com/openai/baselines/blob/master/baselines/common/vec_env/vec_monitor.py)
@batu looking at your code, using the max episode length won't work, as some environments do not have limits.
@batu How are you doing?
I have a semi-working implementation; however, it is definitely not clean enough. When my current work hump passes, I will go back to it and, once I stop making silly mistakes, submit a PR.
Hi,
Maybe I misunderstand the resolution of this, but for the life of me I still can't get the average episode length for an A2C model. Can someone clarify how I can see eplenmean? Or is it only for PPO2 models?
If you read the first answer, you need to use a Monitor wrapper; this is done automatically if you use the rl zoo.
@hill-a
How is the episode reward calculated in TensorBoard? Is it calculated the same way you described (you would just need to create a list of the episode rewards, accumulate said rewards, and return average rewards + average episode length)?