Stable-baselines: Mean Episode Reward and Length showing NaN in PPO2 training.

Created on 18 Sep 2018 · 10 Comments · Source: hill-a/stable-baselines

Hello,

First of all, I would like to thank you for this incredibly polished package. Seeing good documentation just makes me feel happy.

While training PPO2 on Pong-ram-v0 (and several other envs), I noticed that the episode statistics don't print correctly to the console: the mean episode reward and length show up as NaN.
[Screenshot from 2018-09-18 12-15-51]

I tried looking into it but haven't figured out why yet. If you give me some pointers, I would like to work on fixing it.

I am using Python 3.6.5 on Ubuntu 16.04.

Labels: enhancement, good first issue, help wanted

batu

👍3

Most helpful comment

Looking at the code again, I found that there was an EpisodeStats class in a2c utils (https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/a2c/utils.py#L428) that apparently can do the job.
It is used in ACER only. @hill-a it seems that this class can be reused, no?

Otherwise, I recommend taking a look at the new vec env monitor (only in openai baselines for now: https://github.com/openai/baselines/blob/master/baselines/common/vec_env/vec_monitor.py)

@batu looking at your code, using the max episode length won't work, as some environments do not have limits.

araffin on 20 Sep 2018

👍2

All 10 comments

Hi,
Looking at the code, it seems this is expected behavior.
The episode info comes from this line of code: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo2/ppo2.py#L404

By default, no episode info is returned. A simple solution is to wrap the environment with Monitor.
It is this class that adds the episode info, here:
https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/bench/monitor.py#L92
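
To illustrate why the console shows NaN, here is a minimal sketch. It assumes the printed statistics are averages over a buffer of "episode" entries taken from the info dicts, and that an empty buffer is reported as NaN. The names safe_mean, ep_info_buf and collect_episode_infos, as well as the "r" and "l" keys, are illustrative assumptions here, not a copy of the library code.

```python
import math
from collections import deque
from statistics import mean


def safe_mean(values):
    """Return the mean of `values`, or NaN when there is nothing to average."""
    values = list(values)
    return mean(values) if values else float("nan")


# Hypothetical buffer holding the summaries of the most recent episodes
ep_info_buf = deque(maxlen=100)


def collect_episode_infos(infos):
    """Keep only the info dicts that carry an "episode" summary."""
    for info in infos:
        episode = info.get("episode")
        if episode is not None:
            ep_info_buf.append(episode)


# Unwrapped environment: the info dicts never contain an "episode" entry,
# so the buffer stays empty and the mean is NaN.
collect_episode_infos([{}, {}])
mean_reward_unwrapped = safe_mean(ep["r"] for ep in ep_info_buf)

# Monitor-style info dicts (assumed keys: "r" = reward, "l" = length)
collect_episode_infos([
    {"episode": {"r": 21.0, "l": 800}},
    {},
    {"episode": {"r": 19.0, "l": 760}},
])
mean_reward_wrapped = safe_mean(ep["r"] for ep in ep_info_buf)
mean_length_wrapped = safe_mean(ep["l"] for ep in ep_info_buf)
```

With nothing in the buffer the average is undefined, which would explain the NaN values; once episode summaries arrive, the means become ordinary numbers.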

So, a workaround for now would look like this (cf. Monitoring Training in the docs):

```python
import os
import time

import gym

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.bench import Monitor


# Create unique log dir
log_dir = "/tmp/gym/{}".format(int(time.time()))
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment
env = gym.make('CartPole-v1')
env = Monitor(env, log_dir, allow_early_resets=True)
env = DummyVecEnv([lambda: env])

# Train; the mean episode reward and length should now be printed
PPO2('MlpPolicy', env, verbose=1).learn(1000)
```
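
For intuition about what such a wrapper does, here is a rough, gym-free sketch. CountdownEnv and EpisodeInfoWrapper are hypothetical names made up for this illustration, and the "r"/"l" keys are assumed; the real Monitor class does more (for example, writing a log file).

```python
class CountdownEnv:
    """Toy environment (hypothetical): reward 1.0 per step, ends after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


class EpisodeInfoWrapper:
    """Track rewards of the current episode and report a summary when it ends."""

    def __init__(self, env):
        self.env = env
        self.rewards = []

    def reset(self):
        self.rewards = []
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.rewards.append(reward)
        if done:
            # Copy the dict so the wrapped environment's info is not mutated
            info = dict(info)
            info["episode"] = {"r": sum(self.rewards), "l": len(self.rewards)}
        return obs, reward, done, info


env = EpisodeInfoWrapper(CountdownEnv(horizon=3))
env.reset()
infos = [env.step(0)[3] for _ in range(3)]
```

Only the info dict of the final step carries the "episode" entry; the earlier steps return an empty dict, which is all an unwrapped environment would ever return.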

A better solution would be to use the episode_reward logger that is used for TensorBoard, @hill-a?
The episode_reward is computed here: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo2/ppo2.py#L313

At the same time, it could be a good idea to add episode_reward to the console logger for all algorithms.
@batu feel free to submit a PR ;) (and don't hesitate to ask more questions if you need clarification)

araffin on 18 Sep 2018

👍2

Hey,

Using the episode_reward variable should not be too hard: it is the accumulated reward of the current episode for each environment. To get the actual episode reward, I would recommend reusing a lot of code from stable_baselines.a2c.utils.total_episode_reward_logger.

Afterwards, you would just need to create a list of the episode rewards, accumulate said rewards, and return average rewards + average episode length.
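
A minimal sketch of that accumulation, under some assumptions: rewards and dones are nested lists indexed as [env][step], dones[env][step] being True means the episode ended with the reward at that step, and accumulate_episode_rewards is a hypothetical helper name (this is not the code of total_episode_reward_logger).

```python
from statistics import mean


def accumulate_episode_rewards(running_rewards, rewards, dones, finished):
    """Add one rollout to the per-environment running totals.

    running_rewards: one partial episode reward per environment, carried
                     over from the previous rollout (modified in place).
    finished:        list receiving the total reward of each completed episode.
    """
    for env_idx, (env_rewards, env_dones) in enumerate(zip(rewards, dones)):
        for reward, done in zip(env_rewards, env_dones):
            running_rewards[env_idx] += reward
            if done:
                finished.append(running_rewards[env_idx])
                running_rewards[env_idx] = 0.0
    return running_rewards


running = [0.0, 0.0]  # two environments
finished = []

# First rollout: env 0 finishes an episode at its third step
accumulate_episode_rewards(
    running,
    rewards=[[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    dones=[[False, False, True, False], [False, False, False, False]],
    finished=finished,
)

# Second rollout: both environments finish an episode at their second step
accumulate_episode_rewards(
    running,
    rewards=[[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    dones=[[False, True, False, False], [False, True, False, False]],
    finished=finished,
)

mean_episode_reward = mean(finished)
```

Because the running totals persist between calls, an episode that spans two rollouts (env 1 above) is still counted once, with its full reward.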

As @araffin said, don't hesitate to ask if you have any questions :).

hill-a on 19 Sep 2018

Thank you! This is definitely good enough to get me started.

I will look into it when I have the time, which should be within a week's time.

batu on 19 Sep 2018

I managed to get the average reward over multiple environments; however, I am having some difficulty getting the episode length.

From what I can tell, there is no easy way of getting the length of every episode?

I worked on taking the difference between "true"s in the masks. However, for that to work I need to know the maximum episode length (which I assume isn't available?) to handle the overflows, which I think is different from n. (The number of steps between the last "true" in mask[i] and the first "true" in mask[i+1] is another episode.) My abomination of an implementation currently seems to work if the maximum episode length is less than 2 * n_steps.

That being said, I am almost sure I am over-complicating things and there is just a nice variable for episode length, somewhere.
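
One way to avoid needing a maximum episode length is to keep a per-environment step counter that survives across rollouts. The sketch below is an illustration under assumptions: EpisodeLengthTracker is a hypothetical class, and dones[env][step] being True is taken to mean the episode ended at that step.

```python
class EpisodeLengthTracker:
    """Count steps per environment across rollouts of any size."""

    def __init__(self, n_envs):
        self.counters = [0] * n_envs  # steps taken in each ongoing episode
        self.lengths = []             # lengths of all completed episodes

    def update(self, dones):
        """Consume one rollout of done flags, indexed as dones[env][step]."""
        for env_idx, env_dones in enumerate(dones):
            for done in env_dones:
                self.counters[env_idx] += 1
                if done:
                    self.lengths.append(self.counters[env_idx])
                    self.counters[env_idx] = 0


tracker = EpisodeLengthTracker(n_envs=1)

# One 10-step episode spread over three rollouts of n_steps = 4
tracker.update([[False, False, False, False]])
tracker.update([[False, False, False, False]])
tracker.update([[False, True, False, False]])
```

Here the episode is longer than 2 * n_steps, yet its length is recovered because the counter is only reset when a done flag is seen, not at rollout boundaries.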

batu on 20 Sep 2018

@batu How are you doing?

araffin on 5 Oct 2018

I have a semi-working implementation; however, it is definitely not clean enough. When my current work hump passes, I will go back to it and, once I stop making silly mistakes, submit a PR.

batu on 8 Oct 2018

👍1

Hi,

Maybe I misunderstand the resolution of this, but for the life of me I still can't get the average episode length for an A2C model. Can someone clarify how I can see eplenmean? Or is it only for PPO2 models?

rachlee93 on 3 Apr 2020

If you read the first answer, you need to use a Monitor wrapper. This will be done automatically if you use the rl zoo.

araffin on 4 Apr 2020

@hill-a
How is the episode reward calculated in TensorBoard? Is it calculated the same way you described (you would just need to create a list of the episode rewards, accumulate said rewards, and return average rewards + average episode length)?