Stable-baselines: Mean Episode Reward and Length showing NaN in PPO2 training.

Created on 18 Sep 2018 · 10 Comments · Source: hill-a/stable-baselines

Hello,

First of all, I would like to thank you for this incredibly polished package. Seeing good documentation just makes me feel happy.

While training PPO2 on Pong-ram-v0 (and several other envs) I realized that the episode reward doesn't print correctly to the console. It is seen as follows:
[Image: screenshot from 2018-09-18 12-15-51]

I tried looking into it but couldn't figure out why just yet. If you give me some pointers I would like to work on fixing it.

I am using Python 3.6.5 on Ubuntu 16.04.

Labels: enhancement, good first issue, help wanted

All 10 comments

Hi,
Looking at the code, this seems to be expected behavior.
The episode info comes from this line of code: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo2/ppo2.py#L404

By default, no episode info is returned. A simple solution is to wrap the environment with Monitor.
This is the class that adds the episode info, here:
https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/bench/monitor.py#L92
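
To make the NaN easier to see, here is a minimal sketch of the mechanism described above. It is not the library's code: the `safe_mean` helper, the `ep_info_buf` buffer and the sample `infos` list are illustrative assumptions, and the only thing taken from Monitor is the idea that a finished episode shows up as an "episode" entry in the info dict, with the reward under "r" and the length under "l".

```python
import math
from collections import deque


def safe_mean(values):
    """Return the mean of `values`, or NaN when there is nothing to average."""
    return math.nan if len(values) == 0 else sum(values) / len(values)


# Buffer of recent episode summaries, filled only from info["episode"].
ep_info_buf = deque(maxlen=100)

# Hypothetical per-step info dicts: the first two look like what an unwrapped
# environment returns, the last one like what a Monitor-wrapped one returns
# on the final step of an episode.
infos = [{}, {}, {"episode": {"r": 21.0, "l": 764}}]

# With an empty buffer there is nothing to average, hence NaN.
before = safe_mean([ep["r"] for ep in ep_info_buf])

for info in infos:
    maybe_ep_info = info.get("episode")
    if maybe_ep_info is not None:
        ep_info_buf.append(maybe_ep_info)

after = safe_mean([ep["r"] for ep in ep_info_buf])
print(before, after)
```

Without the wrapper, no info dict ever carries an "episode" entry, so the buffer stays empty and the logged means stay NaN for the whole run.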

So, a workaround for now would look like this (cf. Monitoring Training in the docs):

import os
import time

import gym

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.bench import Monitor


# Create unique log dir
log_dir = "/tmp/gym/{}".format(int(time.time()))
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment
env = gym.make('CartPole-v1')
env = Monitor(env, log_dir, allow_early_resets=True)
env = DummyVecEnv([lambda: env])

PPO2('MlpPolicy', env, verbose=1).learn(1000)

A better solution would be to use the episode_reward logger that is used for tensorboard, @hill-a ?
The episode_reward is computed here: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo2/ppo2.py#L313

At the same time, it could be a good idea to add episode_reward to the console logger for all algorithms.
@batu feel free to submit a PR ;) (and don't hesitate to ask more questions if you need clarification)

Hey,

Using the episode_reward variable should not be too hard: it is the accumulation of the reward in each environment for the current episode. To get the actual episode reward, I would recommend reusing a lot of code from stable_baselines.a2c.utils.total_episode_reward_logger.

Afterwards, you would just need to create a list of the episode rewards, accumulate said rewards, and return average rewards + average episode length.
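
The accumulation described here could be sketched roughly as follows. `EpisodeRewardTracker` is a hypothetical helper written for illustration, not something that exists in stable-baselines, and it assumes that `dones[e]` is True on the step where environment `e` finishes its episode.

```python
from collections import deque


class EpisodeRewardTracker:
    """Hypothetical helper: turns per-step rewards into episode returns."""

    def __init__(self, n_envs, window=100):
        self.running = [0.0] * n_envs         # return of the episode in progress
        self.finished = deque(maxlen=window)  # returns of completed episodes

    def update(self, rewards, dones):
        """rewards, dones: one value per environment for a single step."""
        for env_idx, (reward, done) in enumerate(zip(rewards, dones)):
            self.running[env_idx] += reward
            if done:
                # Episode over: store its return and start a new accumulator.
                self.finished.append(self.running[env_idx])
                self.running[env_idx] = 0.0

    def mean_reward(self):
        if not self.finished:
            return float("nan")
        return sum(self.finished) / len(self.finished)


tracker = EpisodeRewardTracker(n_envs=2)
steps = [
    ([1.0, 0.5], [False, False]),
    ([1.0, 0.5], [False, True]),   # env 1 finishes with return 1.0
    ([1.0, 2.0], [True, False]),   # env 0 finishes with return 3.0
]
for rewards, dones in steps:
    tracker.update(rewards, dones)
print(tracker.mean_reward())
```

The bounded deque keeps the average over recent episodes only, so the printed value follows the current policy rather than the whole training history.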

As @araffin said, don't hesitate to ask if you have any questions :).

Thank you! This is definitely good enough to get me started.

I will look into it when I have the time, which should be within a week.

I managed to get the average reward over multiple environments; however, I am having some difficulty getting the episode length.

From what I can tell, there is no easy way of getting the length of every episode?

I tried taking the difference between the "true"s in the masks. For that to work, however, I need to know the maximum episode length (which I assume isn't available) in order to handle the overflows, and I think that value is different from n. (The number of steps between the last "true" in mask[i] and the first "true" in mask[i+1] belongs to another episode.) My abomination of an implementation currently seems to work if the maximum episode length is less than 2 * n_steps.

That being said, I am almost sure I am over-complicating things and there is just a nice variable for episode length, somewhere.
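
One way to avoid needing a maximum episode length is to keep a per-environment step counter that persists between rollouts, instead of measuring distances between "true"s inside a single batch. The sketch below is only an illustration: `EpisodeLengthCounter` is a hypothetical name, and it assumes `dones[t][e]` is True on the step where environment `e` ends its episode, which may not match the exact convention of the masks in PPO2.

```python
class EpisodeLengthCounter:
    """Hypothetical helper: derives episode lengths from done flags."""

    def __init__(self, n_envs):
        self.steps_so_far = [0] * n_envs  # persists between rollouts
        self.lengths = []                 # lengths of completed episodes

    def process_rollout(self, dones):
        """dones[t][e] is True when env e's episode ended at step t."""
        for step_dones in dones:
            for env_idx, done in enumerate(step_dones):
                self.steps_so_far[env_idx] += 1
                if done:
                    self.lengths.append(self.steps_so_far[env_idx])
                    self.steps_so_far[env_idx] = 0


counter = EpisodeLengthCounter(n_envs=1)
# Three rollouts of n_steps = 3 for a single environment.
counter.process_rollout([[False], [True], [False]])   # an episode of length 2 ends
counter.process_rollout([[False], [False], [False]])  # no episode ends here
counter.process_rollout([[False], [False], [True]])   # episode of length 1 + 3 + 3 ends
print(counter.lengths)
```

Because the counter carries over, the second episode (longer than 2 * n_steps in this toy example) is still measured correctly, and environments without a step limit need no special handling.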

Looking at the code again, I found that there was an EpisodeStats class in a2c utils (https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/a2c/utils.py#L428) that apparently can do the job.
It is used in ACER only. @hill-a it seems that this class can be reused, no?

Otherwise, I recommend taking a look at the new vec env monitor (only in openai baselines for now: https://github.com/openai/baselines/blob/master/baselines/common/vec_env/vec_monitor.py)

@batu looking at your code, using the max episode length won't work, as some environments do not have limits.

@batu How are you doing?

I have a semi-working implementation; however, it is definitely not clean enough. When my current work hump passes I will go back to it and, once I stop making silly mistakes, submit a PR.

Hi,

Maybe I misunderstand the resolution of this, but for the life of me I still can't get the average episode length for an A2C model. Can someone clarify how I can see eplenmean? Or is it only for PPO2 models?

If you read the first answer, you need to use a Monitor wrapper. This will be done automatically if you use the rl zoo.

@hill-a
How is the episode reward calculated in TensorBoard? Is it calculated the same way as you described (you would just need to create a list of the episode rewards, accumulate said rewards, and return average rewards + average episode length)?
