Stable-baselines: Plotting a trained agent

Created on 24 Jan 2020 · 4 Comments · Source: hill-a/stable-baselines

I have built a custom environment by implementing the step, reset and render methods for Stable Baselines, but I don't know how to produce plots that show how the model is behaving.

For instance, how many times my agent (in a Discrete action space) took action = 0, 1, 2, etc.
What signal the environment gave.
How the rewards moved at each timestep.

I found out about results_plotter but couldn't find much information on it.

results_plotter.plot_results(["."], 10e6, results_plotter.X_TIMESTEPS, "Market rewards")

In more detail, my training looks like this:

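import numpy as np
from stable_baselines import A2C
from stable_baselines.bench import Monitor
from stable_baselines.common.vec_env import DummyVecEnv

def evaluate(model, num_episodes=100):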
    """
    Evaluate a RL agent
    :param model: (BaseRLModel object) the RL Agent
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float) Mean reward for the last num_episodes
    """
    # This function will only work for a single Environment
    env = model.get_env()
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        states = model.initial_state  # get the initial state vector for the recurrent network
        while not done:
            # states are only useful when using LSTM policies;
            # they are passed back in so a recurrent policy keeps its memory
            action, states = model.predict(obs, states)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward

env = CustomTradingEnvironment(stock_rates, client_amounts, client_actions)
env = Monitor(env, filename='CustomTrading.log', allow_early_resets=True)
# The algorithms require a vectorized environment to run
env = DummyVecEnv([lambda: env])

model_a2c = A2C('MlpPolicy', env, gamma = gam, verbose=1)
model_a2c.learn(total_timesteps=len(spot_rates.columns)-1)

And then I evaluate the model like so.

evaluate(model_a2c)

I don't understand from the docs how I can plot the aforementioned metrics.

question

ppanteliadis

All 4 comments

There is no pre-made tool for this at the moment. Your best bet is to create a Wrapper for recording this kind of information. Take a look at the Monitor wrapper and how it tracks the episodic rewards. Tracking state-action-reward pairs should be a trivial change. Note that this will do the tracking per environment, not per agent.
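
A minimal sketch of that wrapper idea, under some assumptions: StepRecorder and ToyEnv are hypothetical names, the action space is Discrete, and step returns the four-value tuple used in the question. To stay free of dependencies it delegates by hand instead of subclassing gym.Wrapper, which would be the more conventional base class.

```python
from collections import Counter


class StepRecorder:
    """Hypothetical wrapper: logs each action and reward that passes through step()."""

    def __init__(self, env):
        self.env = env
        self.actions = []
        self.rewards = []

    def __getattr__(self, name):
        # Fallback for attributes not defined here (action_space, render, ...).
        if name == "env":
            raise AttributeError(name)
        return getattr(self.env, name)

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.actions.append(int(action))    # assumes a Discrete action space
        self.rewards.append(float(reward))  # per-step reward, in order
        return obs, reward, done, info

    def action_counts(self):
        """How many times each discrete action was taken."""
        return Counter(self.actions)


class ToyEnv:
    """Stand-in environment so the sketch runs alone: 3-step episodes, reward equals the action."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, float(action), self.t >= 3, {}


env = StepRecorder(ToyEnv())
env.reset()
for a in [0, 2, 2]:
    obs, reward, done, info = env.step(a)

print(env.action_counts())  # action histogram
print(env.rewards)          # reward at each timestep
```

In the setup from the question, a recorder like this would presumably wrap the custom environment before it is handed to DummyVecEnv, and the collected lists could then be passed to any plotting library.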

Miffyli on 24 Jan 2020

> There is no pre-made tool for this at the moment. Your best bet is to create a Wrapper for recording this kind of information. Take a look at the Monitor wrapper and how it tracks the episodic rewards. Tracking state-action-reward pairs should be a trivial change. Note that this will do the tracking per environment, not per agent.

This is a bit weird.
For instance, the metrics printed during training when verbose=1 are a good indicator to begin with, but they are not easy to parse.
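
For per-episode numbers, the file written by the Monitor wrapper may be easier to parse than the console output. The sketch below assumes that file starts with a '#'-prefixed JSON header line followed by CSV columns r (episode reward), l (episode length) and t (elapsed seconds). The sample values are made up and read_monitor is a hypothetical helper, not part of the library.

```python
import csv
import io

# Made-up sample in the layout the Monitor file is assumed to have.
sample = """#{"t_start": 0.0, "env_id": null}
r,l,t
1.5,10,0.2
-0.5,12,0.5
2.0,8,0.9
"""


def read_monitor(fileobj):
    """Return a list of (reward, length, time) tuples from a monitor-style CSV."""
    # Skip the '#' header line, then let DictReader pick up the r,l,t columns.
    lines = (line for line in fileobj if not line.startswith("#"))
    return [(float(row["r"]), int(row["l"]), float(row["t"]))
            for row in csv.DictReader(lines)]


episodes = read_monitor(io.StringIO(sample))
rewards = [r for r, _, _ in episodes]
mean_reward = sum(rewards) / len(rewards)
print(rewards)
print(mean_reward)
```

With a real run, the open file handle of the monitor file would replace the io.StringIO sample, and the rewards list could be plotted against episode index or against t.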

ppanteliadis on 24 Jan 2020

Depending on the algorithm, its own log prints (with verbose enabled) might be enough for the values you want to track. However, these prints depend on the algorithm, and not all of them are as thorough.

If you want per-step logging (as I understood from your initial comment), writing a new Wrapper provides the best access. I recommend taking a look at Tensorboard too, as it might provide the info you need.

Miffyli on 24 Jan 2020

@Miffyli is right, a gym Wrapper is the way to go. You can learn more about them in our RL tutorial ;)