I have built a custom environment by implementing the step, reset and render methods from StableBaselines, but I don't know how to produce plots that show how the model is behaving.
I found out about results_plotter but couldn't find much information on it.
results_plotter.plot_results(["."], 10e6, results_plotter.X_TIMESTEPS, "Market rewards")
In more detail, my training looks like this:
def evaluate(model, num_episodes=100):
    """
    Evaluate a RL agent
    :param model: (BaseRLModel object) the RL Agent
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float) Mean reward for the last num_episodes
    """
    # This function will only work for a single Environment
    env = model.get_env()
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = env.reset()
        states = model.initial_state  # get the initial state vector for the recurrent network
        while not done:
            # _states are only useful when using LSTM policies
            action, _states = model.predict(obs, states)
            # here, action, rewards and dones are arrays
            # because we are using vectorized env
            obs, reward, done, info = env.step(action)
            episode_rewards.append(reward)
        all_episode_rewards.append(sum(episode_rewards))
    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)
    return mean_episode_reward
env = CustomTradingEnvironment(stock_rates, client_amounts, client_actions)
env = Monitor(env, filename='CustomTrading.log', allow_early_resets=True)
# The algorithms require a vectorized environment to run
env = DummyVecEnv([lambda: env])
model_a2c = A2C('MlpPolicy', env, gamma = gam, verbose=1)
model_a2c.learn(total_timesteps=len(spot_rates.columns)-1)
And then I evaluate the model like so.
evaluate(model_a2c)
I don't understand from the docs how I can plot the aforementioned metrics.
There is no pre-made tool for this at the moment. Your best bet is to create a Wrapper for recording this kind of information. Take a look at the Monitor wrapper and how it tracks the episodic rewards. Tracking state-action-reward pairs should be a trivial change. Note that this will do the tracking per environment, not per agent.
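To make the wrapper idea concrete, here is a minimal sketch of a recorder that stores one (observation, action, reward) record per step, grouped by episode. The names `StepRecorder` and `CountdownEnv` are made up for illustration and are not part of Stable Baselines or gym. It is written as a plain class so it runs without gym installed; in a real project you would probably subclass `gym.Wrapper` instead, and it assumes the classic 4-tuple `step` return used in the code above.

```python
class StepRecorder:
    """Hypothetical recorder: logs (obs, action, reward) for every env step."""

    def __init__(self, env):
        self.env = env
        self.episodes = []      # finished episodes, each a list of step records
        self._current = []      # records of the episode in progress
        self._last_obs = None   # observation the next action will be based on

    def reset(self, **kwargs):
        if self._current:       # keep steps from an episode cut short by reset
            self.episodes.append(self._current)
        self._current = []
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._current.append(
            {"obs": self._last_obs, "action": action, "reward": reward}
        )
        self._last_obs = obs
        if done:
            self.episodes.append(self._current)
            self._current = []
        return obs, reward, done, info

    def __getattr__(self, name):
        # Forward anything else (action_space, render, ...) to the wrapped env
        if name == "env":
            raise AttributeError(name)
        return getattr(self.env, name)


class CountdownEnv:
    """Hypothetical toy env: episode ends after 3 steps, reward equals the action."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, float(action), self.t >= 3, {}


env = StepRecorder(CountdownEnv())
env.reset()
for a in [1, 2, 3]:
    env.step(a)

# One total reward per recorded episode, ready to be plotted
episode_returns = [sum(s["reward"] for s in ep) for ep in env.episodes]
print(episode_returns)
```

After training, `env.episodes` holds everything needed for plotting, for example the per-episode returns computed at the end. As noted above, this records per environment, so with several environments each one would need its own recorder.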
This is a bit weird.
For instance, the metrics printed during training with verbose=1 are a good indicator to begin with, but they are not easily parsable.
Depending on the algorithm, its own log prints (with verbose) might be enough for the values you want to track. However, these prints depend on the algorithm, and not all of them are as thorough.
If you want per-step logging (as I understood from your initial comment), writing a new Wrapper gives you the best access. I also recommend taking a look at TensorBoard, as it might provide the info you need.
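If the console prints are too awkward to parse, the file written by the Monitor wrapper may be easier to work with. As far as I know it is a CSV with one metadata line starting with `#`, followed by a header `r,l,t` (episode reward, episode length, elapsed time). The sketch below assumes that layout; `read_monitor_rewards` and `moving_average` are hypothetical helper names, and the sample data is made up rather than real training output.

```python
import csv
import io


def read_monitor_rewards(fileobj):
    """Return episode rewards and cumulative timesteps from a Monitor-style CSV."""
    first = fileobj.readline()      # metadata line, expected to start with '#'
    if not first.startswith("#"):
        fileobj.seek(0)             # no metadata line: rewind and parse everything
    reader = csv.DictReader(fileobj)
    rewards, timesteps, total = [], [], 0
    for row in reader:
        total += int(row["l"])      # episode length adds to the timestep count
        rewards.append(float(row["r"]))
        timesteps.append(total)
    return rewards, timesteps


def moving_average(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Made-up sample in the assumed layout
sample = io.StringIO(
    '#{"t_start": 0.0, "env_id": null}\n'
    "r,l,t\n"
    "1.0,10,0.5\n"
    "3.0,20,1.1\n"
    "5.0,10,1.6\n"
)
rewards, timesteps = read_monitor_rewards(sample)
smoothed = moving_average(rewards, window=2)
```

With a real log you would pass an opened file instead of the `StringIO` sample and plot `timesteps` against `smoothed` with any plotting library. Check the actual file name on disk first, since Monitor may append its own extension to the `filename` you pass.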
@Miffyli is right, a gym Wrapper is the way to go. You can learn more about them in our RL tutorial ;)