I successfully implemented PPO2 with MlpPolicy with two different custom environments I built. Now I want to extend to MlpLstmPolicy in one of my games.
I tried to understand the MlpLstmPolicy by reading the source code but it's a bit involved. So several questions:
Follow-up question on this: if the answer to the second question is no, is there any way to achieve this? Concretely, I want my agent to come up with paths that are vastly different from those in previous games (measured quantitatively by correlation). Implementing curiosity might help, but it does not directly learn to find paths distinct from those of previous games.
What role does the variable nminibatches play in training? Does it only affect the training speed?
I tried replacing MlpPolicy with MlpLstmPolicy in my game directly without changing anything, and it appears that the learning is much worse - even after many more learning steps, the reward is far worse than that learnt with MlpPolicy. Are there general tips to using MlpLstmPolicy / necessary modifications when switching from MlpPolicy to MlpLstmPolicy?
Thanks a million in advance!
1) The LSTM only remembers the past _inside a single game_; it does not remember anything from outside that episode.
2) nminibatches specifies the number of minibatches to use when updating the policy on gathered samples. E.g. if you have 1000 samples gathered in total and nminibatches=4, it will split samples into four minibatches of 250 elements and do parameter updates on these batches noptepochs times.
3) LSTMs are generally harder to train than non-recurrent networks (more parameters, gradients are dependent over multiple timesteps, etc etc), and the implementation here is probably not one of the best (see e.g. R2D2 paper on research on this). I would run it at least 5x longer than non-recurrent version to see when/if the learning starts to happen later.
If you feel something in docs was not clear on these questions, please point them out so we can fix these :)
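To make the arithmetic in point 2 concrete, here is a minimal sketch of how gathered samples could be divided into minibatches and how many parameter updates result. The helpers `split_into_minibatches` and `count_updates` are hypothetical names made up for illustration; they are not part of stable-baselines and only mirror the description above.

```python
def split_into_minibatches(n_samples, nminibatches):
    """Return (start, end) index pairs for equally sized minibatches.

    Hypothetical helper for illustration only, not library code.
    """
    if n_samples % nminibatches != 0:
        raise ValueError("n_samples must be divisible by nminibatches")
    size = n_samples // nminibatches
    return [(i * size, (i + 1) * size) for i in range(nminibatches)]


def count_updates(n_samples, nminibatches, noptepochs):
    """Each epoch performs one parameter update per minibatch."""
    return len(split_into_minibatches(n_samples, nminibatches)) * noptepochs


batches = split_into_minibatches(1000, 4)
print(batches)                    # [(0, 250), (250, 500), (500, 750), (750, 1000)]
print(count_updates(1000, 4, 4))  # 16
```

So nminibatches changes the size of each gradient step and the number of steps per rollout, not only the wall-clock speed. As far as I know, with recurrent policies in PPO2 the number of parallel environments also has to be a multiple of nminibatches, which is why a single environment is paired with nminibatches=1.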
Thanks a lot @Miffyli!
Do you have any possible suggestions for the follow up question too? Thanks again!
A quick idea that comes to mind is simply extending one episode to span multiple game episodes. E.g. the episode does not end when the player dies, but only after the player has died e.g. 10 times. RNNs could then store memory between these games. However, I do not know if this would help with your problem at all. Things like curiosity sound better if you want different paths from different games.
A possibly tangentially related paper to this could be MERLIN where the agent learns to remember where the goal is in the level, and then can navigate to it easily.
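The "one episode spans several games" idea could be sketched roughly as below. This is only an illustration under assumptions: `MultiGameEpisode` and the toy `ThreeStepGame` are hypothetical names, the environment is any object with gym-style `reset()` and `step()` methods, and in a real project the wrapper would more naturally subclass `gym.Wrapper`.

```python
class MultiGameEpisode:
    """Report done=True only after `games_per_episode` inner games have ended."""

    def __init__(self, env, games_per_episode=10):
        self.env = env
        self.games_per_episode = games_per_episode
        self.games_played = 0

    def reset(self):
        self.games_played = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:
            self.games_played += 1
            if self.games_played < self.games_per_episode:
                # Start the next game, but hide the boundary from the agent
                # so a recurrent policy keeps its internal state.
                obs = self.env.reset()
                done = False
        return obs, reward, done, info


class ThreeStepGame:
    """Toy stand-in for a real game: every game lasts exactly 3 steps."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t == 3, {}


wrapped = MultiGameEpisode(ThreeStepGame(), games_per_episode=2)
wrapped.reset()
dones = [wrapped.step(0)[2] for _ in range(6)]
print(dones)  # [False, False, False, False, False, True]
```

Because the wrapper never signals `done` between the inner games, the training code would not reset the LSTM state there, which is the whole point of the trick.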
Thanks a lot @Miffyli ! I will read about the paper and see if it helps!
In addition, I am trying a very simple custom environment to test the LSTM policy, however, it fails to do what I expected should be easy for an agent with LSTM policy.
In particular, there are only two possible actions, 0 or 1. And the observation is always 0.
With a memory, the agent should easily figure out the steps required to achieve full score, but both MlpLstmPolicy and MlpLnLstmPolicy are unable to reach the optimal result (and are in fact worse than plain MlpPolicy, even though they are trained with more episodes).
I thought it might be because the memory only remembers the states seen, but the LSTM agent still fails to solve these after changing the observation to the previous action.
import gym
import matplotlib.pyplot as plt
import numpy as np
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# Each entry is the exact action sequence the agent has to reproduce.
solns = [
    [0] + [1],
    ([0] + [1]) * 2,
    ([0] + [1]) * 3,
    [0] * 2 + [1] * 2,
    [0] * 3 + [1] * 3,
    [0] * 4 + [1] * 4,
]

for soln in solns:
    print(f'For soln = {soln}')

    class CustomEnv(gym.Env):
        def __init__(self):
            super(CustomEnv, self).__init__()
            self.action_space = gym.spaces.Discrete(2)
            self.observation_space = gym.spaces.Discrete(1)
            self.step_count = 0
            self.soln = soln

        def step(self, action):
            if action != self.soln[self.step_count]:
                done = True
                reward = -10
            else:
                done = False
                reward = 1
            self.step_count += 1
            if self.step_count == len(self.soln):
                done = True
                reward = 100
            # The observation is always 0.
            return 0, reward, done, {}

        def reset(self):
            self.step_count = 0
            return 0

        def render(self, mode='human'):
            pass

        def close(self):
            pass

    env0 = CustomEnv()
    env = DummyVecEnv([lambda: env0])  # The algorithms require a vectorized environment to run
    mlp_model = PPO2(MlpPolicy, env)
    lstm_model = PPO2(MlpLstmPolicy, env, nminibatches=1)
    mlp_model.set_env(env)
    lstm_model.set_env(env)

    def evaluate(model, num_games=100):
        """
        For DummyVecEnv only.
        """
        episode_rewards = [0.0]
        obs = env.reset()
        all_rewards = []
        for i in range(num_games):
            temp_reward = 0
            dones = [False]
            while dones[0] == False:
                action, _states = model.predict(obs)
                obs, rewards, dones, info = env.step(action)
                temp_reward += rewards[0]
            obs = env.reset()
            all_rewards.append(temp_reward)
        # print("Mean reward:", np.mean(np.array(all_rewards)))
        return np.mean(np.array(all_rewards))

    # Alternate short training runs with evaluation.
    mlp_mean_reward = []
    for i in range(100):
        mlp_model.learn(total_timesteps=200)
        mlp_mean_reward.append(evaluate(mlp_model, num_games=50))
    print(f'For mlp, mean reward is {np.mean(np.array(mlp_mean_reward[-50:])):.2f}')
    plt.plot(mlp_mean_reward)
    plt.show()

    lstm_mean_reward = []
    for i in range(1000):
        lstm_model.learn(total_timesteps=200)
        lstm_mean_reward.append(evaluate(lstm_model, num_games=50))
    print(f'For lstm, mean reward is {np.mean(np.array(lstm_mean_reward[-50:])):.2f}')
    plt.plot(lstm_mean_reward)
    plt.show()
Outputs:
For soln = [0, 1]
For mlp, mean reward is 99.76
For lstm, mean reward is 101.00
For soln = [0, 1, 0, 1]
For mlp, mean reward is 6.69
For lstm, mean reward is -9.00
For soln = [0, 1, 0, 1, 0, 1]
For mlp, mean reward is -5.43
For lstm, mean reward is -8.08
For soln = [0, 0, 1, 1]
For mlp, mean reward is 7.98
For lstm, mean reward is -7.96
For soln = [0, 0, 0, 1, 1, 1]
For mlp, mean reward is -7.00
For lstm, mean reward is -7.00
For soln = [0, 0, 0, 0, 1, 1, 1, 1]
For mlp, mean reward is -6.01
For lstm, mean reward is -6.00
Am I missing something trivial?
Looking at your code, it seems you do not run the LSTM policies for longer, but rather repeat the experiment 10x more often than with MlpPolicy. Try training both policies for e.g. 1000 or even 10000 steps to see if that makes a difference. I would also try a slightly more classical recurrence-testing environment, where the agent has to recall the value seen N steps ago. It _should_ be able to solve that one eventually with enough training, but with all the nuisance factors of RL algorithms it may have a hard time doing so. As far as I know, the current LSTM implementation has been useful in robotics/locomotion tasks.
I made the game easier, i.e. the target is the observation from 2 steps ago, so that there is feedback towards the solution (reward = -np.abs(action - self.soln[self.step_count - 2])). I also trained the models for 5000 steps instead. However, the LSTM versions still do not outperform the plain MLP version.
I also corrected the evaluation code for LSTM, feeding in the internal state to the agent (It may be a good idea to add some documentation to the code - I only figured out that is needed after reading through the underlying code).
I tried both PPO2 and A2C as the agent but the results are the same.
I must be missing something and I'd really appreciate some help / insight.
Thanks in advance!
import gym
import matplotlib.pyplot as plt
import numpy as np
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv

action_space_size = 10
observation_space_size = 10
soln_length = 1000


class CustomEnv(gym.Env):
    def __init__(self):
        self.action_space = gym.spaces.Discrete(action_space_size)
        self.observation_space = gym.spaces.Discrete(observation_space_size)
        self.step_count = 0

    def step(self, action, done = False):
        if self.step_count < 2:
            reward = 0
        else:  # step_count>=2
            # The best action is the observation shown 2 steps ago.
            reward = -np.abs(action - self.soln[self.step_count - 2])
        self.step_count += 1
        if self.step_count == soln_length - 1:
            done = True
        return self.soln[self.step_count], reward, done, {}

    def reset(self):
        self.step_count = 0
        self.soln = np.random.randint(action_space_size, size = soln_length)
        return self.soln[0]


env0 = CustomEnv()
env = DummyVecEnv([lambda: env0])  # The algorithms require a vectorized environment to run
mlp_model = PPO2(MlpPolicy, env)
lstm_model = PPO2(MlpLstmPolicy, env, nminibatches=1)
lnlstm_model = PPO2(MlpLnLstmPolicy, env, nminibatches=1)
mlp_model.set_env(env)
lstm_model.set_env(env)
lnlstm_model.set_env(env)


def evaluate(model, num_games=100):
    """
    For DummyVecEnv only.
    """
    episode_rewards = [0.0]
    obs = env.reset()
    all_rewards = []
    for i in range(num_games):
        temp_reward = 0
        dones = [False]
        while dones[0] == False:
            action, _states = model.predict(obs)
            obs, rewards, dones, info = env.step(action)
            temp_reward += rewards[0]
        obs = env.reset()
        all_rewards.append(temp_reward)
    return np.mean(np.array(all_rewards))


def evaluate_lstm(model, num_games=100):
    """
    For DummyVecEnv, LSTM only.
    """
    episode_rewards = [0.0]
    obs = env.reset()
    all_rewards = []
    for i in range(num_games):
        # Start every game with a fresh internal state.
        _states = None
        temp_reward = 0
        dones = [False]
        while dones[0] == False:
            # Feed the previous internal state back into the policy.
            action, _states = model.predict(obs,
                                            state=_states,
                                            mask=dones)
            obs, rewards, dones, info = env.step(action)
            temp_reward += rewards[0]
        obs = env.reset()
        all_rewards.append(temp_reward)
    return np.mean(np.array(all_rewards))


mlp_mean_reward = []
for i in range(100):
    mlp_model.learn(total_timesteps=soln_length*5)
    mlp_mean_reward.append(evaluate(mlp_model, num_games=50))
print(f'For mlp, mean reward is {np.mean(np.array(mlp_mean_reward[-50:])):.2f}')

lstm_mean_reward = []
for i in range(100):
    print(i)
    lstm_model.learn(total_timesteps=soln_length*5)
    lstm_mean_reward.append(evaluate_lstm(lstm_model, num_games=50))
print(f'For lstm, mean reward is {np.mean(np.array(lstm_mean_reward[-50:])):.2f}')
plt.plot(lstm_mean_reward)
plt.show()

lnlstm_mean_reward = []
for i in range(100):
    lnlstm_model.learn(total_timesteps=soln_length*5)
    lnlstm_mean_reward.append(evaluate_lstm(lnlstm_model, num_games=50))
print(f'For lnlstm, mean reward is {np.mean(np.array(lnlstm_mean_reward[-50:])):.2f}')
plt.plot(lnlstm_mean_reward)
plt.show()
Output:
For mlp, mean reward is -2494.52
For lstm, mean reward is -2490.98
For lnlstm, mean reward is -2492.31
Or, more simply, does anyone have any example code of applying MlpLstmPolicy to custom environment that I may refer to? I feel that I must be missing something trivial...
Hmm looking over the code it does seem right, sorry for not spotting the lack of states in the first one ^^'. One last thing I would try is even longer training, but apart from that I am out of ideas.
@araffin
RNNs in RL are not my forté. Is this expected behavior, and/or should LSTMs be tested in a different environment?
Is this expected behavior, and/or should LSTMs be tested in a different environment?
I don't have much time to invest into that issue but if I get it right, there is a question on how to get better performances with LSTM?
My answer would be hyperparameter optimization (cf doc).
Otherwise, I don't have much experience in RL + RNN (when I used RNN, it was mostly when doing NLP), usually stacking frames (history of observations) + MLP is enough...
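For reference, "stacking frames" could look roughly like the sketch below: the last few observations are kept in a fixed-size buffer and handed to a plain MLP policy as one combined observation. `ObservationHistory` is a hypothetical helper written for illustration (padding with zeros is an assumption of the sketch); stable-baselines also ships a `VecFrameStack` wrapper for vectorized environments that serves this purpose.

```python
from collections import deque


class ObservationHistory:
    """Keep the last `n_stack` observations as one stacked observation."""

    def __init__(self, n_stack, fill=0):
        self.n_stack = n_stack
        self.fill = fill
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_obs):
        # Pad the history with a filler value at the start of an episode.
        self.frames.clear()
        for _ in range(self.n_stack - 1):
            self.frames.append(self.fill)
        self.frames.append(first_obs)
        return tuple(self.frames)

    def push(self, obs):
        # The deque drops the oldest observation automatically.
        self.frames.append(obs)
        return tuple(self.frames)


history = ObservationHistory(n_stack=3)
print(history.reset(7))  # (0, 0, 7)
print(history.push(4))   # (0, 7, 4)
print(history.push(9))   # (7, 4, 9)
```

With a stack of 3, the oldest entry is always the observation from 2 steps ago, so in the recall task posted above a non-recurrent policy would have the required information directly in its input.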
@araffin Noted. Question was on how to test the current LSTM implementation if it works right, and so far there was trouble to solve a simple recall environment. I hazard a guess the issue lies in the problems brought when you throw together RNNs and RL algorithms
@matthew-hsr
I know the LSTM implementation has improved performance in some RL environments (video games), so the problem may lie in hyperparameter tuning. You could look at rl-zoo for parameters and hints that could work (afaik it does not include LSTM runs). Edit: See the reply below this one.
Question was on how to test the current LSTM implementation if it works right, and so far there was trouble to solve a simple recall environment.
@Miffyli
We have a test for that ;)
https://github.com/hill-a/stable-baselines/blob/master/tests/test_lstm_policy.py#L43 (see PR https://github.com/hill-a/stable-baselines/pull/244)
@matthew-hsr @araffin @Miffyli does the observation space need to contain information about past time steps when using any of the policies with an Lstm? For instance, in the example by @matthew-hsr, the observation space only contains 1 time step of information, i.e. the state at time t. Now, I am wondering if instead it should contain e.g. the last 10 time steps of information, in this context a vector with values from t-10 to t.
When training an LSTM in e.g. Keras, the input X generally takes the shape [number of observations, number of time steps per observation, number of features per observation]. In other words, every observation is a matrix with dimensions [number of time steps per observation, number of features per observation], meaning that every observation contains several time steps of information. Hence, I am wondering if this applies to the context of RL as well or if, instead, every observation should only contain 1 time step of information.
I was not sure if I should create a new issue for this question (in addition to this one here and this one: https://github.com/hill-a/stable-baselines/issues/667). I thought it might fit here because there is an example already provided. In any case, it would be great to have some information in the documentation about what the observation space for recurrent policies should look like.
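The thread does not settle this question, so the following is only a toy sketch of the distinction being asked about. It uses a made-up scalar recurrence (`cell`), not a real LSTM, to show that feeding a whole window at once and feeding one time step at a time while carrying the hidden state forward compute the same thing. The second form matches the `model.predict(obs, state=..., mask=...)` call used in the evaluation code earlier in this thread, where each observation holds a single time step.

```python
def cell(x, h):
    """Toy recurrent cell: the new state mixes the old state and the input."""
    return 0.5 * h + x


def run_window(xs, h=0.0):
    """Supervised style: the whole window of time steps is one input."""
    for x in xs:
        h = cell(x, h)
    return h


def run_stepwise(xs):
    """RL style: one observation per call, the state is carried by the caller."""
    h = 0.0
    for obs in xs:
        h = cell(obs, h)  # only the current time step is passed in
    return h


window = [1.0, 2.0, 3.0]
print(run_window(window))    # 4.25
print(run_stepwise(window))  # 4.25
```

Under that reading, the memory of earlier steps lives in the carried state rather than in the observation, so the observation itself would not need to contain a history window.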
The only maintainer that worked a bit with LSTM is @erniejunior , see https://github.com/hill-a/stable-baselines/issues/278 and https://github.com/hill-a/stable-baselines/issues/158
Looking at all the issues about LSTM: https://github.com/hill-a/stable-baselines/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+LSTM I think we need to gather them and close the duplicated ones.